I maintain a WordPress gallery plugin that uses the lightGallery plugin on the frontend. lightGallery can create a gallery dynamically from an array of image URLs.
I’d like to add functionality that parses the main HTML content on the page (the main body of the article, in this case) and uses a regex to extract any <img> tags, which I would then pass to the backend via AJAX. The backend would query WordPress to find those images in the database and retrieve details about them, such as the copyright info and description. It would then return those details, and the gallery would be initialized with them.
Would using JavaScript to parse the HTML content (which might be several MB) be too slow or resource-intensive for low-end devices to handle?
Probably not, but if your HTML content really is “several MB”, the HTML content itself might be too slow and resource-intensive for those devices.
Keep in mind that the HTML content is just the text and the tags; it does not include the images you load. <img src="this_image_is_20_MB_big.jpg"> has an HTML content weight of 39 bytes (or 78 if you’re using badly formed multibytes). Not KB, not MB, bytes.
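The byte counts above can be checked directly. This is a small sketch that assumes the markup is plain ASCII and reads the 78-byte case as a two-bytes-per-character encoding such as UTF-16. The variable names are only for illustration.

```javascript
// The tag from the paragraph above; the file name is just the example's label.
const tag = '<img src="this_image_is_20_MB_big.jpg">';

// Size when the markup is encoded as UTF-8 (one byte per ASCII character).
const utf8Bytes = new TextEncoder().encode(tag).length;

// Size if the same markup were stored with two bytes per character (UTF-16).
const utf16Bytes = tag.length * 2;

console.log(utf8Bytes, utf16Bytes); // 39 78
```

The image file behind the src attribute never enters this count. Only the characters of the tag do.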
The entire Bible (UTF-8, KJV, no references or footnotes) in plaintext is 4.2 MB [https://www.gutenberg.org/ebooks/10]. If your HTML content is bigger than that… I have questions.
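Rather than guessing, the cost of scanning a Bible-sized page can be measured on the device in question. The sketch below is only an illustration: buildSampleHtml and timeImgScan are hypothetical helpers, the markup is synthetic, and it times a simple regex scan for <img rather than a full HTML parse. Date.now() is coarse, but it is enough to tell "a few milliseconds" from "seconds".

```javascript
// Build a synthetic article: text paragraphs with an <img> after each one.
function buildSampleHtml(targetBytes) {
  const chunk =
    '<p>' + 'Lorem ipsum dolor sit amet. '.repeat(30) + '</p>' +
    '<img src="/uploads/photo.jpg" alt="sample">';
  return chunk.repeat(Math.ceil(targetBytes / chunk.length));
}

// Count "<img" occurrences with a simple scan and report elapsed milliseconds.
function timeImgScan(html) {
  const start = Date.now();
  const count = (html.match(/<img\b/gi) || []).length;
  return { count, ms: Date.now() - start };
}

const sample = buildSampleHtml(4 * 1024 * 1024); // roughly 4 MB of markup
const result = timeImgScan(sample);
console.log(`${sample.length} chars, ${result.count} <img> tags, ${result.ms} ms`);
```

Running this in the browser console of a low-end target device gives a rough figure for that device. Real article markup will behave somewhat differently from this repeated sample.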
Regexes do not parse. Regexes can recognize portions of text, but they do not recognize the entire grammar. Experts advise against using regexes on HTML.
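A small illustration of why: the sketch below runs a typical, naive src-extracting pattern over markup containing one commented-out image and one real image whose alt text includes a ">" character. The pattern, the sample markup, and the helper names srcsByRegex and srcsByDom are all assumptions made for this example. In a browser the page has already been parsed, so querying the DOM is the usual alternative.

```javascript
// A naive pattern of the kind often used to pull src values out of markup.
const IMG_SRC = /<img[^>]*\ssrc="([^"]*)"/gi;

function srcsByRegex(html) {
  return Array.from(html.matchAll(IMG_SRC), (m) => m[1]);
}

// Sample markup: one image inside a comment, one real image.
const html =
  '<p>Intro</p>' +
  '<!-- <img src="old-draft.jpg"> -->' +
  '<img alt="a > b" src="real.jpg">';

// The regex reports the commented-out image and misses the real one,
// because it knows nothing about comments or quoted attribute values.
console.log(srcsByRegex(html)); // logs only old-draft.jpg

// DOM-based alternative: works on anything exposing querySelectorAll,
// such as document or the article's container element in a browser.
function srcsByDom(root) {
  return Array.from(root.querySelectorAll('img'), (img) => img.getAttribute('src'))
    .filter(Boolean); // drop images without a src attribute
}

// Only runs where a DOM is available (a browser, not plain Node.js).
if (typeof document !== 'undefined') {
  console.log(srcsByDom(document.body));
}
```

In the plugin, the root passed to srcsByDom would presumably be the element that wraps the article body rather than document.body.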
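There are compiler generators that are used to build parsers for many languages, including HTML, CSS and PHP. PHP, for example, uses re2c for its lexer. The earliest generators that are still popular are Lex (Flex), which generates the lexer that splits text into tokens, and Yacc (Bison), which generates the parser that checks the tokens against the grammar.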