How to download the HTML code of a website with Node.js?

I have an array of links like this one:
['example1.com', 'example2.com', 'example3.com', 'example4.com', 'example5.com']

I want to get only the HTML source code of all 5 links using Node.js. How can I do that?

I think you’ll need to look up a technique called web scraping.

You should be able to adapt this to your needs.


W3Schools is a good place to start. The site has some great tutorials for Web technologies which are easy to follow.

It can be. But it is quite outdated on some topics. I’m not sure it is a good choice for Node.

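If the pages are static HTML and you just need to download the HTML files, you can use an HTTP client such as the npm library request. Note that request has since been deprecated; Node.js 18 and later include a built-in fetch function that covers this case.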
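
For the static case, here is a minimal sketch that uses the fetch function built into Node.js 18 and later instead of a third-party library. The names toUrl, downloadHtml, and downloadAll are made up for illustration, and prepending https:// to the bare hostnames from the question is an assumption about how those sites are served.

```javascript
// Assumes Node.js 18+, where fetch is available as a global.

// The links in the question have no scheme, so add one (assumption: https).
function toUrl(link) {
  return /^https?:\/\//i.test(link) ? link : `https://${link}`;
}

// Download one page and resolve with its HTML as a string.
// fetchImpl can be swapped out, e.g. for a stub when no network is available.
async function downloadHtml(link, fetchImpl = fetch) {
  const response = await fetchImpl(toUrl(link));
  if (!response.ok) {
    throw new Error(`Request for ${link} failed with status ${response.status}`);
  }
  return response.text();
}

// Download all pages in parallel; one failing link does not abort the others.
async function downloadAll(links, fetchImpl = fetch) {
  const results = await Promise.allSettled(
    links.map((link) => downloadHtml(link, fetchImpl))
  );
  return results.map((result, i) => ({
    link: links[i],
    html: result.status === 'fulfilled' ? result.value : null,
    error: result.status === 'rejected' ? result.reason.message : null,
  }));
}

// Usage:
// downloadAll(['example1.com', 'example2.com']).then((pages) => {
//   for (const page of pages) {
//     console.log(page.link, page.html ? page.html.length : page.error);
//   }
// });
```

Promise.allSettled is used here so that a single unreachable site still leaves you with the HTML of the other links.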

If it is dynamic HTML, generated by a library or written at runtime, you need a browser automation tool like selenium-webdriver, which drives a real browser. You can then extract the innerHTML from the <html> element. Another solution would be to use a mocked DOM, like JSDOM.

Once the page is loaded in selenium, the extraction of the HTML looks like this:

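const { By, until } = require('selenium-webdriver')

// This fragment runs inside an async function. `driver` is an already-built
// WebDriver that has navigated to the page. The element ids belong to the
// example application; replace them with ids from your own page.
await driver.findElement(By.id('react-application-root'))
const searchFormButton = await driver.findElement(By.id('search-form-button'))
await searchFormButton.click()

// we wait up to 5 seconds for the search results to be displayed
const timeoutToDisplaySearchResults = 5000
await driver.wait(until.elementLocated(By.id('tab0')), timeoutToDisplaySearchResults)

// innerHTML omits the <html> tag itself, so wrap it again to get a full document
const htmlElement = await driver.findElement(By.tagName('html'))
const htmlElementinnerHTML = await htmlElement.getAttribute('innerHTML')
const fullHtml = '<!DOCTYPE HTML>\n<html lang="en">\n' + htmlElementinnerHTML + '</html>'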
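
The snippet above rebuilds the <html> tag by hand with a fixed lang="en". As a possible variation (not the approach from the post above), you could ask the browser for the outerHTML of the root element so that its real attributes are kept. This is only a sketch: getRenderedHtml is a made-up helper name, and the HTML5 doctype it prepends is an assumption about the page.

```javascript
// `driver` is expected to be a selenium-webdriver WebDriver instance
// that has already loaded the page and finished waiting for its content.
async function getRenderedHtml(driver) {
  // executeScript runs the script in the browser and resolves with its return value
  const outerHtml = await driver.executeScript(
    'return document.documentElement.outerHTML'
  );
  // outerHTML does not include the doctype, so add one (assumed to be HTML5)
  return '<!DOCTYPE html>\n' + outerHtml;
}
```

You would call it as const fullHtml = await getRenderedHtml(driver) once the wait in the previous snippet has completed.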
