Scraping of webpages is really simple and elegant with Puppeteer. Let's try to scrape Codesnacks and get all the links on the page with anchor and text.
We can easily do this using puppeteer. There's no need to fetch the data first and parse it. You can just let puppeteer visit the page and run your own Javascript in the context of the page. The best way to do this is to first run it in the console of your browser and the just copy it to the code if you made sure everything works as planned.
// npm i puppeteer
const puppeteer = require("puppeteer");
// we're using async/await - so we need an async function, that we can run
const run = async () => {
// open the browser and prepare a page
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the page to scrape
await page.goto("https://codesnacks.net");
// execute the JS in the context of the page to get all the links
const links = await page.evaluate(() =>
// let's just get all links and create an array from the resulting NodeList
Array.from(document.querySelectorAll("a")).map(anchor => [anchor.href, anchor.textContent])
);
// output all the links
console.log(links);
// close the browser
await browser.close();
};
// run the async function
run();
Before there was puppeteer there were several tools, that you had to stitch together.
- a library to fetch the document (e.g. axios or node-fetch)
- a parser to parse the HTML and access the DOM nodes (e.g. cheerio)
The problem with this approach was, that dynamically rendered pages were even harder to scrape. That's no issue with puppeteer, since it's actually using chrome - just headless.