In a previous tutorial I wrote about scraping server-side rendered HTML content. Many modern websites however are rendered client-side so a different approach to scraping them is required.
Enter Puppeteer, a Node.js library for running a headless Chrome browser. It allows us to scrape content from a URL after the page has been rendered, just as it would be in a standard browser.
Before beginning you’ll need to have Node.js installed.
Let’s get started by creating a project folder, initialising the project and installing the required dependencies by running the following commands in a terminal:
```shell
mkdir scraper
cd scraper
npm init -y
npm install puppeteer cheerio
```
cheerio – an implementation of core jQuery designed specifically for the server. It makes selecting elements from the DOM easier, as we can use the familiar jQuery syntax.
Next create a new file called scrape.js and load in the dependencies:
fs – a Node.js module for interacting with the file system, which we'll use to save the scraped data into a JSON file.
Then add a getData() function in which we'll launch a browser using Puppeteer, fetch the contents of a URL and call a processData() function to process the page content:
With the page content scraped, let's set up the processData() function. Here we use cheerio to select only the content we require (username, post title and number of votes):
This code loops through each of the .Post elements, grabs the data we specified (Reddit doesn't use human-readable class names, hence the long strings of random characters), and pushes it into an array. Once each of the posts has been processed, a data.json file is created using fs.writeFileSync. You can now run the script using node scrape.js. It'll take a little while to complete; once finished, browse to the project folder and you'll see the data.json file complete with data.