
Scrape client-side rendered HTML content with JavaScript

By Michael Burrows | Last modified October 6th 2020 | Source code on GitHub

In a previous tutorial I wrote about scraping server-side rendered HTML content. Many modern websites, however, are rendered client-side, so a different approach to scraping them is required.

Enter Puppeteer, a Node.js library for running a headless Chrome browser. This allows us to scrape content from a URL after it has been rendered, just as it would be in a standard browser.

Before beginning you’ll need to have Node.js installed.

Let’s get started by creating a project folder, initialising the project and installing the required dependencies by running the following commands in a terminal:

    mkdir scraper
    cd scraper
    npm init -y
    npm install puppeteer cheerio
  • cheerio – an implementation of core jQuery designed specifically for the server. It makes selecting elements from the DOM easier, as we can use the familiar jQuery syntax.

Next create a new file called scrape.js and load in the dependencies:

    const puppeteer = require("puppeteer");
    const cheerio = require("cheerio");
    const fs = require("fs");
  • fs – a Node.js module for interacting with the file system, which we’ll use to save the scraped data to a JSON file.

Then add a getData() function, in which we’ll launch a browser using Puppeteer, fetch the contents of a URL, and call a processData() function that’ll process the page content:

    async function getData() {
      // Start a headless Chrome instance and open a new tab.
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Load the page and grab the HTML as rendered by the browser.
      await page.goto("https://www.reddit.com/r/webdev/");
      const data = await page.content();
      await browser.close();
      processData(data);
    }

    getData();
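By default page.goto() resolves once the page’s load event has fired, which may be before client-side scripts have finished rendering, and getData() leaves the browser running if any step throws. Here’s a minimal sketch of a sturdier variant, which is not part of the original script. It assumes the content we want only appears once network activity has settled, so it passes the waitUntil: "networkidle2" option, optionally waits for a selector, and closes the browser in a finally block. The fetchRenderedHtml name and its browserLib parameter (expected to be the puppeteer module) are hypothetical, introduced so the function can be tried out without launching Chrome.

```javascript
// Sketch only: `fetchRenderedHtml` and `browserLib` are illustrative names.
// `browserLib` is expected to be the puppeteer module (or a stand-in for it).
async function fetchRenderedHtml(browserLib, url, selector) {
  const browser = await browserLib.launch();
  try {
    const page = await browser.newPage();
    // Resolve only once network activity has mostly settled.
    await page.goto(url, { waitUntil: "networkidle2" });
    if (selector) {
      // Wait until at least one matching element is present in the DOM.
      await page.waitForSelector(selector);
    }
    return await page.content();
  } finally {
    // Runs on success and on failure, so the browser is always closed.
    await browser.close();
  }
}
```

In scrape.js this could replace getData() with something like fetchRenderedHtml(puppeteer, "https://www.reddit.com/r/webdev/", ".Post").then(processData).catch(console.error);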

With the page content scraped, let’s set up the processData() function. Here we use cheerio to select only the content we require (username, post title and number of votes):

    function processData(data) {
      console.log("Processing Data...");
      // Parse the rendered HTML so it can be queried with jQuery-style selectors.
      const $ = cheerio.load(data);
      const posts = [];
      $(".Post").each(function () {
        // Passing `this` as the context limits each lookup to the current post.
        posts.push({
          user: $("._2tbHP6ZydRpjI44J3syuqC", this).text(),
          title: $("._eYtD2XCVieq6emjKBH3m", this).text(),
          votes: $("._1E9mcoVn4MYnuBQSVDt1gC", this).first().text(),
        });
      });
      // Save the collected posts as JSON in the project folder.
      fs.writeFileSync("data.json", JSON.stringify(posts));
      console.log("Complete");
    }

This code loops through each of the .Post elements, grabs the data we specified (Reddit doesn’t use human-readable class names, hence the long strings of random characters), and pushes it to a posts array.
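Because the selectors are opaque generated class names, it seems safe to assume they could stop matching without warning, in which case cheerio’s .text() returns an empty string rather than throwing. A small check before saving can make that visible. The sketch below is an addition to the tutorial: findIncompletePosts is a hypothetical helper, and the sample array is made-up data standing in for the posts array.

```javascript
// Return the posts that are missing at least one of the expected fields.
// `findIncompletePosts` is an illustrative helper, not part of the original script.
function findIncompletePosts(posts, fields = ["user", "title", "votes"]) {
  return posts.filter((post) =>
    fields.some(
      (field) => typeof post[field] !== "string" || post[field].trim() === ""
    )
  );
}

// Made-up sample data standing in for the scraped `posts` array.
const sample = [
  { user: "u/alice", title: "Example post", votes: "12" },
  { user: "", title: "", votes: "" }, // what a selector that no longer matches yields
];

const incomplete = findIncompletePosts(sample);
console.log(`${incomplete.length} of ${sample.length} posts are incomplete`);
// prints "1 of 2 posts are incomplete"
```

If most posts come back incomplete, the class names in processData() probably need to be looked up again in the browser’s developer tools.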

Once each of the posts has been processed, a data.json file is created using fs.writeFileSync. You can now run the script using node scrape.js. It’ll take a little while to complete; once it has finished, browse to the project folder and you’ll see the data.json file complete with data.
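The votes are saved as the text shown on the page, so to sort or compare posts they need converting to numbers. Here’s a rough sketch of reading data.json back in and doing that. It assumes vote labels look like "523" or "1.2k", which is a guess at the format and not something the scraper guarantees; anything else is treated as unknown. The parseVotes and loadTopPosts names are hypothetical.

```javascript
const fs = require("fs");

// Convert a vote label such as "523" or "1.2k" into a number.
// Assumption: labels use a plain number or a "k" suffix; anything else yields null.
function parseVotes(text) {
  const match = /^(\d+(?:\.\d+)?)(k?)$/i.exec(String(text).trim());
  if (!match) return null;
  const value = parseFloat(match[1]);
  return Math.round(match[2] ? value * 1000 : value);
}

// Load the saved posts and sort them by numeric vote count, highest first.
// Posts with an unknown vote count are placed last.
function loadTopPosts(file) {
  const posts = JSON.parse(fs.readFileSync(file, "utf8"));
  return posts
    .map((post) => ({ ...post, voteCount: parseVotes(post.votes) }))
    .sort((a, b) => (b.voteCount ?? -1) - (a.voteCount ?? -1));
}

// Only runs once the scraper has produced data.json in the current folder.
if (fs.existsSync("data.json")) {
  console.log(loadTopPosts("data.json").slice(0, 5));
}
```

Saved as a separate file in the project folder and run with node after scrape.js has finished, this logs the five highest-voted posts.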
