In a previous tutorial I wrote about scraping server-side rendered HTML content. Many modern websites however are rendered client-side so a different approach to scraping them is required.
Enter Puppeteer, a Node.js library for running a headless Chrome browser. It allows us to scrape content from a URL after the page has been rendered, just as it would be in a standard browser.
Before beginning you’ll need to have Node.js installed.
Let’s get started by creating a project folder, initialising the project and installing the required dependencies by running the following commands in a terminal:
```shell
mkdir scraper
cd scraper
npm init -y
npm install puppeteer cheerio
```
cheerio – an implementation of core jQuery designed specifically for the server. It makes selecting elements from the DOM easier, as we can use the familiar jQuery syntax.
Next create a new file called scrape.js and load in the dependencies:
fs – a Node.js module for interacting with the file system, which we'll use to save the scraped data into a JSON file.
Then add a getData() function in which we'll launch a browser using Puppeteer, fetch the contents of a URL and call a processData() function to process the page content:
With the page content scraped, let's set up the processData() function. Here we use cheerio to select only the content we require (username, post title and number of votes):
This code loops through each of the .Post elements, grabs the data we specified (Reddit doesn't use human-readable class names, hence the long strings of random characters), and pushes it into an array. Once each of the posts has been processed, a data.json file is created using fs.writeFileSync. You can now run the script using node scrape.js. It'll take a little while to complete; once finished, browse to the project folder and you'll see the data.json file complete with data.