Let’s start by creating a project folder and initialising it with a package.json file:
mkdir scraper
cd scraper
npm init -y
We’ll be using two packages to build our scraper script.
- axios – Promise-based HTTP client for the browser and Node.js.
- cheerio – Makes it easy to work with the DOM (similar to jQuery).
Install these packages by running the following command:
npm install axios cheerio --save
Next, create a file called scrape.js and import the packages we just installed:
In this example we’ll be using Hacker News as the data source to be scraped.
When inspecting the source code, you’ll see that the site name in the header has the class hnname. Let’s write a test script to see if we can fetch the source code (without being blocked) and grab the site name text.
Add the following to scrape.js to fetch the data and log the text if successful:
Run the script and you should see Hacker News logged in the terminal:
node scrape.js
If everything’s working we can proceed to scrape some actual content from the website.
Let’s get the titles, domains and points for each story on the homepage:
This code loops through each of the athing table rows, grabs the data, and then saves it into an array called stories. If you’ve worked with jQuery before, the selectors used to grab the text will be familiar; if not, you can learn about them here.
Run node scrape.js and you should see the data for each of the stories: