
How to scrape data from a website

May 28, 2022 | Data science

There are many ways to scrape data from a website. You can do it with almost any programming language out there, but with varying success. Nowadays it’s a bit harder to scrape websites successfully because many of them use advanced web technologies, progressive web apps and whatnot. In other words, it’s not just about parsing static HTML but about getting access to the DOM (Document Object Model) of the webpage, because the page is usually modified interactively: as you interact with it, new pieces of HTML are added and others removed on the fly.

For less dynamic webpages you can even use PHP for scraping. Moreover, some websites offer API access to their data, so it really doesn’t matter what language you use.
If you plan to do data mining, deep learning or anything related to AI and machine learning, the obvious choice is the Python programming language. There are a lot of Python libraries with proven machine learning algorithms, and a few for scraping, like Scrapy. Scrapy is a full framework for crawling websites, including request management, session preservation, output pipelining and more.
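
For instance, if a site exposes a JSON API, plain Node with no scraping library at all is enough. Here is a minimal sketch, assuming a hypothetical endpoint at https://example.com/api/articles:

const https = require('https');

// request the (hypothetical) JSON endpoint and print the parsed result
https.get('https://example.com/api/articles', (res) => {
    let body = '';
    res.on('data', (chunk) => body += chunk);   // collect the response chunks
    res.on('end', () => {
        const articles = JSON.parse(body);      // parse the JSON payload
        console.log(articles);
    });
}).on('error', (err) => console.error(err.message));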

But as I said, websites these days are increasingly turned into apps that run on the client side (in web browsers), so the best technology for scraping is the one that works best with browsers, and that’s of course JavaScript. Moreover, there is an excellent NodeJS module called Puppeteer, which automatically controls an instance of a web browser through the DevTools Protocol. You can run multiple browser instances in headless mode and control everything, from viewport size and proxy settings to interaction with webpage elements and data scraping.
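
To give a taste of that level of control, here is a small sketch of launch options; the proxy address is just a placeholder, and the exact flags you need will depend on your setup:

const puppeteer = require('puppeteer');

(async () => {
    // launch a headless Chromium instance routed through a (placeholder) proxy
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--proxy-server=127.0.0.1:8080']
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });   // behave like a desktop-sized window
    // ... navigate and scrape here ...
    await browser.close();
})();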

Probably the best approach is to combine different tools for different tasks. Especially if you want clean and specific data in your databases, you would want Puppeteer or a similar tool, like Selenium, for the scraping itself. Then you can do the data science processing with tools better suited for that task, such as NumPy, SciKit-Learn, SciPy, TensorFlow etc.

Without further ado, let’s see how to install Puppeteer for data scraping. First you should install NodeJS from nodejs.org if you haven’t already. You can check whether you already have it by opening a command prompt and typing:

# node --version
v14.8.0

which should display the Node version, or an error if Node is not installed.
I won’t go into the Node installation procedure, since it’s operating system specific, but in any case it should be straightforward.
Once you have Node on your system, type:

# mkdir test-scraper
# cd test-scraper
# npm init -y
Wrote to \Users\dtp\repos\test-scraper\package.json:

{
  "name": "test-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}

This just creates a new Node project in its own directory, with index.js as the entry point. The project metadata is stored in package.json, and you should create index.js yourself to hold your scraper’s code. Now install Puppeteer with:

# npm install puppeteer

You could also specify the Puppeteer version to install, to be sure that older code or a Puppeteer plugin/package you might play with while learning still works. So you could type something like this:

# npm install puppeteer@">=8.0.0 <9.0.0"

which means that we want a Puppeteer version greater than or equal to 8.0.0 and less than 9.0.0.
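
The same constraint can also be written with the semver caret shorthand (^8.0.0 is equivalent to >=8.0.0 <9.0.0), so the dependency entry in package.json might look roughly like this:

"dependencies": {
    "puppeteer": "^8.0.0"
}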

Now we are going to write a simple web scraper that will grab CNN’s “screaming” headline, the one at the top of the webpage set in the biggest font.

const puppeteer = require('puppeteer');

(async function() {
    // launch a visible (non-headless) browser, slowed down so we can watch it work
    const browser = await puppeteer.launch({
        headless: false,
        slowMo: 500
    });
    const page = await browser.newPage();
    try {
        await page.goto('https://edition.cnn.com/');
        // wait until the headline element actually appears in the DOM
        await page.waitForSelector('h2.screaming-banner-text');
        // run code inside the page context and return the headline text
        const headline = await page.evaluate(() => {
            return document.querySelector('h2.screaming-banner-text').innerText;
        });
        console.log('Current CNN headline: ', headline);
    }
    catch (err) {
        // a friendly message plus the underlying error, so the app doesn't crash
        console.log('Unable to get headline from CNN website.');
        console.error(err.message);
    }
    await browser.close();
}) ();
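
Save this as index.js in the project directory and run it; the printed headline will of course be whatever CNN is showing at the moment:

# node index.js
Current CNN headline:  ...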

Let’s walk through the code. In the first line we require the Puppeteer package. Then we create an asynchronous self-invoking function:

(async function() {
    ...
}) ();

Inside that function we run the rest of the code. A self-invoking function is run immediately when we start the app with # node index.js. The JavaScript async keyword means that some of the code is asynchronous, like the network operations in this case. For example, the .goto function needs some time to load the website, and in the meantime we could do something else. But by using await we are essentially ‘waiting’ for that call to establish the connection with the webpage. Next, we need to ‘wait’ for a selector. The webpage code is loading, but we can’t be sure it has finished until an element with the specific selector appears; it’s usually the selector of interest we are waiting for to show up in the DOM. It’s important to note that this element doesn’t need to exist in the HTML we receive from the web server. Many times it’s added dynamically by JavaScript code that is loaded with the initial HTML and run in the browser. Finally, when the selector appears, we evaluate our own JavaScript code document.querySelector('h2.screaming-banner-text').innerText inside the page to get the text of the element of interest (the h2 tag with the screaming-banner-text class).
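
The same evaluate pattern also works when you want more than one element. Inside the same try block you could, for example, collect every link headline; note that the 'h3 a' selector here is just an assumption and may not match CNN’s current markup:

// a sketch: return the text of every element matching an (assumed) selector
const headlines = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('h3 a'))
        .map((el) => el.innerText.trim());
});
console.log(headlines);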

We use the launch function to launch the browser (Chromium) instance and set up some properties, in our case whether we are going to launch a headless browser (think of the same browser running but with no visible window) or not. The headless property of the config object is set to false, which means we will see the window. That is useful while testing the project; later we can set it to true or simply delete the headless property, because its default value is true. browser.newPage() is self-explanatory: it creates a new page in the browser instance, which we are going to control, and browser.close() closes the browser when we are finished with scraping.
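
So once the scraper works, the launch part of the same async function could be as simple as this sketch, with the debugging options dropped:

// run headless (the default) and without the artificial slowMo delay
const browser = await puppeteer.launch();
const page = await browser.newPage();
// ... navigation and scraping as above ...
await browser.close();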

The try and catch block is useful because if one of the network functions fails, our app won’t crash along with it; that’s always good, at least to display errors and help us track down bugs in the code.
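
One refinement worth considering, not in the script above, is a finally block, so the browser gets closed even when an error is thrown:

try {
    // ... navigation and scraping ...
}
catch (err) {
    console.log('Unable to get headline from CNN website.');
}
finally {
    await browser.close();   // runs whether the scraping succeeded or failed
}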

Now we have our first scraper up and running. Keep in mind that website owners change the HTML structure from time to time, and at that point you will have to change your scraper’s code as well. Most of the time it will only be a matter of changing selectors.
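
Since selectors are the part most likely to break, it can help to keep them in one place; a small sketch, using the selector from the example above:

// keep selectors together so a site redesign means a one-line change
const SELECTORS = {
    headline: 'h2.screaming-banner-text'
};
await page.waitForSelector(SELECTORS.headline);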