Deep Tech Point - learn tech easy way

There are different methods for website data scrapping and importing it into Excel. In this tutorial we will show you how to do it with Puppeteer, Node JS package. The method using Puppeteer might look scary at first but it’s actually quite easy and what is the most important you can get anything you want from the web page very accurately. Knowing JavaScript is useful but interestingly it’s something you can learn quickly up to the point to be able to scrape data.

Actually knowing basics of CSS is even more important because making query selectors will be your major task for scrapping. So now it’s the time to get an idea about CSS selectors in our tutorial. It’s also advisable to familiarize yourself with JavaScript function querySelector examples.

But how you will know what exact CSS selectors you need? The answer is you will look into web page HTML source code or even more precise you will look into DOM (Document Object Model) structure. Easiest method to see DOM structure of web page is to use Developer tools which is available on all major web browsers. You can access developers tools by right clicking some element on the webpage (for example news title), and click Inspect option which will automatically open Developer tools and display DOM structure of the element. Sometimes it’s possible to find selector you need in the HTML source code but nowadays it’s not useful most of the times because JavaScript change initial HTML code and it’s not possible to know which classes to select from the beginning. But once the web page is fully loaded you have more or less stable DOM structure.

Let’s get quickly through the new project creation and Puppeteer installation. If you need to read more about Node JS and Puppeteer installation using NVM now it’s the right time to do so.

# mkdir title-scraper
# cd title-scraper
# npm init -y
Wrote to /home/ubuntu/title-scraper/package.json:

{
  "name": "title-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}

# npm install puppeteer
# npm install write-excel-file
# nano index.js

We also installed Node JS Excel manipulation library, exceljs. There are many other Excel libraries you could also try. It depends on what exactly you need to do in Excel because you could also save your data in CSV format which can be opened in Excel as well and does not require any external helper package. But if you know you’ll work data mining or any kind of analysis in Excel, you can make your life easier by saving data with specialized Excel package where you can create different spreadsheets and modify Excel document when you have a need.

Now we are ready to add code to index.js file we have been created with nano editor.

const excel = require('exceljs');
const puppeteer = require('puppeteer');

(async function() {
    const browser = await puppeteer.launch({
        headless: true,
        slowMo: 500
    });
    const page = await browser.newPage();
    try {
        await page.goto('https://edition.cnn.com/');
        await page.waitForSelector('h2.screaming-banner-text');
        const headlines = await page.evaluate(() => {
                const titles = document.querySelectorAll('section + section h3 .cd__headline-text');
                return Array.from(titles).map(l =>
                        l.innerText);
        });

        const wb = new excel.Workbook();
        wb.creator = 'Me';
        wb.lastModifiedBy = 'Her';
        wb.created = new Date(2014, 4, 30);
        wb.modified = new Date();
        wb.lastPrinted = new Date(2016, 9, 27);
        const ws = wb.addWorksheet('CNN Headlines');
        ws.views = [
                {state: 'frozen', ySplit: 1}
        ];
        ws.getRow(1).font = { bold: true };
        ws.columns = [
                {header: 'Headline', key: 'hd', width: 150}
        ];

        headlines.forEach(item => {
                ws.addRow([item]);
        });

        await wb.xlsx.writeFile('cnn-headlines.xlsx');
    }
    catch(err) {
        console.log(err);
    }
    browser.close();
}) ();

If this is your first time you are working with Puppeteer scrapers you should read tutorial on how to scrape website.
In this code we’re scraping CNN headlines with particular CSS query selector, section + section h3 .cd__headline-text. It means skip first section tag since there’s some redundant data in first section we don’t need. Then select all other sections’s h3 tags, particularly all tags with .cd__headine-text classes. It might look a bit difficult at first to write right query selector but it’s usually trial and error process and becomes easier and more clear with practice.

Now that we know how to extract data from particular webpage we are going to save it to the Excel sheet.
First we are going to create Excel workbook and worksheet and add initial configuration, like adding workbook properties (creator, lastModifiedBy, created, etc), adding columns, setting font weight, setting state of the first row to frozen, and similar formatting.

const wb = new excel.Workbook();
wb.creator = 'Me';
wb.lastModifiedBy = 'Her';
wb.created = new Date(2014, 4, 30);
wb.modified = new Date();
wb.lastPrinted = new Date(2016, 9, 27);
const ws = wb.addWorksheet('CNN Headlines');
ws.views = [
    {state: 'frozen', ySplit: 1}
];
ws.getRow(1).font = { bold: true };
ws.columns = [
    {header: 'Headline', key: 'hd', width: 150}
];

After we added scraped headlines in worksheet rows it’s time to save all to Excel file on hard disk.

headlines.forEach(item => {
    ws.addRow([item]);
});
await wb.xlsx.writeFile('cnn-headlines.xlsx');

Further improvements of this code could include opening existing Excel file and adding new worksheets, for example one worksheet par day, so you can keep track of headlines or anything else you scraped.

How to Scrape Data from a Website (CNN) into Excel