Deep Tech Point
first stop in your tech adventure

PHP scraping and parsing for beginners without using regular expressions

February 7, 2022 | PHP

If you are a complete scraping beginner, this article is for you. If not, move on, because we are going to learn about scraping and parsing without using regular expressions or any supporting libraries; we will do it using only basic PHP string functions. Sure, PHP is not the best language choice for scraping and parsing, but it’s easy to learn and comprehend, so it is a good candidate for beginners. First, we are going to explain what parsing is. Then, we are going to have a look at arrays, the GET method, and the stream_context_create and file_get_contents functions. Afterward, we are going to apply the strpos and substr functions, and at the end of this mini-project, we are going to display the results using the print_r function.
Roll up your sleeves, PHP beginner, we are diving in.

We are sure you’ve heard of a website called everydayhealth.com. This is not about promoting them; it’s about taking them as an example of how to scrape and then parse content from their website in PHP without needing any knowledge of regular expressions. In a later article, we could take a look at how to parse the same content using regular expressions.

What are scraping and parsing in PHP?

In programming terminology, data scraping is a technique where a computer program extracts data from, let’s say, a website. The term parsing, on the other hand, originates from the Latin pars (orationis), which translates to “part (of speech).” When we deal with parsing, whether in computer science, data analysis, or natural language, we are dealing with syntactic analysis, which is all about analyzing a string of symbols. In PHP, or any other computer language for that matter, parsing a string (a chunk of characters) means taking that string, for example from a file, and extracting the exact information we want. In our example, we are going to fetch the source code of everydayhealth.com, and from that chunk of HTML (which is essentially text data) we are going to parse the URLs and accompanying titles of one section of their website. Let’s have a look at the code we wrote:

<?php

// Options for the "http" stream wrapper: request method and extra headers.
$options = array(
  'http' => array(
    'method' => "GET",
    'header' => "Accept-language: en\r\n" .
                "User-Agent: Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Mobile/15E148 Safari/604.1\r\n" // i.e. an iPad
    )
  );

$context = stream_context_create($options);

// Download the page source as one string.
$webpage = file_get_contents("https://www.everydayhealth.com", false, $context);

$urls = array();
$titles = array();

$urlStart = 0;
// strpos() returns false once no further link marker is found.
while(($urlStart = strpos($webpage, '<a class="cr-anchor homepage-hero__article-link" href="', $urlStart)) !== false) {
    $urlStart = $urlStart + 55; // skip the 55-character marker

    $urlEnd = strpos($webpage, '"', $urlStart); // closing quote of href
    $urls[] = substr($webpage, $urlStart, $urlEnd - $urlStart);

    $titleStart = $urlEnd + 61; // offset to the title text; depends on the exact markup
    $titleEnd = strpos($webpage, '</h3>', $titleStart);

    $titles[] = substr($webpage, $titleStart, $titleEnd - $titleStart);
}

print_r($urls);
print_r($titles);

?>

What are arrays in PHP and why do we need them when parsing?

In PHP, as in most other programming languages, an array is a special variable that stores multiple values in one single variable. In PHP, we create an array with the array() construct, and there are three types of arrays: indexed, associative, and multidimensional. In our parsing example, we are going to create two indexed arrays that we will print out at the end. One will hold the URLs that we parse from the source ($urls = array();), and the other will hold the accompanying headlines, or titles, of those URLs ($titles = array();).
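To make the three array types concrete, here is a minimal sketch. The example.com URLs and the $article and $articles variables are made up for illustration and are not part of our script.

```php
<?php
// Indexed array: values are stored under automatic integer keys 0, 1, 2, ...
$urls = array();
$urls[] = 'https://example.com/first';   // appended at index 0
$urls[] = 'https://example.com/second';  // appended at index 1

// Associative array: values are stored under named keys.
$article = array(
    'url'   => 'https://example.com/first',
    'title' => 'First article',
);

// Multidimensional array: an array whose values are arrays themselves.
$articles = array(
    array('url' => 'https://example.com/first',  'title' => 'First article'),
    array('url' => 'https://example.com/second', 'title' => 'Second article'),
);

echo count($urls), "\n";         // number of stored URLs
echo $article['title'], "\n";    // value stored under the 'title' key
echo $articles[1]['url'], "\n";  // 'url' of the second inner array
```

The empty square brackets ($urls[] = ...) append a value to the end of an indexed array, which is exactly how our script collects URLs and titles inside the loop.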
You may also notice that at the beginning of our example we created an array of parameters. Passed to stream_context_create(), it becomes a stream context: a set of options applied whenever you read or write a stream through a socket, or in simpler terms, whenever your PHP script connects to a web server that is serving a website (in our case, the everydayhealth.com website). What we’ve actually done is create a set of parameters that name the protocol to be used (http), the request method (GET), and additional HTTP headers such as the user agent signature. The most common user agents are the web browsers we use for surfing websites, such as Firefox, Chrome, and Safari. Each of these has a specific signature, so the web server knows who is using its resources (loading web pages, images, videos, etc.). Website owners often want to keep data scrapers out, so they limit access to known user agents such as web browsers. That’s why data scrapers sometimes set their own user agent signature to a known browser’s signature. These parameters will be used each time we open a socket connection to request the everydayhealth.com website. Bundling them into a stream context saves us time: we simply pass the context along with every request to everydayhealth.com instead of specifying the parameters over and over again.
In general, this part of the code is not something that you, as a data scraping beginner, should fully understand or, God forbid, write yourself. At this level of PHP understanding, and even later on when you deepen your PHP knowledge, simply copy-paste these few lines and don’t bother with them.
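If you are curious anyway, the following sketch builds a stream context and reads its options back with stream_context_get_options(), so you can see what will be sent without making any request. The ExampleBrowser/1.0 user agent is a made-up value used only for illustration.

```php
<?php
// Request options for the "http" wrapper; the header values are illustrative.
$options = array(
    'http' => array(
        'method' => 'GET',
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: ExampleBrowser/1.0\r\n",
    ),
);

$context = stream_context_create($options);

// Read the options back to confirm what the context will send.
$stored = stream_context_get_options($context);
echo $stored['http']['method'], "\n";
echo $stored['http']['header'];
```

Nothing is downloaded here; the context only stores the settings until a function such as file_get_contents() uses it.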

What do the stream_context_create() and file_get_contents() functions have to do with parsing?

The variable $options is passed to the stream_context_create() function, and the result is stored in the $context variable, like so: $context = stream_context_create($options);. The function stream_context_create() creates and returns a stream context.

The file_get_contents() function, on the other hand, returns the content of a file as a string. This file can be a local file on the server or a hypertext resource, in our case a web page identified by a URL instead of a hard disk path. The call looks like $webpage = file_get_contents("https://www.everydayhealth.com", false, $context);, and the function’s result is assigned to a variable called $webpage.
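The sketch below shows file_get_contents() reading a local file, which behaves the same way as reading a URL but needs no network connection. The temporary file and its content are assumptions made for the example. Note that the function returns false on failure, so it is worth checking the result before parsing it.

```php
<?php
// Create a small temporary file so the example runs without a network connection.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, '<h3>Hello, parser</h3>');

// file_get_contents() returns the whole file as one string, or false on failure.
$webpage = file_get_contents($path);
if ($webpage === false) {
    exit("Could not read $path\n");
}

echo $webpage, "\n";

unlink($path); // remove the temporary file again
```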

Declaring the two variables that we define as arrays

We’ve already mentioned this in the section about arrays: we define the two variables $urls and $titles, which are also printed out at the end of this little script. Both variables are declared with array():

$urls = array();
$titles = array();


Applying strpos() and substr() functions when parsing a string

Let’s take a look at a part of the source code of everydayhealth.com’s homepage:

<a class="cr-anchor homepage-hero__article-link" href="https://www.everydayhealth.com/fitness/best-weight-loss-apps-every-need/" rel="noopener noreferrer" target="_self"><h3 class="homepage-hero__title">The 18 Best Apps for Weight Loss: Diet Plan Tools, Fitness Trackers, and More</h3></a>

The above is the bit of code we’re looking at. In the next lines, we will apply the strpos() and substr() functions to extract the exact information we want.
We define $urlStart = 0; because we start searching from the beginning of the string (the HTML document), and then we bring a while loop into the game. In that loop, we assign to $urlStart the result of the strpos function, which finds the position of the first occurrence of a substring in a string. The first parameter we pass to the function is $webpage, the string to search in. The second parameter (<a class="cr-anchor homepage-hero__article-link" href=") is a unique string, a pattern that we chose because it marks the beginning of each URL we want to parse. The third parameter, $urlStart, is the position at which the search begins, so each pass of the loop continues where the previous one stopped. This unique string has 55 characters, so we skip past it with $urlStart = $urlStart + 55;.
Now we have to find where the URL ends, and we do that with $urlEnd = strpos($webpage, '"', $urlStart);. As the second parameter, we passed the character " because it marks the end of the URL. In this line of code: $urls[] = substr($webpage, $urlStart, $urlEnd - $urlStart); we use the PHP function substr to “parse” the URL and add it to the $urls array.
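To see these steps in isolation, here is a small sketch that runs them on the sample snippet above instead of the live page. The variable names $marker, $markerPos and $url are introduced only for this example; strlen() is used to confirm the marker length instead of counting by hand.

```php
<?php
// Sample markup copied from the snippet above.
$webpage = '<a class="cr-anchor homepage-hero__article-link" href="https://www.everydayhealth.com/fitness/best-weight-loss-apps-every-need/" rel="noopener noreferrer" target="_self"><h3 class="homepage-hero__title">The 18 Best Apps for Weight Loss: Diet Plan Tools, Fitness Trackers, and More</h3></a>';

$marker = '<a class="cr-anchor homepage-hero__article-link" href="';

$markerPos = strpos($webpage, $marker);         // where the marker begins
if ($markerPos === false) {
    exit("Marker not found\n");
}

$urlStart = $markerPos + strlen($marker);       // first character of the URL
$urlEnd   = strpos($webpage, '"', $urlStart);   // closing quote of href
$url      = substr($webpage, $urlStart, $urlEnd - $urlStart);

echo strlen($marker), "\n"; // 55
echo $url, "\n";
```

In this sample the marker sits at position 0, which is why the check compares against false with === rather than relying on the value being truthy.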
We applied similar logic to the titles that accompany the URLs we parsed. We defined $titleStart = $urlEnd + 61; to jump from the URL’s closing quote to the first character of the title, skipping the rest of the <a> tag and the opening <h3 class="homepage-hero__title"> tag. This offset depends on the exact markup the server returns, so count it against the source you actually receive: in the sample snippet above, the text from the closing quote through <h3 class="homepage-hero__title"> is 76 characters long. Then we defined the variable $titleEnd through the strpos() function, where the second parameter is </h3>, which is from our point of view unique, like so: $titleEnd = strpos($webpage, '</h3>', $titleStart);. We parse the titles using substr in a similar fashion as we did the URLs, like so:
$titles[] = substr($webpage, $titleStart, $titleEnd - $titleStart);
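Hard-coded offsets break as soon as the markup changes. As a sketch of an alternative, and assuming the title always sits inside <h3 class="homepage-hero__title"> as in the sample snippet, we can search for that tag and let strlen() do the counting. The names $titleMarker and $between are introduced only for this example. The last lines measure the distance between the URL’s closing quote and the title in the sample, which is worth comparing against any hard-coded offset.

```php
<?php
// Sample markup copied from the snippet shown earlier.
$webpage = '<a class="cr-anchor homepage-hero__article-link" href="https://www.everydayhealth.com/fitness/best-weight-loss-apps-every-need/" rel="noopener noreferrer" target="_self"><h3 class="homepage-hero__title">The 18 Best Apps for Weight Loss: Diet Plan Tools, Fitness Trackers, and More</h3></a>';

$titleMarker = '<h3 class="homepage-hero__title">';

$markerPos = strpos($webpage, $titleMarker);
if ($markerPos === false) {
    exit("Title marker not found\n");
}

$titleStart = $markerPos + strlen($titleMarker);       // first character of the title
$titleEnd   = strpos($webpage, '</h3>', $titleStart);  // start of the closing tag
$title      = substr($webpage, $titleStart, $titleEnd - $titleStart);

echo $title, "\n";

// Everything between the end of the URL and the start of the title in this sample.
$between = '" rel="noopener noreferrer" target="_self"><h3 class="homepage-hero__title">';
echo strlen($between), "\n"; // 76
```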

And here we are, at the end of the code, printing out both arrays, the URLs and the titles, like so:
print_r($urls);
print_r($titles);
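Putting the pieces together, here is a sketch of the whole loop that can be tried offline. The extractLinks() function and the two-link $sample string are hypothetical and not part of the original script; the sketch also searches for the title tag instead of using a fixed offset, and it stores each URL next to its title in one multidimensional array.

```php
<?php
// Hypothetical helper: collects every URL/title pair found in $html.
function extractLinks(string $html): array
{
    $linkMarker  = '<a class="cr-anchor homepage-hero__article-link" href="';
    $titleMarker = '<h3 class="homepage-hero__title">';
    $links  = array();
    $offset = 0;

    // strpos() returns false when nothing more is found; compare with !== so
    // that a match at position 0 is not mistaken for "not found".
    while (($pos = strpos($html, $linkMarker, $offset)) !== false) {
        $urlStart = $pos + strlen($linkMarker);
        $urlEnd   = strpos($html, '"', $urlStart);
        if ($urlEnd === false) {
            break; // malformed markup: no closing quote
        }

        $titlePos = strpos($html, $titleMarker, $urlEnd);
        if ($titlePos === false) {
            break; // no title tag after this link
        }
        $titleStart = $titlePos + strlen($titleMarker);
        $titleEnd   = strpos($html, '</h3>', $titleStart);
        if ($titleEnd === false) {
            break; // title tag is never closed
        }

        $links[] = array(
            'url'   => substr($html, $urlStart, $urlEnd - $urlStart),
            'title' => substr($html, $titleStart, $titleEnd - $titleStart),
        );
        $offset = $titleEnd; // continue searching after this title
    }

    return $links;
}

// Made-up sample markup with two links.
$sample =
    '<a class="cr-anchor homepage-hero__article-link" href="https://example.com/a" target="_self">' .
    '<h3 class="homepage-hero__title">First title</h3></a>' .
    '<a class="cr-anchor homepage-hero__article-link" href="https://example.com/b" target="_self">' .
    '<h3 class="homepage-hero__title">Second title</h3></a>';

$links = extractLinks($sample);
print_r($links);
```

To run it against a real page, pass the string returned by file_get_contents() to the function instead of $sample.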