Web Scraping with PHP

A quick look at web scraping with PHP.

In the last blog, we covered how Python can be used to scrape websites and we also took a look at some of the most important Python libraries in the context of web scraping.

Python is often used for web scraping because it is one of the most accessible programming languages available today. However, several other programming languages can get the same job done.

This blog covers web scraping in a language with a reputation for being far less accessible – PHP.

PHP is the most popular server-side programming language in use today, although it is not uncommon for programmers to dread it to a certain degree due to various quirks. We’ll keep the focus of this blog sharp, however, and figure out how PHP can be used to scrape websites.

What are HTTP requests?

We went into detail about how HTTP works in the previous post, but here are the basic bullet points:

  • HTTP is the most commonly used application-layer protocol of the Internet protocol suite, also known as TCP/IP.
  • HTTP uses requests and responses to facilitate communication between clients and servers. The client sends a “request” and the server sends back a “response”.
  • Since we’re working on scraping websites, we’ll be using HTTP to send requests for data and receive the data back in responses from the server.

Here’s an example of how HTTP works. When you visit a website using a browser, the browser sends a request which, in its most basic form, looks something like this:

GET / HTTP/1.1
Host: www.google.com

In the above request, we’re essentially telling the server that we’re sending a GET request for the homepage of the host, which in this case is “www.google.com”. When the browser sends this request, the server sends back a response containing the HTML page that your browser then renders.
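To make the exchange concrete, here is roughly what the start of the server’s response looks like (illustrative and heavily trimmed):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 15107

<!doctype html>
<html>
...
</html>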

Using HTTP requests with PHP

Now that we have a good idea of how HTTP works, let’s use PHP to send a request! In this example, we’ll be using a very basic PHP function – fsockopen() – to send the request. You wouldn’t want to use it in actual production code because there are better options in PHP that we’ll explore later.

However, if you’re new to PHP, the example below shows how fsockopen() can be used to send an HTTP request and receive the response:

<?php
# fsockopen.php

// First, we'll create the header of our request
$request = "GET / HTTP/1.1\r\n";
$request .= "Host: www.google.com\r\n";
// Ask the server to close the connection after the response;
// without this, feof() below would block until the keep-alive
// connection times out
$request .= "Connection: close\r\n";
$request .= "\r\n";

// Then we open a connection to www.google.com and send the request
$connection = fsockopen('www.google.com', 80);
fwrite($connection, $request);

// We loop and print the response for as long as the server is
// sending something back
while (!feof($connection)) {
    echo fgets($connection);
}

// We terminate the connection once the server stops sending data
fclose($connection);

Again, the example provided above is a very basic way of handling requests and responses. Ideally, you want to use something that is easier to scale and does not require a lot of boilerplate code that can be time-consuming to write.

To solve this, we’re going to take a look at a more robust tool that you can use to handle requests and responses in PHP – cURL.

cURL stands for “client for URLs” and you can think of it as a far more robust and user-friendly way of making HTTP requests than fsockopen().

With that being said, let’s take a look at how we can use cURL to handle requests and responses:

<?php
# curl.php

// Creating a cURL handle and initializing a connection
$ch = curl_init();

// Instead of manually creating your headers, you 
// can just use simple functions to modify the headers

// Setting the URL
curl_setopt($ch, CURLOPT_URL, 'http://www.google.com');

// Setting the HTTP method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

// This setting determines what is done with the response
// We’re setting it to return the response instead of just
// printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Storing the response in $response
$response = curl_exec($ch);

// Printing the response
echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;

// Closing the cURL handle 
curl_close($ch);

By using cURL, you have a lot more control over how the process is handled, and a lot of boilerplate stuff is already taken care of.
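For instance, a few common options let you follow redirects, set timeouts, and send a custom User-Agent header. Here’s a minimal sketch of what that might look like (the option values are just illustrative):

<?php
# curl_options.php

$ch = curl_init('http://www.google.com');

// Follow any HTTP redirects automatically
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// Give up if the server takes longer than 10 seconds to respond
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// Identify the scraper with a custom User-Agent header
curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraper/1.0');

// Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);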

Now let’s take a look at a real web scraping example with PHP and also how to handle strings and regular expressions in PHP.

Parsing Web Pages using strings and regular expressions in PHP

For our first example of web scraping with PHP, we’ll be extracting data from a sample table on W3Schools – https://www.w3schools.com/html/html_tables.asp. We’re using this page because the data on it is well-structured. To read the web page, we’ll first send a request and get the response back. To do this, we can use the following code:

<?php
# w3schools.php

$html = file_get_contents('https://www.w3schools.com/html/html_tables.asp');
echo $html;

You might have noticed that instead of using cURL, we used the file_get_contents() function. It is not the ideal choice for serious web scraping with PHP, but it works fine for small scripts or when you want to get a simple job done quickly.
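If you do stick with file_get_contents(), you can still exercise some control over the request through a stream context. Here’s a minimal sketch that attaches a custom User-Agent header (the header value is just an example):

<?php

// Build a stream context so we can attach headers to the request
$context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => "User-Agent: MyScraper/1.0\r\n",
    ],
]);

$html = file_get_contents('https://www.w3schools.com/html/html_tables.asp', false, $context);
echo $html;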

The output of the example script above should be the HTML code of the page. If you look through the response, you will notice the following section, which is the markup for the table we want to extract data from:

<table id="customers">
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
  <tr>
    <td>Centro comercial Moctezuma</td>
    <td>Francisco Chang</td>
    <td>Mexico</td>
  </tr>
  <tr>
    <td>Ernst Handel</td>
    <td>Roland Mendel</td>
    <td>Austria</td>
  </tr>
  <tr>
    <td>Island Trading</td>
    <td>Helen Bennett</td>
    <td>UK</td>
  </tr>
  <tr>
    <td>Laughing Bacchus Winecellars</td>
    <td>Yoshi Tannamuri</td>
    <td>Canada</td>
  </tr>
  <tr>
    <td>Magazzini Alimentari Riuniti</td>
    <td>Giovanni Rovelli</td>
    <td>Italy</td>
  </tr>
</table>

The entire point of web scraping is that we can program scripts and applications that can read an HTML response and parse the data to find the information that we need. So, let’s use PHP to parse the response we received.

Building on the example posted above, we use strings to extract a section of the HTML response that we need:

<?php

$html = file_get_contents('https://www.w3schools.com/html/html_tables.asp');

// Find where the customers table starts
$start = stripos($html, 'id="customers"');

// Find the closing tag of that table, searching from $start onwards
$end = stripos($html, '</table>', $start);

// Cut out the section in between
$length = $end - $start;

$htmlsection = substr($html, $start, $length);

echo $htmlsection;

The output of this script is a section of the HTML response. However, do note that we’re only extracting the section using strings. The output is no longer a valid HTML response and it is better to think of it as just text at this point.

In order to parse the response into the format that we need, we’ll be using regular expressions and extend the example like this:

<?php

$html = file_get_contents('https://www.w3schools.com/html/html_tables.asp');

$start = stripos($html, 'id="customers"');

$end = stripos($html, '</table>', $start);

$length = $end - $start;

$htmlsection = substr($html, $start, $length);

# echo $htmlsection;

// Capture the text inside every <td> cell of the extracted section
preg_match_all('@<td>(.+)</td>@', $htmlsection, $matches);
$listItems = $matches[1];

foreach ($listItems as $item) {
    echo "{$item}\n\n";
}

The output of the script now should be something like this:

Alfreds Futterkiste

Maria Anders

Germany

Centro comercial Moctezuma

Francisco Chang

Mexico
…

Let’s say we want to just extract the list of company names and not the contact or country. To do so, we’ll modify the script to only print every 3rd item in the list starting from 0. Here’s how we’ll do it:

<?php
# w3schools.php

$html = file_get_contents('https://www.w3schools.com/html/html_tables.asp');

$start = stripos($html, 'id="customers"');

$end = stripos($html, '</table>', $start);

$length = $end - $start;

$htmlsection = substr($html, $start, $length);

preg_match_all('@<td>(.+)</td>@', $htmlsection, $matches);
$listItems = $matches[1];

foreach ($listItems as $index => $item) {
    // Company names sit at positions 0, 3, 6, ... in the flat list
    if ($index % 3 === 0) {
        echo "{$item}\n";
    }
}

In the above example, we modified the loop to only print the names of the companies based on their position in the $listItems array. However, as you can clearly see, this way of doing things can get messy really fast.
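One way to keep things a little tidier with the same string-and-regex approach is to group the flat list of cells into rows of three, so each company lines up with its contact and country. A small sketch building on the $listItems array from above:

<?php

// Group the flat list of <td> values into rows of
// [company, contact, country]
$rows = array_chunk($listItems, 3);

foreach ($rows as [$company, $contact, $country]) {
    echo "{$company} ({$contact}, {$country})\n";
}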

To write better PHP code for web scraping, we’ll be using more advanced tools like Guzzle, XML, and XPath that make things a bit easier for you when it comes to web scraping.

Using XML, XPath, and Guzzle

Guzzle is one of the most widely used HTTP clients for PHP. It makes it really easy for you to handle HTTP requests and responses while providing additional useful features such as error handling, a user-friendly API, and more flexibility in general.

To install Guzzle, run the following command on your terminal:

composer require guzzlehttp/guzzle

In order to explain how you can use Guzzle to extract data from the web with PHP, let’s look at an example. Say we want to extract the area of Afghanistan from this page – http://example.webscraping.com/places/default/view/Afghanistan-1.

If we take a quick look at the HTML code of the page, we find that the field we want to extract is buried within a few nested HTML tags. To make parsing the data easier, we’ll be using XPath.

XPath is a document query language that is used widely in web scraping as it allows you to quickly select nodes in a DOM document.
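To give you a feel for the syntax, here are a few XPath expressions and what they select (purely illustrative):

  • //a – selects every <a> element in the document
  • //table[1] – selects the first <table> element on the page
  • //table[1]/tr[2]/td[2] – selects the 2nd <td> inside the 2nd <tr> of the 1st table
  • //div[@id="places"] – selects every <div> whose id attribute is "places" (a made-up id for illustration)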

Here’s an example where we use Guzzle to send a request and receive the response, and then use XPath to extract the precise field that we’re looking for:

<?php
# country_size.php

require 'vendor/autoload.php';

$httpClient = new \GuzzleHttp\Client();

$response = $httpClient->get('http://example.webscraping.com/places/default/view/Afghanistan-1');

$htmlString = (string) $response->getBody();

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);

$xpath = new DOMXPath($doc);

// Wrapping the expression in string() makes evaluate() return the
// node's text content instead of a DOMNodeList
$area = $xpath->evaluate('string(//table[1]/tr[2]/td[2])');

echo $area;

In the example provided above, we create a Guzzle client instance, $httpClient. We use the client to send the request and get the response back from the website. We use the libxml_use_internal_errors() function to suppress the minor HTML parsing errors that would otherwise get in the way.

After that, we create $doc which is an instance of a DOM document and then we fill it with the response that we received from the server.

Then we use XPath to travel through the document and find the precise node that we’re looking for. In this case, we’re looking for the value in the 2nd <td> cell of the 2nd <tr> row of the 1st table on the page. That’s why we use ‘//table[1]/tr[2]/td[2]’ in the example above, wrapped in string() so that the expression evaluates to the cell’s text.

This is a very basic example of XPath, and it can do a lot more, which we’ll explore in future blogs. For the time being, let’s take a look at another very popular HTTP client for PHP – Goutte.

Goutte

Goutte is another HTTP client that you can find in the vast PHP ecosystem. While it is not the most downloaded HTTP client, it is still very popular.

What makes Goutte really great for web scraping is that you can leverage the power of different components of the Symfony framework, created by the same developer behind Goutte. Some of the components of the Symfony framework that are really useful for web scraping include the DomCrawler component, the HTTP Client, the BrowserKit component, and more.

To install Goutte, you need to run the following command in the terminal:

composer require fabpot/goutte

Let’s use Goutte to recreate the last example from the section above. Goutte brings certain advantages to this example: for instance, we can let the crawler evaluate the XPath expression rather than building a DOM document manually. Here’s the example:

<?php
# goutte_xpath.php

require 'vendor/autoload.php';

$client = new \Goutte\Client();

$crawler = $client->request('GET', 'http://example.webscraping.com/places/default/view/Afghanistan-1');

// evaluate() returns a Crawler for a node-set expression, so we
// call text() to get the cell's text content
$area = $crawler->evaluate('//table[1]/tr[2]/td[2]')->text();

echo $area;

As demonstrated above, the code is simplified significantly with Goutte. We also don’t have to deal with creating a DOM document and then using XPath. Instead, we can just use the user-friendly functions of the HTTP client to get the job done!
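For example, if you prefer CSS selectors over XPath, the same crawler exposes a filter() method (this relies on the symfony/css-selector component being available). A quick sketch:

<?php
# goutte_css.php

require 'vendor/autoload.php';

$client = new \Goutte\Client();

$crawler = $client->request('GET', 'http://example.webscraping.com/places/default/view/Afghanistan-1');

// Select the same cell with a CSS selector instead of XPath
$area = $crawler->filter('table tr:nth-child(2) td:nth-child(2)')->text();

echo $area;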

Do note that you might get some errors if you use Goutte without installing the masterminds/html5 library due to some problems that Goutte might have with parsing HTML5. You can install the library with the following command in the terminal:

composer require masterminds/html5

Symfony Panther and headless browsing with PHP

In all the examples that we’ve used in this blog so far, we download the HTML code of the page from the server and then parse through the code to find what we’re looking for.

However, most modern websites use JavaScript to dynamically update the content on the page that we see. If we’re just downloading the static HTML code from a modern website that uses JavaScript, it is possible that the code we downloaded would be different from the “live” version of the website.

To solve this problem, we need to use headless browsers. A headless browser, like the name implies, is basically the raw engine of a web browser that you can program to do what you want. This way, you can scrape data from websites that use JavaScript too!

It should be noted that Goutte does offer some limited browser-like functionality, but it does not execute JavaScript, so it is not the best option if you’re looking specifically for a PHP headless browser.

Symfony Panther is a standalone library and one of the headless browser options available in the PHP ecosystem. Since it is part of the same Symfony family as Goutte, it is also really easy to work with if you’re already used to Goutte.

To finish the article off, we’ll modify our previous example so that we not only get the data we’re looking for but we also save a screenshot of the webpage.

<?php
# panther_screenshot.php

require 'vendor/autoload.php';

$client = new \Goutte\Client();

$crawler = $client->request('GET', 'http://example.webscraping.com/places/default/view/Afghanistan-1');

// As before, evaluate() returns a Crawler, so we call text()
$area = $crawler->evaluate('//table[1]/tr[2]/td[2]')->text();

echo $area;

$client = \Symfony\Component\Panther\Client::createFirefoxClient();

// Load the page in the headless browser, then save a screenshot of it
$client->get('http://example.webscraping.com/places/default/view/Afghanistan-1');
$client->takeScreenshot('screenshot.jpg');

Web Scraping with PHP can be fun!

The objective of this blog was to serve as a quick introduction to the world of web scraping with PHP. Instead of going into too much detail, we took a look at a few different ways of approaching the task.

If you want to create a production-level web scraping script with PHP, you might run into a number of issues such as rate-limiting, IP address blocking, and the frustrating complexities of running headless browsers for different use cases.
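For instance, routing your requests through a proxy and sending a realistic User-Agent header can mitigate some blocking. Guzzle supports both through its request options; here’s a quick sketch (the proxy address below is a placeholder, not a real endpoint):

<?php

require 'vendor/autoload.php';

$httpClient = new \GuzzleHttp\Client();

$response = $httpClient->get('http://example.webscraping.com/places/default/view/Afghanistan-1', [
    // Placeholder proxy endpoint - replace with your own
    'proxy' => 'http://127.0.0.1:8080',
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (compatible; MyScraper/1.0)',
    ],
]);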

If you’re looking for an easier way to deal with these problems, you can take a look at Scrapper and how it can make things easier for you.
