This is the follow-up to Web Scraping 101: Build a simple web scraper using PHP. If you haven’t seen it yet, I’d highly recommend you to read that before continuing.
In this tutorial, we will be building a simple web scraper that extracts the result from Anna University’s website. This tutorial is strictly for educational purposes.
We are going to do this in 4 simple steps:
Step 1: read the website and create the DOM
Anna University provides a nice little endpoint using which we can view our result by sending a GET Request.
If you have gone through the previous post, the below needs no explanation.
Going through the source of the site, we can see that structure of the site is
pretty messed up. If you have tried running the above code, you’d have noticed that
it throws many errors at
loadHTML() line. This is because libxml library used by
PHP throws an error when the HTML it receives is not structured properly.
To get over this, we use
libxml_use_internal_errors() function which stops the libxml from throwing errors to the standard error.
Step 2: Extract the data
As we have seen already, the structure of the site is pretty messed up. So, we
need to find a pattern to extract the relevant data. Going through the site once
more, we can see that all the fields we need are enclosed within a
To extract an element using its attribute, we can use
method takes in a XPath(XPath is a query language for selecting nodes from an XML document.)
and returns a
getElementsByTagName in previous post.
The XPath query we will be using is
The above XPath query is broken down below:
|//||makes the selection throughout the document.|
|td||selects all the
First, we have to create a
DOMXPath's constructor takes in
DOMDocument object. Then, we have to call the
query method with the above
In the above code, data inside each selected node is stored into an array.
Step 3: Clean the data
Not all the data that we have retrieved is useful to us. Before we proceed to generate excel, let us first sanitize the data a little more.
From the above code’s result, we can clearly see that, from
$extracted to the
end of the array, every
3Nth element is the subject code and
3N + 1th element
is the grade.
Step 4: Store it to Excel
There are various external libraries to do this, but for simplicity, we are doing this with raw PHP.
To generate an excel file, first we need to include some headers.
header("Content-Type: application/vnd.ms-excel"); header("Content-disposition: attachment; filename=result.xls"); header("Pragma: no-cache"); header("Expires: 0");
The explanation below can be skipped:
|Content-Type: application/vnd.ms-excel||Sets the MIME Type for Excel|
|Content-disposition: attachment; filename=result.xls||Sets the file as downloadable with the name
|Pragma: no-cache||When the no-cache directive is present in a request message, an application SHOULD forward the request toward the origin server even if it has a cached copy of what is being requested.|
|Expires: 0||The Expires header contains the date/time after which the response is considered stale. Invalid dates, like the value 0, represent a date in the past and mean that the resource is already expired.|
Now, anything to the standard output will be stored as
.xls. Each cell value
in a row is separated using tabs
\t and each row is separated using line breaks
echo 'Name' . "\t" . 'Phone' . "\n"; echo 'Joey Tribbiani' . "\t" . '12345678' . "\n";
$result to Excel:
That’s it, folks! We have successfully generated an Excel file with the scraped result. If you have any problem following the tutorial, please do post it in the comments below.
GITHUB link to the completed project: GITHUB