There are times when we want to extract data from a website. In most cases the site provides an API, but that's not always available. When a website does not offer an API, the only way to get its data is to scrape it yourself.
In this tutorial, we are going to build a simple scraper in PHP to extract data from Wikipedia (I'd highly recommend using the Wikipedia API over this; it's here purely for tutorial purposes), and another scraper that extracts Anna University results.
There are various external libraries, such as Simple HTML DOM and cURL, which I highly recommend checking out after this tutorial, but here we are going to proceed with plain, raw PHP.
Before you proceed, remember that web scraping is definitely a gray area. Make sure you are not breaking the terms and conditions of the site you are scraping. This tutorial is for educational purposes only.
Simple Wikipedia Scraper
We are going to do this in 3 simple steps:
Step 1: Get the content
First, we have to pull the content of the Wikipedia page. For this, we use the file_get_contents() function, which takes in the URL of a file and returns its content as a string.
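A minimal sketch, assuming the Web scraping article as the target (the $url and $html variable names are my own):

$url = 'https://en.wikipedia.org/wiki/Web_scraping';
// Fetch the raw HTML as one big string (requires allow_url_fopen for URLs)
$html = file_get_contents($url);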
The above code returns the entire HTML of the Wikipedia page as a string, which a browser would render as the familiar article layout.
Step 2: Parse the string as DOM
DOM stands for Document Object Model. It defines the logical structure of documents and the way a document is accessed and manipulated. In simple words, it generates a tree-like structure to access the data in it. It is what the browser uses to parse HTML into displayable content.
The HTML string we just fetched can be converted to a DOM using the DOMDocument class in PHP. DOMDocument's constructor takes in an XML version and an encoding as parameters.
Don't worry if you don't know what those are; just use the line below:
$dom = new DOMDocument('1.0', 'UTF-8');
The loadHTML() method of DOMDocument takes in the content and creates a DOM from it.
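Putting the two steps together might look like this ($html is the string from Step 1; the libxml calls are my addition to silence the warnings DOMDocument raises on HTML5 markup):

libxml_use_internal_errors(true); // DOMDocument complains about tags it doesn't know
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html); // parse the string into a DOM tree
libxml_clear_errors();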
Note: If you're new to PHP classes, they are very similar to the ones in C++ and Java. The only difference you need to know to continue this tutorial is that PHP uses -> instead of . to access class members.
Step 3: Scrape the data
This is the part where you have to use your brain. Since no two websites are built the same way, you have to know your way through each website's structure. To do this, go through the website's source to determine the appropriate selectors that will get you to the data.
For example, if you go through the source of https://en.wikipedia.org/wiki/Web_scraping, you can see that the title has an ID of #firstHeading. We can retrieve a DOM element by its ID using the getElementById() method, which returns an object of type DOMElement.
The text content of a DOMElement is stored in its nodeValue attribute.
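So, assuming the $dom built in Step 2, extracting the title could look something like this (note that getElementById() takes the bare ID, without the # prefix used in CSS selectors):

// Grab the heading element by its ID and read its text
$title = $dom->getElementById('firstHeading')->nodeValue;
echo $title; // prints: Web scraping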
Next, we're going to extract the first paragraph of the Wikipedia page. Going through the source again, we can see that the first paragraph is within a p tag inside #mw-content-text.
getElementsByTagName() returns a DOMNodeList object, which is an ordered collection of DOMElement objects that can be accessed using the item() method. In our example, the first paragraph is returned by item(0). As with the title, we extract the text using the nodeValue attribute.
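A sketch of that lookup, again using the $dom from Step 2 (be aware that on the live page the first p inside #mw-content-text can be an empty placeholder, in which case you'd step forward to item(1) and so on):

// Find the content container, then the first paragraph inside it
$content = $dom->getElementById('mw-content-text');
$firstParagraph = $content->getElementsByTagName('p')->item(0)->nodeValue;
echo $firstParagraph;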
That’s it! We have successfully scraped data off the Wikipedia page.
Try going through the documentation to get a clearer picture. If you find the documentation difficult, I'd recommend going with one of the libraries mentioned above. If you have any problems following the tutorial, please do post them in the comments below.
Part 2 of this tutorial continues with a real-world example: scraping the Anna University website for results into an Excel file. If you liked this, please do check it out: Web Scraping 101: Anna University Result Scraper