Basic concepts about HTML, XPath, Scrapy, and spiders
“It would be nice to have all the documents of the website,” one of her colleagues said.
“Yeah, that could give us a lot of information,” said another colleague.
“Can you do the scraper?” They both turned to look at her.
“Ehhhh… I could…” she started mumbling.
“Perfect,” they both said.
“…try,” she finished saying, but it was too late.
She had never built a scraper in her life, so she was pretty overwhelmed at that moment.
“I don’t know what to do,” she said when she called me, crying. “I think that it is too hard for me to do it.”
“Don’t worry! We can do it together,” I said.
I understood her completely. The first time I had to code a scraper I felt lost as well.
It was like I was watching a magic trick. I remember when I started reading about scraping.
“Web scrapers… mm… HTML tags… mm… spiders… What…?” It sounded like a foreign language to me.
But the more I read, the more I began to understand that, as with magic, you need to know what to look for to understand the trick.
What is a web scraper anyway? A web scraper is a program that automatically gathers data off of websites.
We can collect all the content of a website or just specific data about a topic or element. This will depend on the parameters we set in our script. This versatility is the beauty of web scrapers.
Let’s set up a hypothetical example. We want to scrape data from a website whose URL is https://www.mainwebsite.com. In particular, this website contains different documents, and we are interested in getting their text.
On the main page, we find three subsections, as you can see in the drawing below.
Clicking on topic1, for example, will take us to another page (https://www.mainwebsite.com/topic1) where we can find a list of the documents we are interested in.
If we click on document1, we’ll end up on another page (https://www.mainwebsite.com/topic1/document1/date) where we can obtain the content of that document.
If we were to do it manually, we would copy and paste the content in a file. Instead, we are going to automate this process.
We saw the path we need to follow to get our data. Now, we should find a way to tell the web scraper where to look for the information.
There is a lot of data on the website that we are not interested in, such as images, links to other pages, and headers. As a consequence, we need to be very specific.
Here is where we start to unravel the magic trick. Let’s dissect it then.
You don’t need to learn how to code using HTML to build a scraper. But you should know how to identify HTML tags and elements.
Why? Because the data will have a specific HTML tag. And we can extract this data by just showing the scraper the correct HTML element to look for.
An HTML tag consists of a tag name enclosed in angle brackets. Frequently, you need an opening and a closing tag that frame a particular piece of text.
The opening tag consists of a name, followed by optional attributes. The closing tag consists of the same name preceded by a forward slash (/).
Each tag name refers to a particular element. We will pay attention to the following tags:
<p> for paragraphs;
<a> or anchor tag for hyperlinks;
<img> for images;
<h1>, <h2>, etc. for text headings;
<div> for divisions, i.e. generic containers;
<tr> for table rows; and
<td> for table cells.
Most tags also take id and class attributes. The id specifies a unique id for that HTML tag within the HTML document. The class groups elements together, typically to define which style those tags take.
Let’s observe an HTML element such as <h6 class="text-primary">28th June 2019 Edition</h6>.
In this case, we want to extract “28th June 2019 Edition”, the content of the HTML element. We would tell the scraper: look for all <h6> elements and give me the one with class “text-primary”.
If there is more than one element with these characteristics, we would need to be more specific. Indicating the ID attribute can accomplish this.
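To make the idea concrete, here is a minimal sketch of "look for all <h6> elements and give me the one with class text-primary". It uses only Python's standard html.parser module rather than Scrapy, and the markup and the ClassTextFinder helper are hypothetical, written just for this illustration.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Collect the text of every <tag> whose class attribute contains css_class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.matches = []      # text of each matching element
        self._inside = False   # True while we are inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._inside = True
            self.matches.append("")

    def handle_data(self, data):
        if self._inside:
            self.matches[-1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False


# Hypothetical markup, written only for this illustration.
html = """
<h6 class="text-muted">Archive</h6>
<h6 class="text-primary">28th June 2019 Edition</h6>
"""

finder = ClassTextFinder("h6", "text-primary")
finder.feed(html)
print(finder.matches)
```

Only the second <h6> is collected, because the first one has a different class. Later on, Scrapy and XPath will do this kind of matching for us in a single expression.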
OK. But where do I look for this information on a website?
This is an easy step: Right-click anywhere on the webpage. A small window will appear. Next, you click Inspect like in the image below.
Moreover, you can use the cursor selector (see the images below) to select an item in the website.
As a consequence, the HTML element corresponding to the selected item will be highlighted.
In the above diagram, we can observe what a typical HTML structure looks like.
Normally, all the content is included inside the opening and closing
body tags. Every element has its own tags.
Some HTML elements are nested inside others giving a hierarchy. This can be represented in a tree.
If we move from left to right in the tree, we move forward through the generations. If we move from top to bottom, we stay within the same generation, moving between siblings when the elements come from the same parent element.
Pay attention to the two <div> elements. They are siblings because they share <body> as a parent. They are second-generation descendants of the html element. Each of them has children. The first <div> has two children. Its first child is a paragraph element containing “Web scraping is useful!”. However, this element is not a descendant of the second <div>, because you cannot follow a path from that div element down to the paragraph element.
These relationships will help us also when indicating the desired element to the web scraper.
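The parent, child, and sibling relationships can be checked with a few lines of Python. This is only a sketch: the page below is a hypothetical reconstruction of the tree described above (the text of every paragraph except the first is made up), and it uses the standard library's ElementTree instead of Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical page that mirrors the tree described above.
page = """
<html>
  <body>
    <div>
      <p>Web scraping is useful!</p>
      <p>Another paragraph</p>
    </div>
    <div>
      <p>A paragraph in the second div</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)                    # the <html> element
body = root.find("body")
first_div, second_div = body.findall("div")   # siblings: both are children of <body>

# Direct children of the first <div>.
children_of_first = [child.text for child in first_div]

# Every <p> that can be reached by walking down from the second <div>.
descendants_of_second = [el.text for el in second_div.iter("p")]

print(children_of_first)
print("Web scraping is useful!" in descendants_of_second)
```

The first print lists the two children of the first <div>; the second prints False, because no path leads from the second <div> down to that paragraph.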
2. XPath stands for XML Path Language. What does it have to do with web scraping? We’ve learned how to identify HTML elements. But the question that arises now is: how do I point out the element to the scraper? The answer is XPath.
XPath is a special syntax that can be used to navigate through elements and attributes in an XML document. Also, it will help us get a path to a certain HTML element and extract its content.
Let’s see how this syntax works.
/ is used to move forward one generation,
tag names give the direction, i.e. which element to move to,
[] tells us which of the siblings to choose,
// looks through all future generations,
@ selects attributes,
* is a wildcard indicating we want to ignore tag types.
If we see the following XPath:
Xpath = '//div[@class="first"]/p[2]'
we would understand that from all (//) div elements with class “first” (div[@class="first"]), we want the second ([2]) paragraph (p).
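Here is a small sketch that evaluates an expression of this kind, selecting the second paragraph inside every div with class first. The page is hypothetical, and the example relies on the standard library's ElementTree, which supports only a subset of XPath and needs a leading "." when searching from an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written only for this illustration.
page = """
<html>
  <body>
    <div class="first">
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <div class="second">
      <p>Ignored paragraph</p>
      <p>Also ignored</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)

# ".//div" searches all generations below the root,
# [@class='first'] filters on the class attribute,
# and p[2] keeps the second <p> child of each matching div.
matches = root.findall(".//div[@class='first']/p[2]")
texts = [p.text for p in matches]
print(texts)
```

The paragraphs of the second div are skipped because its class attribute does not match.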
Fortunately, web browsers have an easy way to get the XPath of an element.
When you are inspecting the website, right-click on the highlighted element. A small window will be displayed. You can then copy the XPath.
3. Scrapy is a Python framework designed for crawling websites and extracting structured data. It was specially designed for web scraping, but nowadays it can also be used to extract data using APIs.
In order to install Scrapy, you need to have Python installed. It is advisable to work only with Python 3, since Python 2 reached its end of life in January 2020.
To install Scrapy, you can do it using pip:
pip install Scrapy
or using conda
conda install -c conda-forge scrapy
One important aspect of Scrapy is that it uses Twisted, a popular event-driven networking framework for Python. Twisted works asynchronously for concurrency.
What does this mean? Synchronous means that you have to wait for a job to be completed in order to start a new job. Asynchronous means you can move to another job before the previous job has finished.
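The difference is easier to see in a toy example. The sketch below uses Python's asyncio rather than Twisted (so it is not how Scrapy is implemented), and fake_download is a hypothetical stand-in that just sleeps instead of making a network request.

```python
import asyncio
import time


async def fake_download(url, delay):
    """Stand-in for a network request: wait `delay` seconds, then return."""
    await asyncio.sleep(delay)
    return f"content of {url}"


async def download_all(urls, delay=0.2):
    # gather() starts every coroutine before waiting, so the waits overlap
    # instead of happening one after another.
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


urls = ["page1", "page2", "page3"]
start = time.perf_counter()
results = asyncio.run(download_all(urls))
elapsed = time.perf_counter() - start

print(results)
print(f"{elapsed:.2f}s")  # roughly one delay, not three
```

Run synchronously, the three waits would add up; run asynchronously, the total time stays close to the longest single wait.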
4. Spiders. Because Scrapy works asynchronously, it can crawl a group of URLs in a very short time. Consequently, instead of scraping a single website, Scrapy works with spiders.
Spiders are classes that we define and that Scrapy uses to crawl multiple pages, following links and scraping information.
Spiders must meet certain requirements to work correctly. They must subclass scrapy.Spider and define the initial requests to make. They can also determine how to follow links in the pages and how to parse the downloaded page content.
Let’s see these requirements in detail:
- Every spider must be a subclass of the scrapy.Spider class: This means that its class definition must list scrapy.Spider as the base class.
- The name of the Spider must be unique within a project.
- They must define the initial requests to make: Scrapy calls the start_requests() method to initiate the requests. It must return an iterable of Requests which the Spider will begin to crawl from. If we do not define it ourselves, the default implementation builds the requests from start_urls.
- They can determine how to parse the downloaded content: Normally, a parse() method is defined. Scrapy calls it to handle the response downloaded for each of the requests made. The parse() method usually parses the response, extracting the scraped data, finding new URLs to follow, and creating new requests from them.
- We can also find the allowed_domains list. This tells the spider which domain names it is allowed to scrape.
- Also, we can set a start_urls list. It is used to specify which website we want to scrape. Each URL should include the scheme the site actually uses; for our example, that means https rather than http.
Now, we have dissected all the components of a web scraper.
→ It’s time to write it!! ←
We’ll bring back our initial example of the website with URL https://www.mainwebsite.com.
Let’s review the facts:
- We have a main website with three links to three different sections.
- In each section, we have a list of links to documents. Each section has a specific URL, e.g. https://www.mainwebsite.com/topic1.
- Every link takes us to the document content that we are interested in. We can find every link in the HTML structure of each section.
First, we’ll design our file architecture.
Let’s explore our folders.
We’ve created a master folder called
scraper where we are going to store all the files related to our scraper.
Then, we’ll collect all the scraped data in JSON files. Each of those files will be saved in the
The common folder has another folder called
spiders. There, we’ll save one file for each spider. And we’ll create one spider for each topic. So, in total three spiders.
Now, it’s time to understand the files we’ve created.
Let’s start with settings.py. The Scrapy settings allow us to customize the behavior of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
There, we can specify the name of the bot implemented by the Scrapy project, a list of modules where Scrapy will look for spiders, and whether the HTTP cache will be enabled, among others.
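As an illustration, a settings.py covering those three options might look like the sketch below. The setting names are Scrapy's own, but the values (the bot name and the module path) are placeholders chosen for this example, not necessarily the ones used in the original project.

```python
# settings.py (sketch; the values are placeholders)

# Name of the bot implemented by this Scrapy project.
BOT_NAME = "scraper"

# Modules where Scrapy looks for spider classes.
SPIDER_MODULES = ["scraper.spiders"]

# Cache responses on disk so repeated runs do not re-download pages.
HTTPCACHE_ENABLED = True
```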
Now, we arrive at the two main files.
A. We’ll start with the topic1.py spider. We’ll examine only one example, as they are all very similar.
The first thing that we need to do is import all the needed libraries.
Obviously, we need to import scrapy. The re module will allow us to extract information using regular expressions. The json module will help us when saving information. The os module is useful for handling directories.
We stated before that a spider has to inherit from scrapy.Spider. So we’ll create a class called FirstSpider that subclasses it. We’ll assign the
name topic1. Then, we’ll define the
We also need to create the start_requests() method to initialize the requests. Inside the method, we define a list of URLs for the requests. In our case, this list only contains the URL www.mainwebsite.com/topic1. Then, we are going to make each request with yield instead of return. We’ll tell Scrapy to handle the downloaded content with the parse() method by passing it in the callback argument.
Until now, you might think that the explanation about HTML and XPath was quite useless. Well, now it’s the moment we’ll need it.
After we define our method to start the initial request, we need to define the method that will handle our downloaded information.
In other words, we need to decide what we want to do with all the data and which information is worth saving.
For this, let’s suppose this is the HTML structure of our website.
As you can see in the picture, the highlighted element is the element we need to get to extract our links.
Let’s construct our path to get there. From all (//) div elements that have the class col-md-12 (div[@class='col-md-12']), we need the attribute href from the a children (/a/@href).
So, we have then our XPath: //div[@class='col-md-12']/a/@href
In the parse method, we’ll use response.xpath() to indicate the path and extract() to extract the content of every element.
We are expecting to get a list of links. We want to extract what is shown in those links. The spider will need to follow each of them and parse their content using a second parse method that we’ll call parse_first.
Notice that this time we are sending the links using follow on the response variable instead of creating a Request.
The parse_first method has to be defined to tell the spider what to do with each link it follows.
We are going to extract the title and the body of the document.
After exploring the HTML structure of one document, we’ll get any element whose id is titleDocument, and all paragraphs that are children of any element whose id is BodyDocument.
Because we don’t care about which tag they have, we’ll use the * wildcard.
After getting each paragraph, we are going to append it to a list.
After that, we’ll join all the paragraphs in the text list together. We’ll extract the date. Finally, we’ll define a dictionary with the
Lastly, we’ll save the data into a JSON file.
Here is the definition of the function extractdate, where we’ll use regular expressions to extract the date.
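Since the original function is not reproduced here, the following is only a sketch of how such a helper and the saving step might be written. It assumes the date appears in the document URL in YYYY-MM-DD form; the dictionary keys, the save_document helper, and the file naming are hypothetical as well.

```python
import json
import os
import re
import tempfile


def extractdate(text):
    """Return the first YYYY-MM-DD date found in `text`, or None."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return match.group(0) if match else None


def save_document(title, paragraphs, url, folder):
    """Join the paragraphs, build the record, and write it to a JSON file."""
    record = {
        "date": extractdate(url),
        "title": title,
        "text": " ".join(paragraphs),
    }
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{record['date']}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)
    return path


# Hypothetical values for illustration; the file goes to a temporary folder.
out_dir = tempfile.mkdtemp()
saved = save_document(
    "Document 1",
    ["First paragraph.", "Second paragraph."],
    "https://www.mainwebsite.com/topic1/document1/2019-06-28",
    out_dir,
)
with open(saved, encoding="utf-8") as f:
    print(json.load(f))
```

If the real site encodes the date differently, only the regular expression inside extractdate needs to change.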
Now, our spider is complete.
B. It’s time to investigate the scraper.py file. Not only do we need to create spiders, but we also need to launch them.
First, we’ll import the required modules from Scrapy.
CrawlerProcess will initiate the crawling process and
settings will allow us to arrange the settings.
We’ll also import the three spider classes created for each topic.
After that, we initiate a crawling process.
We tell the process which spiders to use and, finally, we start the crawling.
Perfect! We now have our scraper built!!!
But wait, how do we actually start scraping our website?
In the terminal, we navigate from the command line to our scraper folder (using cd). Once inside, we just launch the spiders by running scraper.py with the python3 command you can see in the picture.
And voilà! The spiders are crawling the website!
Here, I listed a couple of very nice resources and courses to learn more about web scraping: