How to extract online data using Python

Basic concepts about HTML, XPath, Scrapy, and spiders

“It would be nice to have all the documents of the website,” one of her colleagues said.

“Yeah, that could give us a lot of information,” said another colleague.

“Can you do the scraper?” They both turned to look at her.

“Ehhhh… I could…” she started mumbling.

“Perfect,” they both said.

“…try,” she finished saying, but it was too late.

She had never built a scraper in her life, so she felt pretty overwhelmed at that moment.

“I don’t know what to do,” she said when she called me, crying. “I think that it is too hard for me to do it.”

“Don’t worry! We can do it together,” I said.

I understood her completely. The first time I had to code a scraper I felt lost as well.

It was like I was watching a magic trick. I remember when I started reading about scraping.

“Web scrapers… mm… HTML tags… mm… spiders… What?” It sounded like a foreign language to me.

But the more I read, the more I began to understand that, as with magic, you need to know what to look for to understand the trick.

What is a web scraper anyway? A web scraper is a program that automatically gathers data off of websites.

We can collect all the content of a website or just specific data about a topic or element. This will depend on the parameters we set in our script. This versatility is the beauty of web scrapers.

Let’s set a hypothetical example. We want to scrape data from a website whose URL is https://www.mainwebsite.com. In particular, this website contains different documents, and we are interested in getting their text.

On the main page, we find three subsections, as you can see in the drawing below.

Clicking topic1, for example, will take us to another page (https://www.mainwebsite.com/topic1) where we can find a list of the documents we are interested in.

If we click on document1, we’ll end up on another page (https://www.mainwebsite.com/topic1/document1/date) where we can obtain the content of that document.

If we were to do it manually, we would copy and paste the content in a file. Instead, we are going to automate this process.

We saw the path we need to follow to get our data. Now, we should find a way to tell the web scraper where to look for the information.

There is a lot of data on the website that we are not interested in, such as images, links to other pages, and headers. As a consequence, we need to be very specific.

Here is where we start to unravel the magic trick. Let’s dissect it then.

1. HTML stands for Hypertext Markup Language. Along with Cascading Style Sheets (CSS) and JavaScript, it is used for structuring and presenting content on interactive websites.

You don’t need to learn how to code using HTML to build a scraper. But you should know how to identify HTML tags and elements.

Why? Because the data will have a specific HTML tag. And we can extract this data by just showing the scraper the correct HTML element to look for.

An HTML tag consists of a tag name enclosed by angle brackets. Frequently, you need an opening and an ending tag that frame a particular piece of text.

The opening tag consists of a name, followed by optional attributes. The ending tag consists of the same name preceded by a forward slash (/).

Each tag name refers to a particular element. We will pay attention to the following tags: <p> for paragraphs; <a>, or anchor tag, for hyperlinks; <img> for images; <h1>, <h2>, etc. for text headers; <div> for divisions (generic containers); <tr> for table rows; and <td> for table cells.

Most tags also take id or class attributes. The id specifies a unique identifier for that HTML tag within the HTML document. The class groups elements that share a role and is commonly used to define which style the tag takes.

Let’s observe an HTML element:

In this case, we want to extract “28th June 2019 Edition”, the content of the HTML element. We would tell the scraper: look for all <h6> elements and give me the one with class “text-primary”.

If there is more than one element with these characteristics, we would need to be more specific. Indicating the ID attribute can accomplish this.
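
Since the element itself is easier to grasp in code, here is a minimal sketch of that search. The markup is an assumption reconstructed from the description above, the id value edition-title and the class name H6Finder are made up for the example, and it relies on Python’s built-in html.parser module rather than on Scrapy.

```python
from html.parser import HTMLParser

# Assumed markup, reconstructed from the description in the text.
# The id value "edition-title" is hypothetical.
html = '<h6 class="text-primary" id="edition-title">28th June 2019 Edition</h6>'


class H6Finder(HTMLParser):
    """Collects the text of every <h6> element that has a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False   # True while we are inside a matching <h6>
        self.found = []       # text content of the matching elements

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the opening tag
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h6" and self.wanted_class in classes:
            self.inside = True
            self.found.append("")

    def handle_data(self, data):
        # Text between the opening and the ending tag
        if self.inside:
            self.found[-1] += data

    def handle_endtag(self, tag):
        if tag == "h6":
            self.inside = False


finder = H6Finder("text-primary")
finder.feed(html)
print(finder.found)
```

Running it prints ['28th June 2019 Edition']: the tag name and the class attribute were enough to locate the content.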

OK. But where do I look for this information on a website?

This is an easy step: Right-click anywhere on the webpage. A small window will appear. Next, you click Inspect like in the image below.

You’ll have access to the website’s source code, the images, the CSS, the fonts and icons it uses, and the JavaScript code.

Moreover, you can use the cursor selector (see the images below) to select an item in the website.

As a consequence, the HTML element corresponding to the selected item will be highlighted.

In the above diagram, we can observe what a typical HTML structure looks like.

Normally, all the content is included inside the opening and closing body tags. Every element has its own tags.

Some HTML elements are nested inside others giving a hierarchy. This can be represented in a tree.

If we move from left to right in the tree, we move forward through generations. If we move from top to bottom, we move within the same generation, or between siblings when the elements come from the same parent.

Pay attention to the two <div> elements. They are siblings because they share <body> as a parent. They are second-generation descendants of the <html> element. Each of them has children. The first <div> has two children. Its first child is a paragraph element containing “Web scraping is useful!”. However, this element is not a descendant of the second <div>, because you cannot follow a path from that <div> element to the paragraph element.

These relationships will also help us when indicating the desired element to the web scraper.
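
To make these family relationships concrete, here is a small sketch that rebuilds a similar tree with Python’s built-in xml.etree.ElementTree module. Only the text “Web scraping is useful!” comes from the description above; the class names and the other paragraphs are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed page: only the first paragraph's text comes from the article,
# the remaining content is invented for the example.
page = """
<html>
  <body>
    <div class="first">
      <p>Web scraping is useful!</p>
      <p>Second paragraph of the first div.</p>
    </div>
    <div class="second">
      <p>A paragraph in the second div.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page.strip())
body = root.find("body")
first_div, second_div = body.findall("div")   # siblings: same parent <body>

# Children are the elements exactly one generation below.
print([child.tag for child in first_div])      # ['p', 'p']
print(first_div[0].text)                       # Web scraping is useful!

# iter() walks an element and all of its descendants.
target = first_div[0]
print(target in list(first_div.iter()))        # True: descendant of the first div
print(target in list(second_div.iter()))       # False: no path from the second div
```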

2. XPath stands for XML Path Language. What does it have to do with web scraping? We’ve learned how to identify HTML elements. The question that arises now is: how do I point out the element to the scraper? The answer is XPath.

XPath is a special syntax that can be used to navigate through elements and attributes in an XML document. Also, it will help us get a path to a certain HTML element and extract its content.

Let’s see how this syntax works.

/ is used to move forward one generation; tag names give the direction to the element we want; [] tells us which of the siblings to choose; // looks for all future generations; @ selects attributes; and * is a wildcard indicating that we want to ignore tag types.

If we see the following XPath:

Xpath = '//div[@class="first"]/p[2]'

we would understand that, from all (//) the div elements with class “first” (div[@class="first"]), we want the second ([2]) paragraph (p) element.
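
We can try this path on a toy page. The sketch below is only an approximation of what Scrapy does: it uses Python’s built-in xml.etree.ElementTree, which supports a limited subset of XPath and needs a leading dot to search from the root element. The markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed markup for illustration; everything except the class "first" is invented.
page = """<html><body>
  <div class="first">
    <p>Paragraph one</p>
    <p>Paragraph two</p>
  </div>
  <div class="other">
    <p>Not selected</p>
    <p>Not selected either</p>
  </div>
</body></html>"""

root = ET.fromstring(page)

# Same path as in the text, with the leading "." that ElementTree requires.
xpath = ".//div[@class='first']/p[2]"
matches = root.findall(xpath)
print([p.text for p in matches])
```

This prints ['Paragraph two']: the predicate on the class keeps only the first div, and [2] picks its second paragraph.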

Fortunately, web browsers have an easy way to get the XPath of an element.

When you are inspecting the website, right-click on the highlighted element. A small window will be displayed. You can then copy the XPath.

3. Scrapy is a Python framework designed for crawling websites and extracting structured data. It was specially designed for web scraping, but nowadays it can also be used to extract data using APIs.

In order to install Scrapy, you need to have Python installed. It is advisable to work only with Python 3, since Python 2 reached its end of life in January 2020.

To install Scrapy, you can do it using pip:

pip install Scrapy

or using conda

conda install -c conda-forge scrapy

One important aspect of Scrapy is that it uses Twisted, a popular event-driven networking framework for Python. Twisted works asynchronously for concurrency.

What does this mean? Synchronous means that you have to wait for a job to be completed in order to start a new job. Asynchronous means you can move to another job before the previous job has finished.
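
The difference is easy to see in a toy example. The sketch below uses Python’s built-in asyncio module instead of Twisted, so it only illustrates the idea and says nothing about how Scrapy is implemented. The job names and waiting times are made up, and asyncio.sleep stands in for waiting on a website’s response.

```python
import asyncio


async def job(name, seconds, finished):
    # asyncio.sleep stands in for waiting on a network response
    await asyncio.sleep(seconds)
    finished.append(name)


async def one_after_another(finished):
    # Synchronous style: each job must complete before the next one starts
    await job("slow", 0.2, finished)
    await job("fast", 0.05, finished)


async def all_at_once(finished):
    # Asynchronous style: both jobs start together,
    # and each one finishes as soon as its own wait is over
    await asyncio.gather(
        job("slow", 0.2, finished),
        job("fast", 0.05, finished),
    )


sync_order, async_order = [], []
asyncio.run(one_after_another(sync_order))
asyncio.run(all_at_once(async_order))
print(sync_order)    # ['slow', 'fast']
print(async_order)   # ['fast', 'slow']
```

In the second run, the fast job does not have to wait for the slow one, which is why it finishes first.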

4. Spiders. Because of this characteristic, Scrapy can crawl a group of URLs in a very short time. Consequently, instead of scraping a single website, Scrapy works with spiders.

Spiders are classes that we define and that Scrapy uses to crawl multiple pages, following links and scraping information.

Spider structure:

Spiders must meet certain requirements to work correctly. They must subclass scrapy.Spider and define the initial requests to make. They can also determine how to follow links in the pages and how to parse the downloaded page content.

Let’s see these requirements in detail:

  1. Every spider must be a subclass of the scrapy.Spider class: in the class definition, scrapy.Spider goes inside the parentheses after the class name.
  2. The name of the spider must be unique within a project.
  3. They must define the initial requests to make: this is usually done in a method called start_requests(), which Scrapy calls to initiate the crawl. It must return an iterable of Requests from which the spider will begin to crawl.
  4. They can determine how to parse the downloaded content: normally, a parse() method is defined. Scrapy calls it to handle the response downloaded for each of the requests made. The parse() method usually parses the response, extracts the scraped data, finds new URLs to follow, and creates new requests from them.
  5. We can also define the allowed_domains list. It tells the spider which domain names it is allowed to scrape.
  6. We can also set a start_urls list. It specifies the URLs where the crawl begins; when it is defined, Scrapy’s default start_requests() builds the initial requests from it. Each URL should include its scheme: if a URL was generated with http://, change it to https:// when the site is served over HTTPS.

Now, we have dissected all the components of a web scraper.

→ It’s time to write it!! ←

We’ll bring our initial example of the website with URL https://www.mainwebsite.com.

Let’s review the facts:

  • We have a main website with three links to three different sections.
  • In each section, we have a list of links to documents. Each section has a specific URL, e.g. https://www.mainwebsite.com/topic1.
  • Every link takes us to the document content that we are interested in. We can find every link in the HTML structure of each section.

First, we’ll design our file architecture.

Let’s explore our folders.

We’ve created a master folder called scraper where we are going to store all the files related to our scraper.

Then, we’ll collect all the scraped data in JSON files. Each of those files will be saved in the JSON folder.

The common folder has another folder called spiders. There, we’ll save one file for each spider. And we’ll create one spider for each topic. So, in total three spiders.
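
For reference, the layout can be sketched as a short script that creates the empty folders and files in a throwaway directory. The names are inferred from this description and from the imports and paths used in the code further on; where settings.py lives is an assumption.

```python
from pathlib import Path
import tempfile

# Layout inferred from the text and from the imports used later on.
# The location of settings.py is an assumption.
layout = [
    "scraper/scraper.py",
    "scraper/json/documents.json",
    "scraper/common/settings.py",
    "scraper/common/spiders/topic1.py",
    "scraper/common/spiders/topic2.py",
    "scraper/common/spiders/topic3.py",
]

base = Path(tempfile.mkdtemp())   # throwaway directory for the demo
for relative in layout:
    path = base / relative
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()                  # create an empty placeholder file

# Print every folder and file that was created
for path in sorted(base.rglob("*")):
    print(path.relative_to(base))
```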

Now, it’s time to understand the files we’ve created.

Let’s start with the settings.py. The Scrapy settings allow us to customize the behavior of all Scrapy components, including the core, extensions, pipelines, and spiders themselves.

There, we can specify the name of the bot implemented by the Scrapy project, a list of modules where Scrapy will look for spiders, and whether the HTTP cache will be enabled, among others.
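
A settings.py covering those three options could look like the sketch below. BOT_NAME, SPIDER_MODULES, and HTTPCACHE_ENABLED are standard Scrapy setting names, but the values are illustrative assumptions based on the folder names used in this example.

```python
# settings.py (sketch): the values are illustrative assumptions

# Name of the bot implemented by this Scrapy project
BOT_NAME = "scraper"

# Modules where Scrapy will look for spiders
SPIDER_MODULES = ["common.spiders"]

# Whether the HTTP cache is enabled
HTTPCACHE_ENABLED = True
```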

Now, we arrive at the main two files.

A. We’ll start with the topic1.py spider. We’ll examine only one example, as they are all very similar.

The first thing that we need to do is import all the needed libraries.

import scrapy
import re
import json
import os

Obviously, we need to import scrapy. The re module will allow us to extract information using regular expressions. The json module will help us when saving information. The os module is useful to handle directories.

We stated before that a spider has to inherit from scrapy.Spider. So we’ll create a class called FirstSpider that subclasses it. We’ll assign it the name topic1. Then, we’ll define the allowed_domains list.

class FirstSpider(scrapy.Spider):
    name = "topic1"
    allowed_domains = ['www.mainwebsite.com']

We also need to create the start_requests() method to initialize the requests. Inside the method, we define a list of URLs for the requests. In our case, this list only contains the URL https://www.mainwebsite.com/topic1. Then, we are going to make a request with scrapy.Request.

We’ll use yield instead of return, so the method hands Scrapy one request at a time. Through the callback argument, we’ll tell Scrapy to handle the downloaded content with the parse() method.

    # Initialize requests
    def start_requests(self):
        # List of URLs to request
        urls = ['https://www.mainwebsite.com/topic1']
        for url in urls:
            # Yield one request per URL; parse() handles each response
            yield scrapy.Request(url=url, callback=self.parse)

Until now, you might have thought that the explanation about HTML and XPath was quite useless. Well, this is the moment we need it.

After we define our method to start the initial request, we need to define the method that will handle our downloaded information.

In other words, we need to decide what we want to do with all the data and which information is worth saving.

For this, let’s suppose this is the HTML structure of our website.

As you can see in the picture, the highlighted element is the element we need to get to extract our links.

Let’s construct our path to get there. From all (//) the div elements that have the class col-md-12 (div[@class='col-md-12']), we need the href attribute of their a children (a/@href).

So, we have then our XPath: //div[@class='col-md-12']/a/@href.

In our parse method, we’ll use response.xpath() to indicate the path and extract() to extract the content of every element.

We are expecting to get a list of links. We want to extract what is shown in those links. The spider will need to follow each of them and parse their content using a second parse method that we’ll call parse_first.

Notice that this time we are sending the links using follow in the response variable instead of creating a Request.

    def parse(self, response):
        links = response.xpath('//div[@class="col-md-12"]/a/@href').extract()
        for link in links:
            yield response.follow(url=link, callback=self.parse_first)
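
To see what this step produces without running a spider, here is a rough stand-alone sketch. response.follow accepts relative links and resolves them against the URL of the page being parsed, which we imitate here with urljoin from the standard library. The markup and the links are invented for the example, and because the built-in ElementTree module cannot end a path with /@href, we select the a elements and read the attribute from each one.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumed markup: the class name comes from the text, the links are invented.
section_url = "https://www.mainwebsite.com/topic1"
page = """<html><body>
  <div class="col-md-12"><a href="/topic1/document1/date">Document 1</a></div>
  <div class="col-md-12"><a href="/topic1/document2/date">Document 2</a></div>
  <div class="footer"><a href="/about">About</a></div>
</body></html>"""

root = ET.fromstring(page)

# Select the <a> children of the matching divs, then read their href attribute.
anchors = root.findall(".//div[@class='col-md-12']/a")
links = [a.get("href") for a in anchors]

# Relative links are resolved against the URL of the page they came from.
absolute = [urljoin(section_url, link) for link in links]
print(absolute)
```

The result holds the two document URLs, https://www.mainwebsite.com/topic1/document1/date and https://www.mainwebsite.com/topic1/document2/date, while the footer link is ignored because its div has a different class.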

Next, the parse_first method has to be defined to tell the spider how to parse the page behind each link.

We are going to extract the title and the body of the document.

After exploring the HTML structure of one document, we’ll get any element whose id is titleDocument, and all paragraphs that are children of any element whose id is bodyDocument.

Because we don’t care about which tag they have, we’ll use the * wildcard.

After getting each paragraph, we are going to append them to a list.

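    def parse_first(self, response):
        text = []
        # get() returns the first match as a string, markup included
        title = response.xpath("//*[@id='titleDocument']").get()
        for paragraph in response.xpath("//*[@id='bodyDocument']/p"):
            # Append each paragraph to the list defined above
            text.append(paragraph.get())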

After that, we’ll join all the paragraphs in the text list together. We’ll extract the date. Finally, we’ll define a dictionary with the date, title and text.

        # we join all the elements of the list together
        text = " ".join(text)

        # We extract the date 
        date = self.extractdate(text)
        
        document = {
            "date": date,
            "title": title,
            "text": text
        }

Lastly, we’ll save the data into a JSON file.

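        json_file = "./json/documents.json"
        # Load the documents saved so far (start empty if there is no file yet)
        if os.path.exists(json_file):
            with open(json_file, encoding='utf-8') as file:
                data = json.load(file)
        else:
            data = []

        # A single saved document is a dict: wrap it in a list
        if isinstance(data, dict):
            data = [data]

        data.append(document)

        with open(json_file, 'w', encoding='utf-8') as file:
            json.dump(data, file, ensure_ascii=False)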

Here is the definition of the function extractdate, where we’ll use regular expressions to extract the date.

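    # Extract a date such as 28/06/2019 found at the very end of the text
    def extractdate(self, text):
        match = re.search(r"(\d{2}/\d{2}/\d{4})$", text)
        # Return None when the text does not end with a date
        return match.group(1) if match else None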
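
To check how this pattern behaves, we can try it on a couple of made-up strings outside the spider. The function name extract_date and the sample texts are assumptions for the example; the pattern is the same one used above, so the date is only found when it sits at the very end of the text.

```python
import re


def extract_date(text):
    """Return a date like 28/06/2019 found at the end of text, or None."""
    match = re.search(r"(\d{2}/\d{2}/\d{4})$", text)
    return match.group(1) if match else None


print(extract_date("Minutes of the meeting held on 28/06/2019"))  # 28/06/2019
print(extract_date("28/06/2019 was the date of the meeting"))     # None
```

The second call returns None because the $ anchor requires the date to be the last thing in the text.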

Now, our spider is complete.

B. It’s time to investigate the scraper.py file. Not only do we need to create spiders, we also need to launch them.

First, we’ll import the required modules from Scrapy. CrawlerProcess will initiate the crawling process, and get_project_settings will load the project settings.

We’ll also import the three spider classes created for each topic.

# Import scrapy modules
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


from common.spiders.topic1 import FirstSpider
from common.spiders.topic2 import SecondSpider
from common.spiders.topic3 import ThirdSpider

After that, we initiate a crawling process.

# Initiate a crawling process with the project settings
process = CrawlerProcess(get_project_settings())

We tell the process which spiders to use and finally, we’ll start the crawling.

# Tell the scraper which spider to use
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.crawl(ThirdSpider)

# Start the crawling
process.start()

Perfect! We now have our scraper built!!!

But wait, how do we actually start scraping our website?

In the terminal, we navigate to our scraper folder (using cd). Once inside, we launch the spiders by running the scraper.py file with the python3 command, as you can see in the picture.

And voilà! The spiders are crawling the website!

Here, I listed a couple of very nice resources and courses to learn more about web scraping:

  1. DataCamp Course.
  2. Web Scraping tutorial
  3. Scrapy documentation
  4. HTML long and short explanation

Source: towardsdatascience