Originally posted on Medium
Extracting data and ensuring data quality
This is the second article of my web scraping guide. In the first article, I showed you how you can find, extract, and clean the data from one single web page on IMDb.
In this article, you’ll learn how to scrape multiple web pages — a list that’s 20 pages and 1,000 movies total — with a Python web scraper.
Where We Left Off
In the previous article, we scraped and cleaned the data for the year of release, length of movie, number of votes, and the `us_gross` earnings of all movies on the first page of IMDb’s Top 1,000 movies.
This was the code we used:
And our results looked like this:
What We’ll Cover
I’ll be guiding you through these steps:
- You’ll request the unique URLs for every page on this IMDb list.
- You’ll iterate through each page using a `for` loop, and you’ll scrape each movie one by one.
- You’ll control the loop’s rate to avoid flooding the server with requests.
- You’ll extract, clean, and download this final data.
- You’ll use basic data-quality best practices.
Introducing New Tools
These are the additional tools we’ll use in our scraper: NumPy’s `arange` function (to build the sequence of page numbers) and the `sleep` and `randint` functions from Python’s `time` and `random` modules (to control the crawl rate).
Time to Code
As mentioned in the first article, I recommend following along in a Repl.it environment if you don’t already have an IDE.
I’ll also be writing out this guide as if we were starting fresh, minus all the first guide’s explanations, so you aren’t required to copy and paste the first article’s code beforehand.
You can compare the first article’s code with this article’s final code to see how it all worked — you’ll notice a few slight changes.
Alternatively, you can go straight to the code here.
Now, let’s begin!
Let’s import our previous tools and our new tools —
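The embedded code block didn’t survive the repost, so here’s a minimal sketch of the imports I’d expect at this point, based on the tools used throughout this article (the aliases `pd` and `np` are conventional assumptions):

```python
# Tools from the first article
import requests                 # fetches each page's HTML
from bs4 import BeautifulSoup   # parses the HTML
import pandas as pd             # stores and cleans the scraped data

# New tools for this article
import numpy as np              # np.arange builds the page-number sequence
from time import sleep          # pauses the loop between requests
from random import randint      # randomizes the pause length
```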
Initialize your storage
Like previously, we’re going to continue to use our empty lists as storage for all the data we scrape:
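The original list of empty containers was embedded and is now missing; a sketch along these lines fits the fields discussed in this guide (the exact list names are my assumptions):

```python
# One empty list per field we scrape; each movie appends one value to each
titles = []
years = []
time = []
imdb_ratings = []
metascores = []
votes = []
us_gross = []
```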
English movie titles
After we initialize our storage, we should have our code that makes sure we get English-translated titles from all the movies we scrape:
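The embedded snippet is gone, but requesting English-language content is typically done by sending an `Accept-Language` header with every request; the exact header value here is my assumption:

```python
# Ask IMDb for English-translated titles instead of locale-based ones
headers = {"Accept-Language": "en-US, en;q=0.5"}
```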
Analyzing our URL
Let’s go to the URL of the page we’re scraping.
Now, let’s click on the next page and see what page 2’s URL looks like: https://www.imdb.com/search/title/?groups=top_1000&start=51&ref_=adv_nxt
And then page 3’s URL: https://www.imdb.com/search/title/?groups=top_1000&start=101&ref_=adv_nxt
What do we notice about the URL from page 2 to page 3?
`&start=51` is added to the URL when we go to page 2, and the number `51` turns into `101` on page 3.
This makes sense because there are 50 movies on each page. Page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.
Why is this important? This information will help us tell our loop how to go to the next page to scrape.
Refresher on ‘for’ Loops
Just like the loop we used to loop through each movie on the first page, we’ll use a `for` loop to iterate through each page on the list.
To refresh, this is how a `for` loop works:
for <variable> in <iterable>:
    <statement(s)>

`<iterable>` is a collection of objects, e.g. a list or tuple. The `<statement(s)>` are executed once for each item in `<iterable>`. The loop `<variable>` takes on the value of the next element in `<iterable>` each time through the loop.
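As a quick illustration of that pattern, here’s a tiny loop where the variable takes each value of the iterable in turn:

```python
# The loop variable n takes the values 1, 2, 3, one per iteration
squares = []
for n in [1, 2, 3]:
    squares.append(n * n)
```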
Changing the URL Parameter
As I mentioned earlier, each page’s URL follows a certain logic as the web pages change. To make the URL requests we’d have to vary the value of the page parameter, like this:
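The embedded snippet didn’t survive, but from the breakdown that follows, the page-parameter sequence is built like this:

```python
import numpy as np

# One entry per page: 1, 51, 101, ..., 951 (20 pages of 50 movies each)
pages = np.arange(1, 1001, 50)
```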
Breaking down the URL parameters:
- `pages` is the variable we create to store the sequence of page numbers our loop will iterate through.
- `np.arange(1, 1001, 50)` is a function in the NumPy Python library (note the single "r" in `arange`). It takes up to four arguments, but we’re only using the first three: `start`, `stop`, and `step`, which define where the sequence begins, where it ends, and the spacing between each number.
- `1`: Start at 1. This will be our first page’s URL number.
- `1001`: Why stop at 1001? The number in the `stop` parameter defines the end of the array, but it isn’t included in the array. The last page of movies is at URL number 951, and that page holds movies 951-1000. If we used 951, this page wouldn’t be included in our scraper, so we have to go one step further to make sure we get the last page.
- `50`: We want the URL number to change by 50 each time the loop comes around; this parameter tells it to do that.
Looping Through Each Page
Now we need to create another `for` loop that’ll loop our scraper through the `pages` sequence we created above, hitting each different URL we need. We can do this simply like this:
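The embedded loop is missing; here’s a sketch of what it does, shown building the 20 URLs without actually requesting them (the live request belongs inside the loop body, as described in the next section):

```python
import numpy as np

pages = np.arange(1, 1001, 50)

urls = []
for page in pages:
    # Each iteration targets one page of 50 movies
    urls.append("https://www.imdb.com/search/title/?groups=top_1000"
                "&start=" + str(page) + "&ref_=adv_nxt")
```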
Breaking this loop down:
- `page` is the variable that’ll iterate through `pages`, the sequence of page numbers we created above.
Requesting the URL + ‘html_soup’ + ‘movie_div’
Inside this new loop is where we’ll request our new URLs, add our `html_soup` (helps us parse the HTML files), and add our `movie_div` (stores each div container we’re scraping). This is what it’ll look like:
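The embedded code is gone, so here’s a sketch of the parsing step. To keep it self-contained I parse a small inline HTML sample instead of a live response; in the real loop you’d replace `sample_html` with `page.text`, where `page = requests.get(url, headers=headers)`:

```python
from bs4 import BeautifulSoup

# Stand-in for page.text; the markup mimics IMDb's movie containers
sample_html = """
<div class="lister-item mode-advanced"><h3 class="lister-item-header">
<a href="/title/tt0111161/">The Shawshank Redemption</a></h3></div>
<div class="lister-item mode-advanced"><h3 class="lister-item-header">
<a href="/title/tt0068646/">The Godfather</a></h3></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")
```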
- `page` is the variable we’re using, which stores each of our new URLs
- `requests.get()` is the method we use to grab the contents of each URL
- `"https://www.imdb.com/search/title/?groups=top_1000&start="` is the part of the URL that stays the same when we change each page
- `+ str(page)` tells the request to add each iteration of `page` (the page number we’re using to change the URL) into the URL request. It also makes sure it’s a string we’re using, not an integer or float, because it’s a URL we’re building.
- `+ "&ref_=adv_nxt"` is added to the end of every URL because this part also does not change when we go to the next page
- `headers=headers` tells our scraper to bring us English-translated content from the URLs we’re requesting
- `soup` is the variable we create to assign the method `BeautifulSoup` to
- `BeautifulSoup` is a method we’re using that specifies a desired format of results
- `(page.text, "html.parser")` grabs the text contents of `page` and uses the HTML parser; this allows Python to read the components of the page rather than treating it as one long string
- `movie_div` is the variable we use to store all of the `div` containers with a class of `lister-item mode-advanced`
- the `find_all()` method extracts all the `div` containers that have a class of `lister-item mode-advanced` from what we’ve stored in `soup`
Controlling the Crawl Rate
Controlling the crawl rate is beneficial for the scraper and for the website we’re scraping. If we avoid hammering the server with a lot of requests all at once, then we’re much less likely to get our IP address banned — and we also avoid disrupting the activity of the website we scrape by allowing the server to respond to other user requests as well.
We’ll be adding this code to our new `for` loop:
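The embedded snippet is missing; based on the breakdown that follows, it’s the two functions combined at the end of each loop iteration:

```python
from time import sleep
from random import randint

# Pause for a random 2-10 seconds before requesting the next page
sleep(randint(2, 10))
```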
Breaking crawl rate down:
- The `sleep()` function will control the loop’s rate by pausing the execution of the loop for a specified amount of time.
- The `randint(2, 10)` function will vary the waiting time between requests, picking a number between 2 and 10 seconds. You can change these parameters to any that you like.
Please note that this will delay the time it takes to grab all the data we need from every page, so be patient. There are 20 pages with a max of 10 seconds per loop, so at worst it’d take about 200 seconds, a little over 3 minutes, to get all of the data with this code.
It’s very important to practice good scraping and to scrape responsibly!
Our code should now look like this:
We can add our movie-scraping `for` loop code from the first article into our new page loop.
Pointing Out Previous Errors
I’d like to point out a slight error I made in the previous article: a mistake regarding the cleaning of the Metascore data.
I received this DM from an awesome dev who was running through my article and coding along but with a different IMDb URL than the one I used to teach in the guide.
In the `metascore` data extraction code, we wrote this:
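The embedded snippet is gone, but based on the description that follows (grab the score if present, otherwise put a dash), the logic was along these lines. This sketch runs against a small inline HTML sample rather than the live containers, and the sample markup is illustrative:

```python
from bs4 import BeautifulSoup

sample = BeautifulSoup(
    '<div><span class="metascore favorable">80</span></div>'
    '<div><p>no score on this one</p></div>', "html.parser")

metascores = []
for container in sample.find_all("div"):
    # Grab the Metascore if it's there; substitute a dash if it's missing
    m_score = container.find("span", class_="metascore")
    metascores.append(m_score.text.strip() if m_score else "-")
```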
This extraction code says if there is Metascore data there, grab it — but if the data is missing, then put a dash there and continue.
In the cleaning of the `metascore` data, we wrote this:
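For reference, that original cleaning line was a plain integer cast, reproduced here on sample data with no missing values, where it still works:

```python
import pandas as pd

# Works only when every value is a digit string (no dashes for missing data)
movies = pd.DataFrame({"metascore": ["80", "94", "74"]})
movies["metascore"] = movies["metascore"].astype(int)
```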
This cleaning code says to turn this pandas object into an integer data type, which worked for the URL I scraped because it didn’t have any missing Metascore data, e.g., no dashes in place of missing data.
What I failed to notice is that if someone scraped a different IMDb page than I did, they’d possibly have missing `metascore` data there, and once we scrape multiple pages in this guide, we’ll have missing `metascore` data as well.
What does this mean?
It means that when we do get those dashes in place of missing data, we can’t use `.astype(int)` to convert the entire `metascore` column into an integer like I previously did; this would produce an error. We’d need to turn our `metascore` data into a float data type (decimal) instead.
Fixing the Cleaning of the Metascore Data Code
Instead of this `metascore` data cleaning code:
We’ll use this:
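The embedded replacement code is missing; from the breakdown that follows, it’s the two-step extract-and-coerce pattern, sketched here on sample data. I’ve added `expand=False` so `str.extract` returns a Series rather than a one-column DataFrame, and written the pattern as a raw string; those two details are my additions:

```python
import pandas as pd

movies = pd.DataFrame({"metascore": ["80", "-", "74"]})

# Pull out the digits, then coerce the leftover dashes to NaN floats
movies["metascore"] = movies["metascore"].str.extract(r"(\d+)", expand=False)
movies["metascore"] = pd.to_numeric(movies["metascore"], errors="coerce")
```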
Breaking down the new cleaning of the Metascore data:
- `movies['metascore']` is our Metascore data in our `movies` DataFrame. We’ll be assigning our newly cleaned-up data back to this column.
- `.str.extract('(\d+)')` tells pandas to go to the column and extract all the digits in each string; the capture group `(\d+)` matches one or more digits.
- Once `movies['metascore']` is stripped of the elements we don’t need, we assign the conversion code to it to finish it up:
- `pd.to_numeric` is the method we use to change this column to a float. The reason we use it is that we have a lot of dashes in this column, and we can’t just convert them using `.astype(float)`; that would raise an error.
- `errors='coerce'` will transform the nonnumeric values, our dashes, into not-a-number (NaN) values, because we have dashes in place of the data that’s missing.
Add the DataFrame and Cleaning Code
Let’s add our DataFrame and cleaning code to our new scraper; it goes below our loops. If you have any questions regarding how this code works, go to the first article to see what each line executes.
The code should look like this:
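The embedded listing didn’t survive, so here’s a condensed sketch of the DataFrame and cleaning step on sample values. The `metascore` and `us_grossMillions` column names come from this article; the other column names, the sample data, the `timeMin` label, and the exact regexes are my assumptions:

```python
import pandas as pd

# Sample values standing in for the full scraped lists
titles = ["The Shawshank Redemption", "The Godfather"]
years = ["(1994)", "(1972)"]
time = ["142 min", "175 min"]
imdb_ratings = [9.3, 9.2]
metascores = ["80", "-"]
votes = ["2,345,678", "1,234,567"]
us_gross = ["$28.34M", "$134.97M"]

movies = pd.DataFrame({
    "movie": titles,
    "year": years,
    "timeMin": time,
    "imdb": imdb_ratings,
    "metascore": metascores,
    "votes": votes,
    "us_grossMillions": us_gross,
})

# Cleaning: strip parentheses, units, commas, and dollar signs,
# using the coerce-to-NaN pattern for columns with missing data
movies["year"] = movies["year"].str.extract(r"(\d+)", expand=False).astype(int)
movies["timeMin"] = movies["timeMin"].str.extract(r"(\d+)", expand=False).astype(int)
movies["metascore"] = pd.to_numeric(
    movies["metascore"].str.extract(r"(\d+)", expand=False), errors="coerce")
movies["votes"] = movies["votes"].str.replace(",", "", regex=False).astype(int)
movies["us_grossMillions"] = pd.to_numeric(
    movies["us_grossMillions"].str.extract(r"(\d+\.?\d*)", expand=False),
    errors="coerce")
```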
Save to CSV
We have all the elements of our scraper ready — now it’s time to save all the data we’re about to scrape into our CSV.
Below is the code you can add to the bottom of your program to save your data to a CSV file:
In case you need a refresher, if you’re in Repl.it, you can create an empty CSV file by hovering near “Files” and clicking the “Add file” option. Name it, and save it with a `.csv` extension. Then, add the code to the end of your program:
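The embedded line is gone; saving a DataFrame is a single `to_csv` call, sketched here on a tiny sample frame (the `movies.csv` filename is my assumption):

```python
import pandas as pd

movies = pd.DataFrame({"movie": ["The Godfather"], "year": [1972]})

# Write the finished DataFrame to a CSV file next to the program
movies.to_csv("movies.csv", index=False)
```

Passing `index=False` keeps pandas’ row numbers out of the file; drop it if you want them.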
If we run the program and save our `.csv`, we should get a file with a list of movies and all the data from 0-999:
Basic Data-Quality Best Practices (Optional)
Here, I’ll discuss some basic data-quality tricks you can use when cleaning your data. You don’t need to apply any of this to our final scraper.
Usually, a dataset with a lot of missing data isn’t a good dataset at all. Below are ways we can look up, manipulate, and change our data — for future reference.
One of the most common problems in a dataset is missing data. In our case, the data wasn’t available. There are a couple of ways to check and deal with missing data:
- Check where we’re missing data and how much is missing
- Add in a default value for the missing data
- Delete the rows that have missing data
- Delete the columns that have a high incidence of missing data
We’ll go through each of these in turn.
Check missing data:
We can easily check for missing data like this:
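The embedded check is missing; counting NaN values per column is a one-liner, sketched here on a tiny sample frame:

```python
import pandas as pd
import numpy as np

movies = pd.DataFrame({"metascore": [80.0, np.nan],
                       "us_grossMillions": [28.34, np.nan]})

# Count the NaN values in each column
missing = movies.isnull().sum()
```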
This shows us where the data is missing and how much data is missing. We have 165 missing values in `metascore` and 161 missing in `us_grossMillions`: a total of 326 missing values in our dataset.
Add default value for missing data:
If you wanted to change your NaN values to something else specific, you can do so like this:
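The embedded snippet is missing; pandas’ `fillna` does this, sketched here on sample data with the same replacement values the text describes:

```python
import pandas as pd
import numpy as np

movies = pd.DataFrame({"metascore": [80.0, np.nan],
                       "us_grossMillions": [28.34, np.nan]})

# Replace NaNs with sentinel values (this turns the columns into objects)
movies["metascore"] = movies["metascore"].fillna("None Given")
movies["us_grossMillions"] = movies["us_grossMillions"].fillna("")
```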
For this example, I want the words “None Given” in place of `metascore` NaN values, and empty quotes (nothing) in place of `us_grossMillions` NaN values.
If you print those columns, you can see our NaN values have been changed as specified:
Our `metascore` and `us_grossMillions` columns were both floats prior to this change, and you can see how they’re both objects now because of the change. Be careful when changing your data, and always check to see what your data types are when making any alterations.
Delete rows with missing data:
Sometimes the best route to take when having a lot of missing data is to just remove them altogether. We can do this a couple of different ways:
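The embedded examples are missing; the basic form uses `dropna`, sketched here on sample data (by default it drops any row containing at least one NaN, while `dropna(how="all")` would only drop rows where every value is missing):

```python
import pandas as pd
import numpy as np

movies = pd.DataFrame({"movie": ["A", "B"], "metascore": [80.0, np.nan]})

# Drop every row that contains at least one NaN
movies_clean = movies.dropna()
```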
Delete columns with missing data:
Sometimes when we have too many missing values in a column, it’s best to get rid of them. We can do so like this:
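The embedded snippet is missing; per the parameter breakdown that follows, it’s `dropna` operating on columns, sketched here on sample data:

```python
import pandas as pd
import numpy as np

movies = pd.DataFrame({"movie": ["A", "B"], "metascore": [np.nan, np.nan]})

# axis=1 operates on columns: drop any column containing NaN values
movies = movies.dropna(axis=1, how="any")
```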
- `axis=1` is the parameter we use; it means to operate on columns, not rows. `axis=0` means rows. We could’ve used this parameter in our delete-rows section, but the default is already `0`, so I didn’t use it.
- `how='any'` means that if any NaN values are present, drop that column.
The Final Code
There you have it! We’ve successfully extracted the data of the top 1,000 best movies of all time on IMDb, spanning multiple pages, and saved it into a CSV file.
I hope you enjoyed building a Python scraper. If you followed along, let me know how it went.