Automating Multiple Document Downloads from a Webpage using Beautiful Soup

Anna Mowat
Published in Level Up Coding
6 min read · Mar 30, 2021


One of my favourite uses of code is having it complete boring tasks for me. As a researcher, part of my role involves collecting information so that it can be analyzed by myself and others. Sometimes, googling for and downloading publications can be tedious and repetitive, so why don’t we build a “research assistant” to do that for us?

Our “research assistant” will pull all the publications on the Asian Development Bank’s Clean Energy Issues page and download them directly onto our computer for us. A full Google Colab notebook containing my code is available at the end of this article.

Some publications available for download on the website

Building our Research Assistant

We start by importing the two main libraries needed for scraping the website: Requests and Beautiful Soup.

Beautiful Soup is a great Python library to use if you want to scrape a website. It’s very effective at straining (like making soup) through the mess of information in HTML documents to get to the information you’re seeking. However, one task it is not capable of doing is getting the website information from the site of interest into your coding environment. This is where the Requests library comes in: it fetches the website’s HTML document and brings it into our code.

# Import the necessary libraries to webscrape the publications
import requests
from bs4 import BeautifulSoup as soup

Now that we have our libraries imported, we will define the URL we want to scrape. We can then feed this URL to the Requests library and ask it to pull the website’s information into our code. Once complete, we pass the content of the request to Beautiful Soup, which creates a more readable printout.

# First define the url of interest
url = "https://www.adb.org/sectors/energy/issues/clean-energy"

# Once you have set the url, we can now use the requests library to get the content of the url's html page
html_page = requests.get(url).content

# Now that we have the html page, we are going to use Beautiful Soup to put the information into a more readable format and then print it below. We call this a soup page.
soup_page = soup(html_page, "html")
soup_page

You can see from the printout of the soup page below that the information is still very messy and not yet in an easy-to-collect format. Fortunately, the Beautiful Soup library has very easy-to-use functionality that will allow us to sift through the data to get to what we are interested in: a list of all the publications on the website.

Snippet of our Soup Page

Reading through the soup page, notice how each separate piece of data is enclosed in angle-bracket tags (<>). We can use Beautiful Soup to search through each of these data bits to find the publication download links.

# First, notice that each <> that starts with "a" always contains text, while <>'s not containing "a"
# look more like commands telling the html page where a button or other design element should be.
# Let's use this to do our first filter.
soup_page.findAll("a")

This is better than before, but it still has a ton of information we aren’t interested in. Scroll through the printout to get to the lines that detail the publications. Each of these lines has “class = views-list-image” within it. Continue reading through the soup page and you’ll see that other data bits have “class =” within them as well, but with different class names. We can use this class name to filter further.

# Save the filtered information as an instance
links = soup_page.findAll("a", {"class": "views-list-image"})
links

With the list of publications, we can use Beautiful Soup to pull out attributes of interest. In our case, we are going to pull the weblink to the page that hosts the publication, and the title of the publication.

# Now that we have the links we can pull out the link to each of the pages where you can download the reports of interest
links[0].attrs["href"]

# Demonstrating using the same method to get the title of the first report
links[0].attrs["alt"]

Now that we understand how to get a publication’s weblink and title, we will use a short loop to retrieve and save the weblinks for each publication.

# A short loop to save the weblinks to each of the publications
pub_links = []
for link in links:
    pub_links.append("https://www.adb.org" + link.attrs["href"])
pub_links

You might have noticed a slight catch with the URLs we have now saved. If you click any of them you won’t be brought directly to the PDF of a report; instead, you will end up on the webpage that hosts the PDF download and other relevant report information.

We are going to use the same method we used above to find the weblinks to the PDFs we are searching for on each of the individual publication pages.

# Investigating what the soup page looks like --> This is always good practice to do with every new web/soup page
html_page = requests.get(pub_links[0]).content
soup_page = soup(html_page, "html")
soup_page

The Google Colab notebook attached at the bottom of this article shows a step-by-step approach to finding the right filter settings for the individual publication soup pages. In this article we are going to skip directly to the right filter.
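If you’d rather not work through that investigation yourself, one common alternative, shown here as my own sketch rather than the filter this article actually uses, is to grab every anchor tag on the publication soup page whose link ends in “.pdf”:

# A hypothetical alternative filter (not the one used below): collect every
# anchor tag on the publication soup page that has an href ending in ".pdf"
pdf_anchors = [a for a in soup_page.findAll("a", href=True)
               if a.attrs["href"].lower().endswith(".pdf")]
pdf_anchors

This can be a quick sanity check, but the rest of the article sticks with the title-based filter below, which also gives us a clean publication title to name each file with.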

# After the step-by-step investigation to figure out the right attributes to filter by, we now have the PDF link for the first publication
pub_link = soup_page.findAll("a", {"title": links[0].attrs["alt"]})[0]
pub_link

The printout of “pub_link” in the code snippet directly above should contain the PDF’s URL, title, src, image link, and image width. Using the same methods described earlier in this article, we will now save the titles and weblinks for each of our PDFs.

# Here's a short loop that goes to each publication page and saves (1) the PDF's URL and (2) the publication's title
pdfs = []
pdf_names = []
x = 0
for link in pub_links:
    html_page = requests.get(link).content
    soup_page = soup(html_page, "html")
    text = soup_page.findAll("a", {"title": links[x].attrs["alt"]})[0]
    pdfs.append(text.attrs["href"])
    pdf_names.append(text.attrs["title"])
    x = x + 1

Downloading the Publications

Since this is a Google Colab notebook, we are going to use a library from Google Colab called files. Jupyter Notebooks may need a different option (a short sketch of one follows the download code below).

The files library allows us to automate downloading the files we generated with our code.

# Import the necessary libraries
import urllib.request
from google.colab import files

# First pull the PDFs using the requests library
x = 0
for pdf in pdfs:
    response = requests.get(pdf)

    # Name the PDFs using the names we saved from the loop above
    with open(f'{pdf_names[x]}.pdf', 'wb') as f:
        f.write(response.content)
    print(f"{pdf_names[x]}")
    x = x + 1

# Download each of the named PDFs directly into your download folder
for name in pdf_names:
    files.download(f'/content/{name}.pdf')
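If you are following along in a plain Jupyter notebook rather than Colab, the files library above won’t be available. A minimal alternative, assuming the PDFs were written to the notebook’s working directory by the loop above, is to skip the Colab download step and render clickable links to the saved files instead:

# Jupyter alternative (sketch): the PDFs are already saved in the working
# directory, so display clickable links instead of using google.colab.files
from IPython.display import FileLink, display

for name in pdf_names:
    display(FileLink(f'{name}.pdf'))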

Once you run the code above, you should see the files begin to download onto your computer. The image below shows what I see in Google Colab during the file download.

Our clean energy reports downloading directly to our computer

Fantastic! Now we have our own downloading “research assistant”. The techniques used within our “research assistant” can be used on other webpages and other file downloads.
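If you want a starting point for adapting it, here is a rough, generic sketch that wraps the same steps into a single helper. The function name and the simple “.pdf” filter are my own choices (not from the original notebook) and will likely need adjusting for other sites; it reuses the requests and soup imports from earlier in the article.

from urllib.parse import urljoin

# A generic sketch of the same idea as a reusable helper (hypothetical name,
# simplified ".pdf" filter): fetch a page, find PDF links, download each one
def download_pdfs_from(page_url):
    page = soup(requests.get(page_url).content, "html")
    for a in page.findAll("a", href=True):
        href = a.attrs["href"]
        if href.lower().endswith(".pdf"):
            pdf_url = urljoin(page_url, href)   # handle relative links
            name = pdf_url.split("/")[-1]       # name the file after the URL
            with open(name, "wb") as f:
                f.write(requests.get(pdf_url).content)
            print(name)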

Hopefully you can use this in other projects!
