The Art of Webscraping

If data is the new oil, then webscraping is the new fracking

Timo Kats
Level Up Coding

Photo by Raimond Klavins on Unsplash

Introduction

Roughly 15 years ago, Clive Humby coined the phrase “data is the new oil”. Michael Palmer later elaborated on this phrase, stating that, just like oil, data is only useful when processed correctly. Moreover, the commonalities between oil and data don’t stop there, because just like oil, data has a supply chain.

This supply chain starts with the acquisition of data from a given source. By far the largest source to acquire data from is the internet. Recent calculations estimate that the internet had a total size of 40 zettabytes in 2020. That’s a lot of zeros!

To unlock the potential of this enormous amount of data, we need to acquire it and format it in a way that’s usable for further analysis. This process is commonly referred to as webscraping. In this article we’ll explore this process by scraping and formatting an article I wrote a while back called “Do You Need to Know Math for Machine Learning”.

Getting all the required tools

There are many ways to collect data from websites, but in this article we’ll use Python. Python is a very popular programming language with many libraries that are useful for webscraping. The libraries used in this article are BeautifulSoup, requests, json and re.

The only library that’s probably not included in your Python environment by default is BeautifulSoup. If you want to add this library to your environment, just use the Python package installer (aka pip). If you use PyCharm (or any other IDE), then please follow its package installation guidelines.

pip install beautifulsoup4

After installing the packages we can start coding the scraper. The only real requirement here is the URL of the page we want to scrape data from (to reiterate: my article mentioned earlier). You can imagine that for large webscrapers it’s possible to automate the collection of URLs in order to scrape a multitude of pages on a certain domain, but we’ll stick to one page for this example.

from bs4 import BeautifulSoup
import requests, json, re

# Download the article and parse its HTML
url = "https://medium.com/analytics-vidhya/do-you-need-to-know-math-for-machine-learning-d51f0206f7e4"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

Acquiring the data

After installing and setting up your environment you can start scraping the URL. What’s important to keep in mind is that websites are essentially huge chunks of text that a browser translates into a webpage. To get an idea of what this text looks like, just press F12 on this article. That’s a lot of jargon!
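You can also get a feel for this from Python itself. As a small example, printing the first few hundred characters of the page we parsed earlier shows the same wall of markup:

# Peek at the raw page source we downloaded earlier
print(str(soup)[:500])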

The one thing that’s important to keep in mind when navigating all this text is that there is no golden rule for finding what you’re looking for. Websites store their data differently, and if you want to scrape them you’ll have to adapt to that.

This is also the case for Medium articles. After a while of tinkering and searching I found that Medium stores its metadata in a JSON string near the top of the page source, between the HTML tags <script data-rh="true" type="application/ld+json"> and </script>.

This is a good opportunity for webscraping, and it’s important to recognize patterns like this when you try this process out for yourself. In order to separate this JSON string from the rest of the code we can simply cut it out using string indices, as shown in the following code snippet.

def get_json(soup):
    # Cut the JSON string out of the page source using string indices
    start_tag = '<script data-rh="true" type="application/ld+json">'
    index_1 = soup.find(start_tag)
    index_2 = soup.find('</script>', index_1)
    json_string = soup[index_1 + len(start_tag):index_2]
    return json.loads(json_string)

soup = str(soup)  # required to use string functions
metadata = get_json(soup)
for item, value in metadata.items():
    print(item, value, sep=' = ', end='\n\n')
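As a side note, BeautifulSoup can also locate that script tag for us before we convert the soup to a string, which avoids the manual string slicing. A minimal alternative sketch (assuming the page contains exactly one ld+json script tag) could look like this:

# Alternative: find the script tag directly on the parsed soup object
script_tag = soup.find('script', attrs={'type': 'application/ld+json'})
metadata = json.loads(script_tag.string)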

The print loop above outputs all the metadata that Medium provides in this JSON (which is a lot!).

@context = http://schema.org

@type = NewsArticle

image = ['https://miro.medium.com/max/1200/1*IHv0J-i2WvawIxU9MEAe8Q.png']

url = https://medium.com/analytics-vidhya/do-you-need-to-know-math-for-machine-learning-d51f0206f7e4

dateCreated = 2021-04-06T15:25:58.210Z

datePublished = 2021-04-06T15:25:58.210Z

dateModified = 2021-04-06T18:36:25.041Z

headline = Do You Need to Know Math for Machine Learning? - Analytics Vidhya - Medium

name = Do You Need to Know Math for Machine Learning? - Analytics Vidhya - Medium

description = Machine learning has become a popular field in the tech industry. Nowadays almost the exclusive majority of computer science related studies have a machine learning course in their curriculum. Most…

identifier = d51f0206f7e4

keywords = ['Lite:true', 'Tag:Machine Learning', 'Tag:Computer Science', 'Tag:Programming', 'Tag:Python', 'Tag:Math', 'Publication:analytics-vidhya', 'Elevated:false', 'LockedPostSource:LOCKED_POST_SOURCE_UGC', 'LayerCake:3']

author = {'@type': 'Person', 'name': 'Timo Kats', 'url': 'https://timokats.medium.com'}

creator = ['Timo Kats']

publisher = {'@type': 'Organization', 'name': 'Analytics Vidhya', 'url': 'https://medium.com/analytics-vidhya', 'logo': {'@type': 'ImageObject', 'width': 208, 'height': 60, 'url': 'https://miro.medium.com/max/416/1*66g0UGKgu4oopIC0ahQuXw.png'}}

mainEntityOfPage = https://medium.com/analytics-vidhya/do-you-need-to-know-math-for-machine-learning-d51f0206f7e4

isAccessibleForFree = False

Formatting the data

Now that the data is acquired we can start formatting it. Since the data was extracted from a JSON string this isn’t strictly necessary, because JSON is already a pretty good format. However, for this example we’ll re-format the data to a csv file with the following fields: identifier, datePublished, name, creator and publisher. In order to do this we first need to select those fields from the JSON and then write them to a csv file. Let’s start with the first part.

When looking at the fields in the collected JSON string (see the output above), it’s apparent that some fields have multiple values whilst others are singular. As a result we need to do some tailor-made searching to get the values we want. Thereafter these values are stored in a dictionary. This last part isn’t mandatory, but it does make it a lot easier to write the data to a csv file.

selected_fields = ['identifier', 'datePublished', 'name', 'creator', 'publisher']
data = {}
for field in selected_fields:
    if field == 'creator':
        # 'creator' is a list, so take the first (and only) author
        data[field] = metadata[field][0]
    elif field == 'publisher':
        # 'publisher' is a nested object, so take just its name
        data[field] = metadata[field]['name']
    else:
        data[field] = metadata[field]

Now that the data is loaded in a dictionary we can finally export it to a csv file.

# Write the header row followed by the data row
output = open('medium_article.csv', 'a', encoding='utf-8')
for key in data.keys():
    output.write(key + ',')
output.write('\n')
for value in data.values():
    output.write(value + ',')
output.close()
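As a side note, Python’s standard csv module can take care of the delimiters and quoting for us, which becomes useful if a field ever contains a comma itself. A minimal sketch using csv.DictWriter, writing to the same file, could look like this:

import csv

# Let the csv module handle quoting and delimiters
with open('medium_article.csv', 'w', newline='', encoding='utf-8') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=selected_fields)
    writer.writeheader()
    writer.writerow(data)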

Finally, we have a csv file with the fields mentioned earlier! Because there’s only one row in the data this might seem a bit underwhelming, but imagine doing this for all of the articles in a given publication. That could become a really good data source for many data science projects relating to Medium articles!

Screenshot of the created csv file in Excel
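To give an impression of what that could look like, the snippets above can be combined into one loop that collects a row per article. This is only a rough sketch: the list of article URLs is hypothetical and would have to be gathered separately (for example from a publication’s archive page).

# Hypothetical list of article URLs (these would have to be collected separately)
article_urls = []

rows = []
for article_url in article_urls:
    page = requests.get(article_url)
    soup = str(BeautifulSoup(page.content, 'html.parser'))
    metadata = get_json(soup)
    # Keep only the selected fields, re-using the logic from the snippets above
    rows.append({
        'identifier': metadata['identifier'],
        'datePublished': metadata['datePublished'],
        'name': metadata['name'],
        'creator': metadata['creator'][0],
        'publisher': metadata['publisher']['name'],
    })

Each dictionary in rows can then be written to the csv file exactly as shown above.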

Final words

This article hopefully showed how easy and useful webscraping can be. If you want to use the code provided in this article to work on your own webscraper, then feel free to do so; a complete version of the code used in this article is given below. Thank you for reading this article!
