A Web Scraper for Internships

Kevin Guo
Published in Level Up Coding · Apr 18, 2021


If you’re a student applying to internships for the summer, it can be a chore: not just filling out applications, but finding them in the first place. New opportunities are posted every day from July to January, and applying as soon as possible maximizes your chances of getting a response, but checking for new postings by hand is a pain. With this in mind, I wrote something that I wish I had when applying for internships late last year: a web scraper that pulls from LinkedIn as well as GitHub repositories such as https://github.com/pittcsc/Summer2022-Internships.

The web scraper is written in Python and uses aiohttp and asyncio to make multiple HTTP requests simultaneously, and Beautiful Soup for the scraping itself. Unfortunately, LinkedIn’s rate limiting slows the process considerably and prevents us from using asyncio to its fullest potential, but it was a good opportunity to get experience writing asynchronous code in Python. Currently, output is written to a file; the text files within the project repository show example internship listings it found.
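The fan-out pattern this relies on can be sketched as follows. This is a minimal illustration of the asyncio side, not the project’s actual code: the network round trip is simulated with asyncio.sleep so the sketch is self-contained, and the real scraper would do the fetch through aiohttp (roughly `async with session.get(url) as resp: ...`).

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for the network round trip; the real project would use
    # an aiohttp ClientSession here instead of sleeping.
    await asyncio.sleep(0.01)
    return f"<html>results for {url}</html>"

async def fetch_all(urls: list[str]) -> list[str]:
    # asyncio.gather starts every fetch concurrently and returns the
    # results in the same order as the input URLs.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(fetch_all([
    "https://www.linkedin.com/jobs/search?keywords=software+engineer+intern",
    "https://www.linkedin.com/jobs/search?keywords=program+management+intern",
]))
```

Because all the coroutines are awaited together, total wall-clock time is roughly that of the slowest request rather than the sum of all of them — which is exactly what rate limiting then takes away.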

What output looks like

Users can specify multiple LinkedIn search strings, such as “software engineer intern” and “program management intern”, as well as multiple locations, such as “San Francisco” and “New York City”. They can also restrict the query to jobs posted in the last day, week, or month, and specify substrings that must (or must not) appear in the job title or description. Of course, these search options make the scraper just as usable for any type of full-time job. All configuration lives in a YAML config file in the project directory.

aiohttp example that makes requests simultaneously
What the config looks like
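The original screenshot isn’t reproduced here, but based on the options described above, the config might look something like this. All field names below are hypothetical sketches; the real config file’s keys may differ.

```yaml
# Hypothetical sketch of the scraper's YAML config
search_strings:
  - software engineer intern
  - program management intern
locations:
  - San Francisco
  - New York City
posted_within: week        # day | week | month
title_required:
  - intern
description_blacklist:
  - unpaid
```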

The good thing about scraping LinkedIn is that you don’t need authorization to browse jobs. The bad thing is that LinkedIn makes scraping difficult in several ways: rate limiting, as mentioned earlier; finicky location filtering; and pagination that blocks access to later pages through request parameters. When you load new jobs by scrolling down, you can see a request being made with the parameter “&start={starting index of the next job}”, but putting this into your own request gets it redirected to “&start=0”. (Grabbing additional results should still be viable with a web driver like Selenium, however.)

The difficulties with pagination mean a single search yields only 25 or so results, so the only way to collect a decent number is to run many narrowed-down searches by search string and location. Searching by location has its own roadblock: when you type a city such as “San Mateo” into the location field of the search bar, only about half the results are actually based in San Mateo; the rest come from nearby Bay Area locations. Setting the result radius to its lowest value, 10 miles, doesn’t really help either. The only way to actually narrow results down to a specific city is to add a parameter like “&f_PP=102571732” to the request, where 102571732 is LinkedIn’s ID for that city.
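Putting that together, building a city-pinned search URL might look like this. The “&f_PP={city id}” and “&start” parameters are the ones discussed above; the endpoint path and the other parameter names here are illustrative assumptions, not LinkedIn’s documented API.

```python
from urllib.parse import urlencode

def build_search_url(keywords: str, city_id: str, start: int = 0) -> str:
    # Pin the search to one city via LinkedIn's &f_PP=<city id> parameter.
    params = {
        "keywords": keywords,
        "f_PP": city_id,   # LinkedIn's internal ID for the city (e.g. San Mateo)
        "start": start,    # pagination index; LinkedIn tends to redirect this to 0
    }
    return "https://www.linkedin.com/jobs/search?" + urlencode(params)

url = build_search_url("software engineer intern", "102571732")
```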

Given the constraints, scraping LinkedIn is a multi-step process. First, we query location strings like “San Mateo” to get the IDs for San Mateo as well as nearby cities, keeping track of the unique city IDs we’ve seen. Then, we make a request for each combination of city ID and search string, and keep track of the unique job postings we find (by comparing both the company and the title of the position). Lastly, we make a request for each job posting to gain access to the full job description so we can filter on it (for example, ignoring all unpaid internships by adding “unpaid” to our blacklist for words within the description).
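The dedupe-and-filter step at the end can be sketched like this. Postings are keyed on (company, title) so the same job found through two different city IDs is kept only once, and any description containing a blacklisted word is dropped. The function name and argument shape here are hypothetical, not the project’s actual interface.

```python
# Tracks (company, title) pairs already emitted across all searches.
seen: set[tuple[str, str]] = set()

def keep_posting(company: str, title: str, description: str,
                 blacklist: list[str]) -> bool:
    key = (company.lower(), title.lower())
    if key in seen:
        return False  # duplicate found via another city ID or search string
    if any(word in description.lower() for word in blacklist):
        return False  # e.g. filter out postings whose description says "unpaid"
    seen.add(key)
    return True

kept = [
    keep_posting("Acme", "Software Engineer Intern", "Paid internship", ["unpaid"]),
    keep_posting("Acme", "Software Engineer Intern", "Paid internship", ["unpaid"]),
    keep_posting("Initech", "PM Intern", "This is an unpaid role", ["unpaid"]),
]
```

The first call succeeds, the second is rejected as a duplicate, and the third is rejected by the blacklist.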

I hope you found this useful or interesting. I’m personally planning to use this scraper when applying to internships and full-time opportunities for the upcoming year. Be sure to check out the project repo as well!
