How to Create a Threaded Web Scanner in Python

A simple project that can be completed in about 100 lines of code.

Kevin Dawe
Level Up Coding

--

The great thing about Python is that it makes developers’ lives easy — import a couple of libraries to do the hard stuff for you, and you’re off to the races. This holds true when creating a threaded web scanner that’s capable of making multiple concurrent requests — with Python it’s easy to accomplish in a short amount of dev time.


In this post, I’ll explain how to create a threaded web scanner in Python that uses urllib3 — a powerful thread-safe HTTP client that can be installed via pip. Here I’ll primarily focus on the library’s usage plus how to implement threading. The mundane aspects like argument parsing and IO can be seen in the completed script, which I’ll share at the bottom. All in all, this can be done in about 100 lines, making it a quick project to complete.

The script will take a single host or a list of hosts as an argument, plus a list of paths to search for, and then output results where matches were found based on targeted HTTP status codes. This makes it easy to see what files and folders are present (HTTP 200) or missing (HTTP 404) on a given site. This can be helpful for detecting broken links and missing resources after migrating a website, or as part of routine maintenance.

Web scanning also has security applications: the scanner can be used to detect resources that shouldn’t be accessible to the public, perform application fingerprinting, or even be adapted to detect things like SQL injection by making repeated requests with crafted arguments to an endpoint until an HTTP 500 is thrown, indicating a successful injection by triggering an error.

Basically, if you want an automated and efficient way of performing operations against a web server, this is where you start.

Managing Connections

In urllib3, connections to a single host are managed by a ConnectionPool. Multiple pools are managed by a PoolManager. These higher-level abstractions let you provide kwargs that get passed to the lower levels, making the whole stack easy to instantiate from the PoolManager level. Parameters to control concurrency, timeouts, headers, proxy configuration, etc. can all be done in one spot. Refer to the docs if you want to dig into the nitty-gritty of what can be configured or for an explanation of what a specific parameter does.

I’m going to use a const called THREADS to specify the maximum number of pools (num_pools) and simultaneous connections per pool (maxsize) that can be active at once. I’m also setting block=True, which causes these limits to be enforced globally in a threaded scenario. Doing this effectively defines the maximum number of concurrent connections that can be active at once. This will let our scanner be efficient regardless of whether it’s making one request to ten hosts, ten requests to one host, or large multiples of both (without causing a flood).

Let’s see what we’ve discussed so far in code:
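Here’s a minimal sketch of that configuration (the THREADS value and the timeouts are assumptions; adjust them to your needs):

import urllib3

THREADS = 10  # max number of pools and connections per pool; the value here is an assumption

# One PoolManager handles connections to every host; block=True enforces the limits across threads
http = urllib3.PoolManager(
    num_pools=THREADS,
    maxsize=THREADS,
    block=True,
    timeout=urllib3.Timeout(connect=5.0, read=10.0),  # assumed timeout values
    retries=urllib3.Retry(1, redirect=False),  # one retry per connection, don't follow redirects
)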

We now have configured our http PoolManager such that it will self-manage concurrent connections to multiple hosts, and we’ve also set the timeout, allowed for one retry per attempted connection, and specified that we don’t want it to follow redirects. We’ll initiate all our connections through this pool going forward.

Making Requests

Making a request is fairly straightforward. I’m going to put the logic in a simple function that takes a URL and returns a tuple of the URL and resulting status code. We’ll be calling this function when threading:
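Here’s a minimal sketch of such a function, assuming the http PoolManager from the previous section (the exact output format is up to you):

import functools

# Assign flush=True to every print() call so progress shows up live from the threads
print = functools.partial(print, flush=True)

def request(url):
    # Request a single URL and return (url, status); status is None when the request fails
    try:
        response = http.request("GET", url)
        print(url, response.status)
        return (url, response.status)
    except Exception as error:  # deliberately broad; see the notes below
        print(url, "ERROR:", error)
        return (url, None)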

First, I’m being very liberal with the exception handling. The request can fail because the host is down, because of TLS errors*, timeouts, etc. Feel free to handle the many specific exception types that urllib3 provides. You can also take a more generic approach that still produces useful output about what went wrong. It may also make sense to check that all the targeted hosts are up before starting a scan, so that unreachable hosts are omitted. You could also abandon scanning a host after a certain number of errors — just make sure you implement this in a thread-safe way.

Second, notice that I’m using functools to assign the flush=True argument to all my print() calls in the script. This is to effectively turn off output buffering so that we see the live progress of our scan as it happens given the threaded environment.

All in all, a simple function — use the http PoolManager to handle our connections, catch some errors, then return the results. Now it’s time to run it in parallel.

*If you want to ignore certificate errors in particular, look at this, urllib3.disable_warnings(), and take note that you’ll likely have to split HTTP and HTTPS connections into two different pools due to how urllib3 deals with kwargs.
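As a rough sketch of that approach (the separate https pool and its settings here are assumptions):

import urllib3

# Silence the InsecureRequestWarning emitted for unverified HTTPS requests
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# A second pool just for HTTPS hosts, with certificate verification turned off
https = urllib3.PoolManager(
    num_pools=THREADS,
    maxsize=THREADS,
    block=True,
    cert_reqs="CERT_NONE",  # ignore certificate errors
)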

Threading

Let’s jump straight into the code:
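A minimal sketch, assuming urls is the flattened list of host-plus-path URLs to request:

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=THREADS)
results = executor.map(request, urls)  # schedules request() for every URL in the list
executor.shutdown()  # wait here until all the worker threads have finished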

I love Python. Isn’t that easy? A single import, three lines of code, and we’re running our request() function from above in parallel. Thankfully, urllib3 is smart enough to throttle and manage the web requests as they come in from our threads, based on how we initialized it. Notice that the THREADS const from above is used again here to set the worker count. The executor.map call effectively invokes request() for each entry in the urls list and produces a generator that yields the results in call order. The last line causes our script to wait until all the threads have completed before proceeding.

It should be noted that, if trying to abort the scan early via ctrl+c, the script will effectively hang until completion and/or you kill the related process. This can be fixed, but the tl;dr of it is that it’s kind of awkward and will involve some refactoring — I chose not to bother.

Putting it to use

First, how can you tell if this is working properly, i.e. that it’s actually running THREADS connections in parallel? What I did was spin up a simple PHP server with a sleep.php file that sleeps for 5 seconds and then returns. Given the print() debugging statements in request(), it’s easy to track the progress as the scan runs, tweak some variables, and see what’s happening in real time. You can simulate multiple hosts by creating local DNS entries for example0.com, example1.com… example9.com that all point to a localhost server, so you don’t end up flooding a live site with debugging traffic. You can also test threading against a single host by requesting the same sleep file multiple times.

Here’s a (truncated) example of the completed script scanning 10 “hosts”, each for two files: one that doesn’t exist, and one that sleeps for 5 seconds. It’s set to report only the found files (i.e. show HTTP 200 and omit HTTP 404):

$ time python pywebscan.py hosts.txt paths.txt
Scanning 10 host(s) for 2 path(s) - 20 requests total...
------ REQUESTS ------
http://example0.com/pywebscan-test/does_not_exist.txt 404
http://example2.com/pywebscan-test/does_not_exist.txt 404
...
http://example0.com/pywebscan-test/sleep.php 200
http://example3.com/pywebscan-test/sleep.php 200
...
------ RESULTS ------
http://example0.com/pywebscan-test/
---
http://example0.com/pywebscan-test/sleep.php 200
...
http://example9.com/pywebscan-test/
---
http://example9.com/pywebscan-test/sleep.php 200
------ SCAN COMPLETE ------
real 0m5.193s
user 0m0.000s
sys 0m0.000s

The whole thing completed in just over 5 seconds, and you’ll notice the request responses weren’t sequential based on hostname, so threading is working. Due to the sleeps, if done in serial, it would have taken just over 50 seconds. Setting our THREADS const to 1 confirms this:

$ time python pywebscan.py hosts.txt paths.txt
Scanning 10 host(s) for 2 path(s) - 20 requests total...
...
------ SCAN COMPLETE ------
real 0m50.309s
user 0m0.000s
sys 0m0.015s

Concept proven. Feel free to test other scenarios (like a large volume and various combinations of requests) to make sure everything is kosher in your implementation.

The completed script

It can be found here. It’s bare-bones, and only takes two arguments: a host or host file, plus a paths file. Parameters such as thread count, timeout, etc. are just hardcoded into the script, and are easy to turn into CLI arguments as needed. There’s some logic to handle the host and path parsing and output formatting, and not much else that wasn’t covered. Lean, and easy to adapt.

Next steps and advanced usage

I’ve mentioned several improvements so far that could be made: better error handling, graceful abort behaviour, and more customization via arguments. Here are some other points to consider:

  1. Handling and following redirects — I’ve turned them off to keep things simple. A lot of SPAs may use .htaccess to pipe all “not found” requests into a single entry point, which can cause some strange-looking results when redirects are followed during a scan. Redirects may also be in place to send HTTP traffic to HTTPS, and there are multiple redirect and rewrite types that can be in place. Keep this in mind for your use case.
  2. Python makes it fairly easy to do reverse DNS lookups, i.e. converting an IP into a hostname (see the sketch after this list). This may be useful if you’re working from an IP list, especially if virtual hosts and/or HTTPS are involved.
  3. The request() function gets the whole response, not just the status code. You can parse it, search the content, extract links, etc.
  4. You can configure urllib3 to use a proxy, and also specify the user agent if desired.
  5. You can tune the THREADS parameter to increase the number of concurrent connections — just be mindful of socket/resource use when doing so — and try not to flood too many connections to a single host at once.
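For point 2, here’s a minimal reverse DNS sketch using the standard library (the helper name is hypothetical):

import socket

def reverse_dns(ip):
    # Return the primary hostname for an IP address, or None if the lookup fails
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None

print(reverse_dns("8.8.8.8"))  # e.g. dns.google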

Wrapping up

There you have it — a threaded Python web scanner in about 100 lines of code, most of which are just for handling execution flow. It’s versatile and easy to extend for whatever your use case is. We’ve seen that urllib3 is very powerful yet easy to use, and that Python’s ThreadPoolExecutor makes threading a breeze. If you’re interested in parallelism in Python, I recommend reading this post which breaks down the theory and different approaches at a high level.

And if you liked this writeup, I recently wrote a similar article about how to create a DIY web scraper to crawl and extract information from a website that you may find useful as well.

--

Cybersecurity Specialist & passionate techie currently living in and loving London, Canada. More info at kevindawe.ca