Building a synonym searcher 🔍 in Rust with tokio, select, and reqwest

robert barlin · Published in Level Up Coding · Feb 1, 2021


Naming things is a notoriously hard task for a developer, and it’s a constant struggle. I often have two or more browser tabs open just to find the words I’m looking for. And then one day it hit me.

Wouldn’t it be nice to have a synonym searcher in the terminal?

With no time to waste, all other side projects were abandoned in an instant, and my path to a better-named world began.

Photo by Tolga Ulkan on Unsplash

Goal 🥅

It’s always good to begin by figuring out what we want to achieve.
In the spirit of trying to finish this, let’s try to keep the scope relatively small.
The command line tool doesn’t have to be more complex than this:

$ cargo run <SOME-WORD>
suggestion 1
suggestion 2
suggestion 3
...
suggestion 10

Finding the right words 💬

So where can we find the synonyms?

I usually google the word I want to find synonyms for and click my way through to some of the websites. I realized that I usually end up on these three websites: thesaurus.com, yourdictionary.com and merriam-webster.com.

A simple solution would be to scrape these sites and combine the results. But in order to find the page we want to scrape we need a reliable way of querying each site.

After some investigation, it turns out that each site uses the searched word as a path parameter in the URL. That’s perfect :)

To find synonyms for the word “car”, the URL for each site would be:
https://thesaurus.yourdictionary.com/car
https://www.thesaurus.com/browse/car
https://www.merriam-webster.com/thesaurus/car

Scrape the way 🛣️

In order to extract data from each web page, we first need a way of making HTTP requests to fetch the pages.
For this I’ll be using the HTTP crate reqwest.

After fetching a web page, we need some way of parsing and selecting pieces of data from the HTML. This can be done with the select crate.

I won’t go over how the select crate works, but I’ve written about the library before, so you can read about it here if you want.

Finding the key 🗝️

The next step is to figure out how to find the important pieces of data on each of the websites. This means figuring out the structure of the HTML so we can reliably traverse our way through and find the same elements and data every time.

Here is some selector syntax showing one possible solution for each web page.

Merriam Webster: .syn-list .mw-list > li > a
Thesaurus: [id="meanings"] li
Your Dictionary: .synonym-link

Go fetch 🐶

Alright, let’s get this keyboard bashing journey on its way!

The first obstacle to overcome is fetching the content from each website. How do we do this with reqwest?

The reqwest crate is split into two parts: async and blocking. Even though we’ve promised to do this with async in Tokio, we’ll start out with the blocking client and introduce async later on.

The code for fetching a website and extracting its body is pretty simple:

let body = reqwest::blocking::get("http://some.url")?.text()?;

You can read more about the get function here.

In the select crate-world you create a Document which represents the HTML you want to search and extract from. This is true, independent of which website we will be working with. So we could start with creating a function for fetching a website and transforming it into a Document.
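A minimal sketch of what that function could look like, assuming the name fetch_website (error handling stays deliberately naive for now):

use select::document::Document;

// Fetch a web page with the blocking client and parse the body into a
// select Document. Errors bubble up via the ? operator.
fn fetch_website(url: &str) -> Result<Document, Box<dyn std::error::Error>> {
    let body = reqwest::blocking::get(url)?.text()?;
    Ok(Document::from(body.as_str()))
}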

Being picky ⛏️

We’ve already figured out where to look for our data in each web page, now we just need to jot down the appropriate select code. Let’s create a separate function for each website.

The function will take a &str and return a Vec<String>.

Here are the three implementations:
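Something along these lines (a sketch: the exact function names are my own, the URLs and selectors come from the earlier sections, fetch_website is the sketch above, and node_text is the helper described below):

use select::predicate::{Attr, Class, Name, Predicate};

// Scrape thesaurus.com: every li under the element with id "meanings".
fn fetch_thesaurus(word: &str) -> Vec<String> {
    let url = format!("https://www.thesaurus.com/browse/{}", word);
    fetch_website(&url)
        .map(|doc| {
            doc.find(Attr("id", "meanings").descendant(Name("li")))
                .map(node_text)
                .collect()
        })
        .unwrap_or_default()
}

// Scrape yourdictionary.com: every element with the class "synonym-link".
fn fetch_your_dictionary(word: &str) -> Vec<String> {
    let url = format!("https://thesaurus.yourdictionary.com/{}", word);
    fetch_website(&url)
        .map(|doc| doc.find(Class("synonym-link")).map(node_text).collect())
        .unwrap_or_default()
}

// Scrape merriam-webster.com: .syn-list .mw-list > li > a.
fn fetch_merriam_webster(word: &str) -> Vec<String> {
    let url = format!("https://www.merriam-webster.com/thesaurus/{}", word);
    fetch_website(&url)
        .map(|doc| {
            doc.find(
                Class("syn-list")
                    .descendant(Class("mw-list"))
                    .child(Name("li"))
                    .child(Name("a")),
            )
            .map(node_text)
            .collect()
        })
        .unwrap_or_default()
}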

All three functions are using this little helper function for extracting out the text from a select Node object.
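A minimal version of that helper might look like this:

use select::node::Node;

// Extract the text content of a Node and trim surrounding whitespace.
fn node_text(node: Node) -> String {
    node.text().trim().to_string()
}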

Chain chain chain ⛓️

With these three separate web-scraping functions we end up with three lists filled with Strings. A first step would be to try and combine all the lists into one big list.

Converting our Vec<String> values into iterators enables us to use the .chain method. Simple as that :)
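As a sketch, with res_1, res_2 and res_3 being the three Vec<String> results:

// Glue the three lists together into one combined list.
let combined: Vec<String> = res_1
    .into_iter()
    .chain(res_2)
    .chain(res_3)
    .collect();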

The new order 🎖️

If we were to use our combined result as is, we would notice that the ordering of the synonyms would be a bit messy.

The websites we are using seem to display the most relevant synonyms at the top of the list. Likewise, the words that are more far-fetched end up at the bottom of the list.

Our current solution will display all words from res_1, ordered from most relevant to least relevant, then the ordered results from res_2, and so on.

We need a way of keeping track of the order the synonyms were found in.

Since we are using iterators in the fetching functions, we can make use of the .enumerate method. It would be a small incision that gives us what we’re after. The method changes our end result from Vec<String> into Vec<(usize, String)>.

Here is an example:
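This is the yourdictionary fetch function from the sketch above, reworked with .enumerate():

// Tag each synonym with its position in the source's list.
fn fetch_your_dictionary(word: &str) -> Vec<(usize, String)> {
    let url = format!("https://thesaurus.yourdictionary.com/{}", word);
    fetch_website(&url)
        .map(|doc| {
            doc.find(Class("synonym-link"))
                .map(node_text)
                .enumerate()
                .collect()
        })
        .unwrap_or_default()
}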

With .enumerate we now get a result like this for the word “hello”:

(0, greetings)
(1, hi)
(2, howdy)
(3, ...)
...

Now we can sort on the enumerated position, where the lowest value equals the best match: result.sort_by(|(a, _), (b, _)| a.cmp(b));

Double vision 🐫

We can now combine our results and sort our combined list, but another problem arises…

What about duplicates?
A scenario that will most likely occur is that the three different fetch results include the same words. Sometimes the words are in the same enumerated position, and sometimes not.

Source 1
(0, Wonderful)
(1, Perfect)
Source 2
(0, Brilliant)
(1, Wonderful)
Source 3
(0, Wonderful)
(1, Nice)

The word “Wonderful” in this case is ranked first according to source 1 and 3, but comes in at second place according to source 2.

There are probably some smart solutions to solve this, but let’s keep it simple :)

The first solution that sprang to my mind was to group each word and sum their positions. In this example, “Wonderful” would get a score of 1 (0 + 1 + 0).

Let’s use a HashMap<String, usize> for grouping the calculation.
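A sketch of that grouping step (the function name rank is my own):

use std::collections::HashMap;

// Sum each word's positions across all sources, then sort so that the
// lowest total score (most relevant) comes first.
fn rank(synonyms: Vec<(usize, String)>) -> Vec<String> {
    let mut scores: HashMap<String, usize> = HashMap::new();
    for (position, word) in synonyms {
        *scores.entry(word).or_insert(0) += position;
    }
    let mut ranked: Vec<(String, usize)> = scores.into_iter().collect();
    ranked.sort_by(|(_, a), (_, b)| a.cmp(b));
    ranked.into_iter().map(|(word, _)| word).collect()
}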

Multitasking ⚙️

We are almost done. It’s time to drop in the long awaited super-hero Tokio. Tokio is an async runtime which allows us to easily express and run things concurrently.

An issue with our code right now is that we’re making one HTTP request after the other. These requests could be run at the same time, with the help of Tokio and the async get function from reqwest.

The nice part is that it won’t require too much work from us to change it.
Here is a list of things that will happen:

  • Make our main function async with the #[tokio::main] macro.
  • Convert our three fetch_* functions to async.
  • Convert our fetch_website function to async and use the async get from reqwest.
  • Call each fetch_* function and wait for all to complete using Tokio’s join! macro.

… and here are some code examples:

Main function signature:
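A minimal sketch:

// tokio's attribute macro turns main into an async entry point.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // fetching, ranking and printing go here...
    Ok(())
}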

Fetch document function:
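The same fetch_website as in the earlier sketch, but using reqwest’s async get:

use select::document::Document;

// Fetch a page asynchronously and parse the body into a Document.
async fn fetch_website(url: &str) -> Result<Document, Box<dyn std::error::Error>> {
    let body = reqwest::get(url).await?.text().await?;
    Ok(Document::from(body.as_str()))
}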

Fetch for one of the synonym sites:
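For example, the yourdictionary function from the sketches above:

// Same scraping logic as before; only the signature and the .await are new.
async fn fetch_your_dictionary(word: &str) -> Vec<(usize, String)> {
    let url = format!("https://thesaurus.yourdictionary.com/{}", word);
    fetch_website(&url)
        .await
        .map(|doc| {
            doc.find(Class("synonym-link"))
                .map(node_text)
                .enumerate()
                .collect()
        })
        .unwrap_or_default()
}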

Using Tokio’s join! macro for waiting for all async functions to complete:
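Something like this, assuming word holds the command-line argument:

// Kick off all three requests concurrently and wait for every result.
let (res_1, res_2, res_3) = tokio::join!(
    fetch_thesaurus(&word),
    fetch_your_dictionary(&word),
    fetch_merriam_webster(&word),
);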

Polish ✨

The list of combined results can end up being quite long, so let’s limit the output by taking the top 10 synonyms. And finally, output the result.
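Continuing from the rank sketch above, with combined being the chained Vec<(usize, String)> from all three sources:

// Print the ten best-ranked synonyms.
for synonym in rank(combined).into_iter().take(10) {
    println!("{}", synonym);
}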

It’s a wrap 🌯

Running cargo run improve will now give you a list of synonyms for “improve” :)

When testing it out, I realized that the results sometimes feel a bit off. So some improvements can most certainly be made, but hey, it would be kind of boring if everything was this easy!

There are a lot of unhandled errors here, but for the sake of this post not going on and on, I’ll leave it up to you or a future post to dig into that.

Here are links to versions of the synchronous and asynchronous examples:

Thanks for following along in my evening session of Rust-ing around.

/Robert
