Efficient Text Scanning: How to Quickly Process Text

An overview of text scanning

Matthew Roever
Level Up Coding

--

Timelapse of highway through the mountains.
Photo by Edgar Chaparro on Unsplash

Scanners are specialized tools that quickly identify patterns in text. They are a fundamental component in compilers, JSON parsers, and text processors. The scanner I am presenting is designed to process source code for a compiler, but the code and topics I will cover can be easily adapted for other uses.

How do Scanners Work?

Scanners use a series of patterns to produce lexemes. Each lexeme is a contiguous substring of the input text. The lexemes are then selectively paired with a tag corresponding to the pattern that matched them. A lexeme-tag pair is called a token. There may or may not be a token for each lexeme, depending on which patterns are relevant. More on this later.

Scanners operate by finding a pattern that accepts the current character. The scanner will then process the pattern until the text no longer matches. At this point, a lexeme has been identified. A list then records the location of each lexeme, and a token is created for patterns that are relevant to further processing. The scanner will then repeat the process of pattern finding. This continues until all text has been scanned.

Patterns

A pattern defines an acceptable sequence of characters. A pattern can look for an explicit order of characters: int is i + n + t. It can also look for a general sequence, as a regular expression does. Figure 1 shows the pattern for numbers in JSON.

JSON number format from json.org [https://www.json.org/json-en.html]
Figure 1: Pattern for numbers in JSON — by json.org

Regular Expressions and Finite Automata

A regular expression defines a pattern using a set of rules stored in a string. The regular expression "[0-9]" will match any single digit from 0 to 9.

However, regular expressions are not commonly used when building scanners, because realistic patterns grow complicated quickly. Take the JSON number example in Figure 1: producing a regular expression for it is possible, but it is certainly not easy. Regular expressions also carry extra runtime overhead.
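To make the comparison concrete, here is a sketch of what a regular-expression approach to the JSON number pattern could look like, using C++'s `std::regex`. The pattern string is my own approximation of the Figure 1 grammar, not code from the article's scanner:

```cpp
#include <regex>
#include <string>

// Illustrative sketch: matching a JSON-style number with a regular
// expression. The pattern approximates the grammar in Figure 1:
// optional sign, integer part, optional fraction, optional exponent.
bool is_json_number(const std::string& s) {
    static const std::regex pattern(
        R"(-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?)");
    return std::regex_match(s, pattern);
}
```

Even this fairly small pattern takes effort to read and verify, and `std::regex` compiles and interprets the pattern at runtime, which is the overhead a hand-coded DFA avoids.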

Finite automata are used to overcome the complexity and time penalty of regular expressions. There are multiple types of finite automata. The form most relevant to scanners is the deterministic finite automaton (DFA). A DFA is usually a hand-coded function customized for each pattern. A DFA produces the same results as the equivalent regular expression, but it is faster and clearer.

The DFA searches for acceptance states in the text. In the JSON example, the acceptance states are number, fraction, and exponent. When the DFA finds an acceptance state it will record its current position. It then tries to progress to the next acceptance state. Once it can progress no further, the DFA returns the last accepted state and performs any necessary rollbacks.

Processing Numbers

First, let’s cover how to produce a deterministic finite automaton for numbers. We will use the pattern shown in Figure 1 as our guide. A number can come in several forms. It can be an integer — 123. It can be a decimal — 4.2. Or it can be an exponent — 0.987e6.

The first step is to observe repetitions in the pattern. We can see that a decimal or exponent will contain integers. A decimal will be an integer.integer. An exponent will be an integer.integer(e|E)[+|-]integer. With this knowledge, we can create a tiered pattern.

The pattern initially checks to see if a number is the lead character. If so, a pattern match has been found, otherwise the scanner must continue to check for patterns. In this scanner, differences between the input and returned index indicate whether a pattern was found.

Once it is known that a number is present, the integer pattern is invoked. An integer can consist of digits from 0 to 9. Additionally, underscores are permitted within the digits to act as separators. For example, a large number could be written as 1_000_000. Unlike the JSON pattern, this one does not check for signs at the front of the number.

A decimal can follow an integer, and an exponent can follow a decimal. If the character following the integer is a period, the number is a decimal. Following the period could be another integer or an exponent.

When checking for an exponent, it is critical to ensure the pattern is complete. The number 12.3e- is incomplete. When this occurs, the pattern must roll back: it accepts the greatest number of characters that match an acceptance state in the pattern, in this case 12.3. If you know that no other acceptable pattern could follow, the scanner could report an error. For compilers, it is important to process as much input as possible before stopping; this way, more errors can be identified in a single pass.
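The tiered number pattern above can be sketched as a pair of functions. This is my own minimal sketch of the idea, not the article's actual implementation; names and structure are illustrative, and the return convention matches the one described earlier (the returned index equals the input index when there is no match):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Consumes digits 0-9, with underscores permitted as separators
// (e.g. 1_000_000). Returns the index one past the last consumed character.
std::size_t scan_integer(const std::string& text, std::size_t i) {
    while (i < text.size() &&
           (std::isdigit((unsigned char)text[i]) || text[i] == '_'))
        ++i;
    return i;
}

// Number DFA: integer, optional fraction, optional exponent.
// Returns `start` unchanged when no number begins there.
std::size_t scan_number(const std::string& text, std::size_t start) {
    if (start >= text.size() || !std::isdigit((unsigned char)text[start]))
        return start;                      // no match

    std::size_t i = scan_integer(text, start);
    std::size_t accepted = i;              // acceptance state: integer

    // Fraction: '.' followed by another integer.
    if (i < text.size() && text[i] == '.') {
        std::size_t j = scan_integer(text, i + 1);
        if (j > i + 1) { i = j; accepted = i; }  // acceptance state: decimal
    }

    // Exponent: (e|E), optional sign, then an integer.
    if (i < text.size() && (text[i] == 'e' || text[i] == 'E')) {
        std::size_t j = i + 1;
        if (j < text.size() && (text[j] == '+' || text[j] == '-')) ++j;
        std::size_t k = scan_integer(text, j);
        if (k > j) accepted = k;           // acceptance state: exponent
        // otherwise: roll back by keeping the last accepted state
    }
    return accepted;
}
```

Note how the rollback falls out naturally: the function only advances `accepted` when a full acceptance state is reached, so scanning `12.3e-` returns the index just past `12.3`.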

Scanning Strings

Strings are one of the easiest components to process. “A string looks like this.” All characters contained within the quotes are combined to produce the string. The only tricky part occurs when “a string contains a “ within the string”. The usual solution is to escape the quote. This entails placing a backslash before the quote. Now the string will be “a string contains a \” within the string”.

All the pattern must do is verify the initial character is a quote. Then it looks for an unescaped quote later in the input text. The pattern also checks to see if the last character is a backslash. If it is, the string continues to the next line. A separate pattern handles strings that cross multiple lines. It should also be mentioned that the validity of the quote is not checked. The pattern will not detect unterminated quotes. As with the number pattern, this check is omitted to delay the identification of errors.
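A single-line version of that string pattern could be sketched as follows. This is an illustrative sketch under the same return convention as before, not the article's exact code; the multi-line continuation and unterminated-quote handling are deliberately left to later stages, as described above:

```cpp
#include <cstddef>
#include <string>

// String DFA sketch: expects an opening quote at `start`, then consumes
// until an unescaped closing quote or the end of the line. Returns the
// index one past the last consumed character, or `start` on no match.
std::size_t scan_string(const std::string& text, std::size_t start) {
    if (start >= text.size() || text[start] != '"')
        return start;                          // no match

    std::size_t i = start + 1;
    while (i < text.size() && text[i] != '\n') {
        if (text[i] == '\\') {                 // escape: skip the next char,
            ++i;                               // so \" does not end the string
            if (i < text.size()) ++i;
            continue;
        }
        if (text[i] == '"')
            return i + 1;                      // unescaped closing quote
        ++i;
    }
    // No closing quote on this line. Unterminated strings are not flagged
    // here; a multi-line pattern or a later pass deals with them.
    return i;
}
```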

Identifying Operators

Operators are mathematical symbols that can be one or more characters long. To understand how operator scanning works, let’s look at +, ++, and +=. The least efficient way would involve looking for the longest operators first. A better way is to use an operator hierarchy.

Operator hierarchy beginning with +
Figure 2: Operator DFA beginning with + — by author

The operator hierarchy records the relationship between operators. Each operator has a character and tag. It also has a list of operators that come after it in the hierarchy. The first operator will be < + : PLUS >. Under plus will be two more operators: < + : INCREMENT > and < = : PLUS_EQUAL >.

To process the hierarchy, the scanner matches the operator character. When a match is found, the current operator type is updated. If more operators exist, the scanner calls itself recursively, otherwise it returns.
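The hierarchy from Figure 2 can be sketched as a small tree of operator nodes. The node layout and tag names below are my own illustration of the idea, not the article's exact types:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative operator tags; a real scanner would have many more.
enum class Tag { NONE, PLUS, INCREMENT, PLUS_EQUAL };

// Each node records a character, its tag, and the operators that can
// extend it (e.g. '+' can be extended by '+' or '=').
struct OpNode {
    char ch;
    Tag tag;
    std::vector<OpNode> next;
};

// Recursively matches the longest operator starting at `i`, updating `tag`
// each time a longer match is found. Returns the index one past the match,
// or `i` unchanged when nothing matched.
std::size_t scan_operator(const std::vector<OpNode>& ops,
                          const std::string& text, std::size_t i, Tag& tag) {
    for (const OpNode& op : ops) {
        if (i < text.size() && text[i] == op.ch) {
            tag = op.tag;                                    // best match so far
            return scan_operator(op.next, text, i + 1, tag); // try to extend
        }
    }
    return i;   // no deeper match: return how far we got
}
```

Because each level of the tree only contains characters that can legally follow the previous one, the scanner never compares against whole operator strings, and longest-match behavior comes for free.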

Identifying Keywords (or Fast Text Matching)

Identifying specific words — whether they are keywords, names, or identifiers — has a simple solution. Extract the word from the input text, thereby making a new string. Then compare that string to all possible keywords/names/what-have-you!

Wait. You think this will be slow and inefficient?

Well, that’s because it is. So then, what would be a better way to go about comparing text?

The most efficient way to compare text is to not compare text at all. Instead, each word is hashed. Then the hashes can be compared using a binary tree or hash table. I prefer to use a binary tree; that way I can use my own hashing function without having to write a custom hash table.

To compute the hash, I rely on the CRC32 (cyclic redundancy check — 32 bit) function. The function was created to detect errors in data streams. The main benefit of using CRC32 over another hashing function is its native hardware support. CRC32 is a hardware intrinsic function. It can be used on any machine supporting the SSE4.2 instruction set extensions. According to the Steam Hardware Survey, 96.15% of the CPUs in their report support SSE4.2.

Using an intrinsic function is, unfortunately, not as pretty as most library calls. The function is declared in the intrin.h header with MSVC (nmmintrin.h with other compilers). There are four natively supported forms of the CRC32 algorithm. Since we are hashing a string one byte at a time, the best option is _mm_crc32_u8(hash, u8_char), where hash is the current CRC result value and u8_char is the current byte of text cast to an unsigned byte.

Once a word has been hashed, all that’s left is comparison of the hashed values. When matching hashes are found the words are likely to be the same. However, they are not guaranteed to be the same. As with all hash functions, collisions can occur. At this point we return to basic string to string comparison. But this time the number of possibilities has been greatly narrowed, ideally to 1.
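For machines without SSE4.2 (or to see what the intrinsic computes), the same CRC32-C polynomial can be evaluated in software. This is a portable bit-at-a-time sketch of my own, not the article's code; on supported CPUs you would replace the inner function with `_mm_crc32_u8`. The initial value and final inversion are conventional choices, not something the intrinsic imposes:

```cpp
#include <cstdint>
#include <string>

// Software equivalent of one _mm_crc32_u8 step: CRC32-C (Castagnoli),
// reflected polynomial 0x82F63B78, processed one bit at a time.
std::uint32_t crc32c_byte(std::uint32_t crc, std::uint8_t byte) {
    crc ^= byte;
    for (int k = 0; k < 8; ++k)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

// Hashes a word for keyword/identifier lookup.
std::uint32_t hash_word(const std::string& word) {
    std::uint32_t crc = 0xFFFFFFFFu;   // conventional initial value
    for (unsigned char c : word)
        crc = crc32c_byte(crc, c);
    return crc ^ 0xFFFFFFFFu;          // conventional final inversion
}
```

Two words hash to the same value almost always because they are the same word, but as noted above a collision check by direct string comparison is still required before declaring a match.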

The Scanner Loop

Flowchart of main scanner loop
Figure 3: Main scanner loop — by author

Each DFA is designed to return the index where the text no longer matches the pattern. If the pattern did not match, the returned index will equal the index given to the DFA. When this occurs, the loop moves to the next possible DFA.

Once a match is found, the returned index will be different than the input index. At this point, the scanner will decide whether this pattern is needed by a parser (or another step that uses the scanner output). If the pattern is important, a token is produced. If the pattern is not important the lexeme is still recorded, but no token is produced.

The scanner then restarts the loop with the first DFA. Restarting the loop preserves the order of importance of DFA rules. It also enables the creation of simpler DFAs with fewer branches.
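The loop in Figure 3 can be sketched in a few lines. This is a simplified illustration under the stated return convention (a DFA returns its input index on no match); token production is elided, and the `Dfa`/`Lexeme` names are mine:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A lexeme records its start index and the index one past its end.
struct Lexeme { std::size_t begin, end; };

// A DFA takes the text and a start index, and returns the index where
// the text stops matching (== start index means "no match").
using Dfa = std::function<std::size_t(const std::string&, std::size_t)>;

std::vector<Lexeme> scan(const std::string& text,
                         const std::vector<Dfa>& dfas) {
    std::vector<Lexeme> lexemes;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t end = i;
        for (const Dfa& dfa : dfas) {   // tried in order of importance
            end = dfa(text, i);
            if (end != i) break;        // match found: stop trying DFAs
        }
        if (end == i) ++end;            // nothing matched: skip one character
        lexemes.push_back({i, end});    // every lexeme is recorded; tokens
        i = end;                        // would be produced selectively here
    }
    return lexemes;                     // loop restarts from the first DFA
}
```

Restarting from the first DFA after every match is what preserves the priority ordering of the rules.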

Examples of Scanner Output

Now that we have covered how to build a scanner, let’s examine the output it produces. Consider the following example:

let x = 3
let y = x + 2 * (9 - x)

When we run the scanner, it will produce 28 lexemes. Each lexeme records its starting position within the input text and the index that is one past its end.

As the scanner produced the lexemes, it also decided which were relevant to later steps. In the programming language I am implementing, whitespace is not needed for parsing. Therefore, it is excluded from the set of tokens. To track where one line ended and another began, a special EOL (end of line) token is inserted at the end of each expression. In the end, the scanner produced 18 tokens from the 28 lexemes.

Performance Tests

When processing lots of text it’s important to do so as quickly as possible. To put the scanner through its paces, I used a series of long C header files. The headers come from the STB library.

The performance test is not a rigorous benchmark, but it gives a rough estimate of the scanner’s performance. All times are measured using wall clock time. I had other things running on my machine, and no effort was made to prioritize the scanner. For each file, 30 tests were run to produce the average and standard deviation. The test was performed on an Intel i7 processor at 2.60 GHz.

Performance metric for the scanner.
Figure 4: Performance analysis of the scanner — by author

This article is part of an ongoing series about compilers. The goal is to produce a new programming language. Next, we will cover how to produce a scoped symbol table which is required to build a parser.

Next in Series: Implementing a Scoped Symbol Table (coming soon)
Previous in Series:
Inside a Compiler
Series Overview:
Creating a Safe Language

Code for this article and the compiler series is available on GitHub.


I am a Civil Engineer that codes. I write about compilers, computer graphics, and entrepreneurship.