Learning Rust: Hashing

Adrian Macal · Level Up Coding · Jul 25, 2023

Ever wondered how to deduplicate massive amounts of binary data?

Data hashing is a fundamental concept in computer science, used primarily for integrity checking, encryption, and deduplication. It relies on determinism: no matter how many times you run the same data through a hash function, you will always get the same hash value.

Apart from determinism, hash functions have additional properties such as a fixed output size, the avalanche effect, uniformity, collision resistance, and resistance to deriving the original input. However, it is important to note that two different inputs may eventually produce the same hash, as the pigeonhole principle dictates.

One of the well-known hashing algorithms is SHA-512. In Rust, it can be used via the ring crate as follows:

use hex;
use ring::digest;

fn main() {
    let mut context: digest::Context = digest::Context::new(&digest::SHA512);
    let data: Vec<u8> = vec![0; 10];

    context.update(data.as_ref());

    let digest: digest::Digest = context.finish();
    println!("{}", hex::encode(digest.as_ref()));
}

The code constructs a new hashing context, feeds it the data in a single update, and finalizes the digest. If the data is large, the context can be updated in smaller chunks.
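As a small illustration of both points, plus the avalanche effect mentioned earlier, here is a minimal sketch (using the same ring and hex crates): several smaller updates produce exactly the same digest as a single update, while flipping a single bit of input produces a completely different digest.

use ring::digest;

fn main() {
    let data = b"hello world";

    // Hashing everything at once...
    let mut one_shot = digest::Context::new(&digest::SHA512);
    one_shot.update(data);

    // ...or feeding the same bytes in two smaller updates...
    let mut chunked = digest::Context::new(&digest::SHA512);
    chunked.update(&data[..5]);
    chunked.update(&data[5..]);

    // ...yields exactly the same digest (determinism).
    assert_eq!(one_shot.finish().as_ref(), chunked.finish().as_ref());

    // Avalanche effect: flipping a single bit ('d' -> 'e') changes the whole digest.
    println!("{}", hex::encode(digest::digest(&digest::SHA512, b"hello world")));
    println!("{}", hex::encode(digest::digest(&digest::SHA512, b"hello worle")));
}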

A lesser-known function is GEAR, which is fundamentally different from SHA-512. It belongs to the family of rolling hash functions, which means it can compute the hash of a long sequence of data in a rolling manner within a fixed-size window. Instead of processing the entire dataset at once or in chunks, a rolling hash function calculates the hash at the next position from the current hash value and the newly arriving bytes.

Let’s see the rolling function in action. The GEAR function uses a 64-byte window, which we can confirm with the following code built on the gearhash crate.

use gearhash::Hasher;

fn main() {
    let mut hasher = Hasher::default();

    for i in 0..70 {
        hasher.update(&[0]);
        println!("{}: {}", i, hasher.get_hash());
    }
}

This code incrementally feeds the hasher with zeroes and prints the current hash in each iteration. The expectation is that the hash stops changing after 64 iterations, and that is exactly what happens. Why? Check out the internal implementation:

impl<'t> Hasher<'t> {
    /// Create a new hasher with the given table.
    pub fn new(table: &'t Table) -> Self {
        Self { table, hash: 0 }
    }

    /// Update the hash state by processing all the bytes in the given slice.
    pub fn update(&mut self, buf: &[u8]) {
        for b in buf.iter() {
            self.hash = (self.hash << 1).wrapping_add(self.table[*b as usize]);
        }
    }
}

For every byte processed, the implementation shifts the previous hash one bit to the left, slowly forgetting what has been computed so far. After 64 shifts, the entire memory of the previous state is gone.
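We can verify this property directly. In the following sketch, two hashers first see completely different data; once they process the same 64-byte window, the old state is fully shifted out and their hashes converge:

use gearhash::Hasher;

fn main() {
    let mut first = Hasher::default();
    let mut second = Hasher::default();

    // Feed the two hashers completely different prefixes of different lengths.
    first.update(&[0xAA; 100]);
    second.update(&[0x55; 300]);

    // After the same 64 bytes, everything older has been shifted out of the 64-bit state.
    let window = [0x42; 64];
    first.update(&window);
    second.update(&window);

    assert_eq!(first.get_hash(), second.get_hash());
}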

Both functions can work together to create something beautiful called Content-Defined Chunking. It is a concept predominantly used in data deduplication, providing a solution to the limitations of offset-based chunking. It addresses the issue of inserted or removed bytes in binary files, which would shift data and completely disrupt the naïve algorithm.
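To see why offset-based chunking is so fragile, here is a tiny sketch with made-up sample data: inserting a single byte at the front shifts every fixed-size chunk that follows, so none of them match anymore. Content-defined boundaries, by contrast, are derived from the bytes themselves and realign shortly after the insertion, which is exactly what the Wikipedia experiment below demonstrates.

fn main() {
    // Repetitive sample data standing in for a real file.
    let original: Vec<u8> = (0u8..=255).cycle().take(8192).collect();

    // The same data with one extra byte inserted at the very beginning.
    let mut modified = original.clone();
    modified.insert(0, 0xFF);

    // Compare fixed-size 1 KiB chunks by position.
    let identical = original
        .chunks(1024)
        .zip(modified.chunks(1024))
        .filter(|(a, b)| a == b)
        .count();

    // Every chunk after the insertion point is shifted, so none of them match.
    println!("identical fixed-size chunks: {}", identical);
}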

Consider the actual data dumps from Wikipedia. Every month, several terabytes are exported into large XML files. Let’s take a file named ‘enwiki-20230701-pages-meta-history1.xml-p1p844.bz2’ as an example. After unpacking, it weighs in at roughly 37 GB. We can also look at its size alongside two earlier versions:

 2486879067 enwiki-20230501-pages-meta-history1.xml-p1p844.bz2
 2498312992 enwiki-20230601-pages-meta-history1.xml-p1p844.bz2
 2517686076 enwiki-20230701-pages-meta-history1.xml-p1p844.bz2

36641858740 enwiki-20230501-pages-meta-history1.xml-p1p844
36762597370 enwiki-20230601-pages-meta-history1.xml-p1p844
36958681345 enwiki-20230701-pages-meta-history1.xml-p1p844

All three text files contain more or less the same articles, but the newer files are very likely to include insertions of new article revisions. Changes land somewhat randomly, probably in many places throughout the middle of the file. Let’s try to apply Content-Defined Chunking to them.

The key idea behind Content-Defined Chunking is to consistently identify certain bytes that can be used to partition the files into chunks. The rolling hash is an ideal candidate for this: as it is applied, we get a hash for every 64-byte window of the file. However, merely computing a hash isn’t sufficient to determine where to make the cut. We can’t cut at every hash value, and we can’t cut only on a single well-known hash value either. The objective is to cut the data whenever the hash value matches a given bit mask. The number of bits in this mask determines the average chunk size, while individual chunks can still vary in size.
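As a rough sketch of the boundary test: gearhash reports a match when the masked bits of the rolling hash are all zero (the boundary hashes printed later in this article all end in four zero hex digits for exactly this reason), so a mask with n bits set yields a boundary roughly every 2^n bytes. The pseudo-random input below is generated with a small xorshift only to keep the example self-contained.

use gearhash::Hasher;

fn main() {
    // A 16-bit mask: a boundary is expected roughly every 2^16 = 65536 bytes.
    let mask: u64 = 0x0000_0000_0000_ffff;
    println!("expected average chunk size: {} bytes", 1u64 << mask.count_ones());

    // Pseudo-random data from a tiny xorshift generator, so no extra crates are needed.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let data: Vec<u8> = (0..1_048_576)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state as u8
        })
        .collect();

    // next_match rolls the hash over the slice and returns the offset just past
    // the first byte where hash & mask == 0, if there is one.
    let mut hasher = Hasher::default();
    match hasher.next_match(&data, mask) {
        Some(boundary) => println!("first boundary at offset {}", boundary),
        None => println!("no boundary in this buffer"),
    }
}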

Let’s demonstrate this with code that determines the cuts using the GEAR hash and additionally hashes the split content using the SHA-512 algorithm.

use std::fs::File;
use std::io::Read;

use gearhash::Hasher;
use ring::digest;

fn main() {
    let mut context = digest::Context::new(&digest::SHA512);
    let mut hasher = Hasher::default();
    let mut file = File::open("/home/vscode/enwiki-20230701-pages-meta-history1.xml-p1p844").unwrap();

    let mask = 0x0000_0000_0000_ffff;
    let mut buffer = vec![0; 1048576];

    let mut total = 0;
    let mut offset = 0;
    let mut previous = 0;

    while let Ok(count) = file.read(&mut buffer) {
        while count > offset {
            match hasher.next_match(&buffer[offset..count], mask) {
                None => {
                    // No boundary in the rest of the buffer; hash it all and read more.
                    context.update(&buffer[offset..count]);
                    offset = count;
                }
                Some(boundary) => {
                    let next = total + offset + boundary;
                    let diff = next - previous;

                    context.update(&buffer[offset..offset + boundary]);

                    // Emit only chunks of at least 1 MiB; smaller ones are merged into the next.
                    if diff >= 1048576 {
                        let digest = format!("{:?}", context.finish());
                        println!("{:>16x} {:.20} {:>10}", hasher.get_hash(), digest, diff);

                        context = digest::Context::new(&digest::SHA512);
                        hasher.set_hash(0);
                        previous = next;
                    }

                    offset += boundary;
                }
            }
        }

        if count == 0 {
            break;
        }

        total += count;
        offset = 0;
    }

    // Emit the final chunk if any data remains after the last reported boundary.
    if total > previous {
        let digest = format!("{:?}", context.finish());
        println!("{:>16x} {:.20} {:>10}", hasher.get_hash(), digest, total - previous);
    }
}

This code reads the file in 1 MiB blocks. The rolling hash determines the cut points, and the content is simultaneously hashed on-the-fly with SHA-512. When a boundary is found, basic metadata is printed to the console: the rolling hash, the SHA-512 digest, and the chunk length. You may also notice that chunks smaller than 1 MiB are not emitted on their own; they are merged into the following chunk to reduce noise. The idea of combining small chunks is described in the FastCDC research paper.

What does the Rust program print? It produces a kind of metadata file describing how the file is split and how it could be reassembled from multiple chunks. We are not actually going to chunk the file and search for duplicates. Instead, we will apply this approach to all three files and see how similar or different they are. Here is the beginning of the output after processing each of them:

-- enwiki-20230501-pages-meta-history1.xml-p1p844
11c21798b7100000 SHA512:448c81907bc00 1596779
dc9880249fb60000 SHA512:a585d928c1a5e 1677422
85dc9161e75e0000 SHA512:2a89a71c2c34d 9496454
85dc9161e75e0000 SHA512:9489a827f71c0 1072177
85dc9161e75e0000 SHA512:c608810d7c822 1078175

-- enwiki-20230601-pages-meta-history1.xml-p1p844
11c21798b7100000 SHA512:f217309751389 1596780
dc9880249fb60000 SHA512:a585d928c1a5e 1677422
85dc9161e75e0000 SHA512:2a89a71c2c34d 9496454
85dc9161e75e0000 SHA512:9489a827f71c0 1072177
85dc9161e75e0000 SHA512:c608810d7c822 1078175

-- enwiki-20230701-pages-meta-history1.xml-p1p844
11c21798b7100000 SHA512:e65b8c26229b7 1596780
dc9880249fb60000 SHA512:a585d928c1a5e 1677422
85dc9161e75e0000 SHA512:2a89a71c2c34d 9496454
85dc9161e75e0000 SHA512:9489a827f71c0 1072177
85dc9161e75e0000 SHA512:c608810d7c822 1078175

We can notice that each file has the same cut points, but the first chunk differs in every file, while chunks 2–5 remain identical. If we compare the complete outputs with a diff tool, we find that only 230 out of roughly 27,000 chunks differ. The algorithm invalidated only about 1% of all chunks, which is impressive considering that we treated the files as opaque binary data.

You may wonder how this is useful. Well, let’s consider a common scenario where we want to store backups of data over time. If we used a traditional backup strategy, we would simply store a complete copy of the data at each point in time, leading to significant data duplication and inefficient use of storage.

However, by using a technique such as Content-Defined Chunking, we can take advantage of the fact that much of the data remains the same between different versions of the file. We can store each unique chunk only once, and then for each version of the file, just store a list of references to the chunks.
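Here is a minimal sketch of that idea; the ChunkStore type and everything in it is purely illustrative and not part of the article’s repository. Each unique chunk is stored once, keyed by its SHA-512, and every file version is just a list of chunk references:

use std::collections::HashMap;

use ring::digest;

// Illustrative content-addressed store: each unique chunk is kept exactly once,
// keyed by the hex-encoded SHA-512 of its contents.
struct ChunkStore {
    chunks: HashMap<String, Vec<u8>>,
}

impl ChunkStore {
    fn new() -> Self {
        Self { chunks: HashMap::new() }
    }

    // Store a chunk if it is not already present and return its reference.
    fn put(&mut self, data: &[u8]) -> String {
        let key = hex::encode(digest::digest(&digest::SHA512, data));
        self.chunks.entry(key.clone()).or_insert_with(|| data.to_vec());
        key
    }
}

fn main() {
    let mut store = ChunkStore::new();

    // Two versions of a file that share their second chunk.
    let version_a = vec![store.put(b"first chunk, old"), store.put(b"shared chunk")];
    let version_b = vec![store.put(b"first chunk, new"), store.put(b"shared chunk")];

    // Only three unique chunks are stored, while both versions keep full chunk lists.
    println!("unique chunks stored: {}", store.chunks.len());
    println!("references per version: {} and {}", version_a.len(), version_b.len());
}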

Moreover, when we need to transfer these backups over the network, we only need to transfer the new chunks that were not present in the previous version, reducing the amount of data that needs to be transferred.

This approach not only saves storage space, but also reduces transfer time and the processing power required to manage backups.

If you are interested in exploring this code further, please don’t hesitate to visit the repository at https://github.com/amacal/learning-rust/tree/rolling-hash. You’re more than welcome to take a look around!
