Unlocking the Power of Distributed Systems: A Comprehensive Guide to Consistent Hashing

Consistent Hashing in Action

A Deep Dive into the Inner Workings of Consistent Hashing for Distributed Systems

Arpit Jindal · Published in Level Up Coding · 5 min read · Sep 11, 2022


Consistent hashing is used in distributed systems to keep the hash table independent of the number of available servers, so that key relocation is minimized when the cluster scales up or down.

This article explains what consistent hashing is and why it is an essential tool in scalable distributed systems.

What is Hashing?

Hashing is the process of mapping one piece of data (an object of arbitrary size) to another piece of data of fixed size (typically an integer), known as a hash code or simply a hash. A function that maps objects to hash codes is known as a hash function.

Example of Hashing
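As a minimal sketch in Python (the hash_key helper name and the choice of MD5 are illustrative, not prescribed by any particular system):

```python
import hashlib

def hash_key(key: str) -> int:
    """Map an arbitrary string to a fixed-size integer hash code."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16)

print(hash_key("Mike"))  # a large, deterministic integer
print(hash_key("Lucy"))  # a different integer for a different key
```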

Hashing in a distributed system

In a scenario where various programs, computers, or users request resources from multiple server nodes, we need a mechanism to map requests evenly across the available nodes, ensuring that the load is balanced and performance stays consistent.

Suppose we have three servers A, B, and C, each holding an equal share of the keys. To store a new key x, we can hash it and store it on one of the servers based on the output of server = f(x) % 3, where f is the hash function.

Distribution of objects to 3 available servers
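A minimal sketch of this modulo-based placement (the server names and keys are illustrative):

```python
import hashlib

SERVERS = ["A", "B", "C"]

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def assign_server(key: str) -> str:
    # Naive placement: hash the key, then take it modulo the server count.
    return SERVERS[hash_key(key) % len(SERVERS)]

for name in ["Mike", "Lucy", "Anna", "John"]:
    print(name, "->", assign_server(name))
```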

Rehashing

It’s common for a cluster to scale up and down, and unexpected failures are a fact of life in distributed systems, so we cannot guarantee that the number of server nodes will remain the same. What if one of them fails? With a naive hashing approach, we would need to rehash every single key, because the mapping depends on the number of nodes.

Rehashing to match available servers impacts all the keys.

Simple rehashing is especially painful in a distributed system because state is stored on each node: changing the placement of every key means physically moving data between machines.

A small change in cluster size can therefore reshuffle nearly all the data in the cluster. As the cluster grows, this becomes unsustainable, because almost every key has to move each time the node count changes.
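A quick experiment makes the churn concrete (a sketch, reusing the illustrative hash_key helper from earlier):

```python
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def moved_fraction(num_keys: int, old_n: int, new_n: int) -> float:
    """Fraction of keys whose server changes when the cluster resizes."""
    moved = sum(
        1 for k in range(num_keys)
        if hash_key(str(k)) % old_n != hash_key(str(k)) % new_n
    )
    return moved / num_keys

print(moved_fraction(10_000, 3, 4))  # roughly 0.75: most keys relocate
```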

This is where the concept of consistent hashing comes in.

What is consistent hashing?

Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.

Imagine we mapped the hash code range onto the edge of a circle. That means that the minimum possible hash code would correspond to an angle of zero, the maximum possible hash code would correspond to an angle of 2𝝅 radians (or 360 degrees), and all other hash values would linearly fit somewhere in between. So, we could take a key, compute its hash, and find out where it lies on the circle’s edge.

For example, we can map the above data among 3 servers like this:
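A bare-bones ring can be sketched as a sorted list of server positions plus a clockwise lookup (the HashRing class and hash_key helper are illustrative names; bisect supplies the binary search):

```python
import bisect
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """A minimal consistent hash ring: one position per server, no virtual nodes."""

    def __init__(self, servers):
        # Sorted (position, server) pairs represent points on the circle.
        self._ring = sorted((hash_key(s), s) for s in servers)

    def get_server(self, key: str) -> str:
        # Walk clockwise from the key's position to the first server,
        # wrapping around past the maximum position.
        idx = bisect.bisect_right(self._ring, (hash_key(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["A", "B", "C"])
for name in ["Mike", "Lucy", "Anna", "John"]:
    print(name, "->", ring.get_server(name))
```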

What happens when a server is added or removed?

The main benefit of consistent hashing is that any server can be added or removed dynamically, and only a minimal set of objects needs to move.

On average, only about k/n objects need to move, where k is the total number of keys and n is the total number of nodes.

This property is known as monotonicity: when a server is added, objects only move from old servers to the new server; there is no unnecessary movement between the old servers.

  • Scale-Up Process: When a new server is added to the cluster, the server’s hash is calculated and it is placed on the hash ring. Keys are reassigned to redistribute the load, but only the keys that fall between the new server and its predecessor on the ring move (a code sketch covering both scale-up and scale-down follows after this list):
Mike and Lucy are now moved from Server C to the new server D

The gif below depicts how keys move when the cluster scales up:

Some keys are reassigned to the new server when the cluster is scaled up.
  • Scale-Down or Node Failure: When a server goes down, only the keys that belonged to that node are redistributed to the next available server clockwise on the hash ring, without impacting other keys:
As Server D is removed, Mike and Lucy are reassigned to server C.

The gif below depicts how keys move when the cluster scales down:

Keys are reassigned to the next available server on the ring when a node goes down.
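Extending the illustrative HashRing sketch with dynamic membership shows both cases; which specific keys move depends on the hash function, so the names here are only indicative:

```python
import bisect
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, servers=()):
        self._ring = []  # sorted (position, server) pairs
        for s in servers:
            self.add_server(s)

    def add_server(self, server: str) -> None:
        bisect.insort(self._ring, (hash_key(server), server))

    def remove_server(self, server: str) -> None:
        self._ring.remove((hash_key(server), server))

    def get_server(self, key: str) -> str:
        idx = bisect.bisect_right(self._ring, (hash_key(key), "")) % len(self._ring)
        return self._ring[idx][1]

keys = ["Mike", "Lucy", "Anna", "John", "Kate", "Adam"]
ring = HashRing(["A", "B", "C"])
before = {k: ring.get_server(k) for k in keys}

ring.add_server("D")  # scale-up: only keys now owned by D move
after = {k: ring.get_server(k) for k in keys}
print("moved:", [k for k in keys if before[k] != after[k]])

ring.remove_server("D")  # scale-down: D's keys go back to the next server
print({k: ring.get_server(k) for k in keys} == before)  # True
```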

Getting an even distribution of keys across Servers

In an ideal case with k keys and n servers, every server should hold close to k/n keys, so the addition or removal of a node impacts at most about k/n keys in the system. With a single position per server, however, the servers can end up unevenly spaced on the ring, leaving some with far more keys than others. To get a near-ideal distribution, we introduce virtual nodes: every physical node is placed at multiple positions on the hash ring.

Multiple virtual nodes for each server to evenly distribute keys across available servers.
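A sketch of virtual nodes (the VirtualNodeRing name, the "server#i" labeling scheme, and the replica count are all illustrative choices):

```python
import bisect
import hashlib
from collections import Counter

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class VirtualNodeRing:
    """A hash ring where each physical server owns many virtual positions."""

    def __init__(self, servers, replicas=100):
        self._ring = []
        for server in servers:
            for i in range(replicas):
                # "A#0", "A#1", ... scatter each server around the ring.
                self._ring.append((hash_key(f"{server}#{i}"), server))
        self._ring.sort()

    def get_server(self, key: str) -> str:
        idx = bisect.bisect_right(self._ring, (hash_key(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = VirtualNodeRing(["A", "B", "C"], replicas=200)
counts = Counter(ring.get_server(str(i)) for i in range(30_000))
print(counts)  # roughly balanced: close to 10,000 keys per server
```

With more replicas per server the distribution tightens further, at the cost of a larger ring to search.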

Conclusion

Consistent hashing has wide applications: it is used for sharding, load balancing, and more. It enables horizontal scalability and minimizes disruption in a dynamic environment.

It’s used in Amazon’s DynamoDB as a partitioning component. Open-source systems such as Apache Cassandra and Voldemort also use it for data partitioning.
