Unlocking the Power of Distributed Systems: A Comprehensive Guide to Consistent Hashing
Consistent Hashing in Action
A Deep Dive into the Inner Workings of Consistent Hashing for Distributed Systems
Consistent hashing is used in distributed systems to keep the mapping from keys to servers largely independent of the number of servers, minimizing the number of keys that must be relocated when the cluster scales.
This article explains what consistent hashing is and why it is an essential tool in scalable distributed systems.
What is Hashing?
Hashing is the process of mapping one piece of data (an object of arbitrary size) to another piece of data of fixed size (an integer), known as a hash code or simply a hash. A function used to map objects to hash codes is known as a hash function.
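As a concrete sketch (in Python, with MD5 as one illustrative choice of hash function), mapping a string to a fixed-size integer looks like this:

```python
import hashlib

def hash_key(key: str) -> int:
    """Map an arbitrary string to a fixed-size integer (its hash code).

    MD5 is an illustrative choice: it is fast and stable across runs,
    which matters when several machines must agree on the same hash.
    """
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

print(hash_key("user-42"))  # a 128-bit integer
```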
Hashing in a distributed system
In a scenario where various programs, computers, or users request resources from multiple server nodes, we need a mechanism to map requests evenly to available server nodes, thus ensuring that the load is balanced for consistent performance.
Suppose we have three servers: A, B, and C. With a uniform hash function, each will hold roughly an equal number of keys. To store a new key, we hash it and place it on the server given by server = f(x) % 3.
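A minimal sketch of this modulo scheme (Python; the server names and the `hash_key` helper are illustrative, not from a particular library):

```python
import hashlib

SERVERS = ["A", "B", "C"]

def hash_key(key: str) -> int:
    # A stable hash across processes (unlike Python's built-in hash()).
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def server_for(key: str) -> str:
    # server = f(x) % 3, generalized to any number of servers.
    return SERVERS[hash_key(key) % len(SERVERS)]

print(server_for("user-42"))  # always the same server for the same key
```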
Rehashing
It’s common for a cluster to scale up and down, and unexpected failures are a fact of life in distributed systems, so we cannot assume the number of server nodes stays constant. What if one of them fails? With a naive hashing approach, we need to rehash every single key, because each key’s placement depends on the number of nodes.
The problem with simple rehashing in a distributed system is that the state is stored on each node: changing the placement of a key means physically moving its data across the network.
A small change in the cluster size can reshuffle nearly all the data in the cluster: going from n to n + 1 servers moves roughly n/(n + 1) of the keys. As the cluster grows, this becomes unsustainable, because the amount of data transferred on every resize grows with the total number of keys.
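The scale of the reshuffle is easy to demonstrate. The sketch below (illustrative names; MD5 for a stable hash) counts how many of 10,000 keys change servers when a fourth server joins a three-server cluster under modulo placement:

```python
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

keys = [f"key-{i}" for i in range(10_000)]
before = {k: hash_key(k) % 3 for k in keys}  # 3 servers
after = {k: hash_key(k) % 4 for k in keys}   # a 4th server joins

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 75%
```

A key keeps its server only when its hash gives the same remainder mod 3 and mod 4, which happens for about a quarter of keys, so roughly three-quarters of the data would have to move.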
This is where the concept of consistent hashing comes in.
What is consistent hashing?
Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.
Imagine we mapped the hash code range onto the edge of a circle. That means that the minimum possible hash code would correspond to an angle of zero, the maximum possible hash code would correspond to an angle of 2𝝅 radians (or 360 degrees), and all other hash values would linearly fit somewhere in between. So, we could take a key, compute its hash, and find out where it lies on the circle’s edge.
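One way to sketch this ring in Python (the names and the use of `bisect` over a sorted list are implementation choices, not the only option):

```python
import bisect
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """A minimal consistent-hash ring: a sorted list of (hash, node) points."""

    def __init__(self, nodes=()):
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        bisect.insort(self._ring, (hash_key(node), node))

    def remove(self, node: str) -> None:
        self._ring.remove((hash_key(node), node))

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the first node;
        # wrap around to the start of the ring if we fall off the end.
        idx = bisect.bisect(self._ring, (hash_key(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["A", "B", "C"])
print(ring.node_for("user-42"))  # one of A, B, C
```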
For example, we can place keys and three servers on the ring like this:
What happens when a server is added or removed?
The main benefit of consistent hashing is that servers can be added or removed dynamically while only a minimal set of objects moves.
On average, only k/n keys need to move, where k is the total number of keys and n is the total number of nodes.
This property is known as monotonicity: when a server is added, objects only move from old servers to the new server; there is no unnecessary movement between the old servers.
- Scale-up: When a new server is added to the cluster, its hash is calculated and it is placed on the hash ring. Keys are reassigned to redistribute the load, but only the keys that fall between the new server and its predecessor on the ring move:
The GIF below depicts how keys move when the cluster scales up:
- Scale-down or node failure: When a server goes down, only the keys belonging to that node are redistributed to the next available server clockwise on the hash ring, without impacting other keys or objects:
The GIF below depicts how keys move when the cluster scales down:
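Both properties, minimal movement and monotonicity, can be checked empirically. This self-contained sketch (illustrative names, MD5 hashing, one ring position per server) adds a server D to an A/B/C ring and verifies that every key that moves lands on D:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def build_ring(nodes):
    # One (hash, node) point per server, sorted around the circle.
    return sorted((h(n), n) for n in nodes)

def lookup(ring, key):
    idx = bisect.bisect(ring, (h(key),)) % len(ring)
    return ring[idx][1]

keys = [f"key-{i}" for i in range(10_000)]
before = {k: lookup(build_ring(["A", "B", "C"]), k) for k in keys}
after = {k: lookup(build_ring(["A", "B", "C", "D"]), k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Monotonicity: every key that moved went to the new node D.
assert all(after[k] == "D" for k in moved)
print(f"{len(moved) / len(keys):.1%} of keys moved, all of them to D")
```

With only one position per server, the exact fraction that moves depends on where D's hash happens to land on the circle; virtual nodes bring it close to the ideal k/n.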
Getting an even distribution of keys across Servers
In the ideal case, with k keys and n servers, every server holds close to k/n keys, so the addition or removal of a node affects only about k/n keys in the system. In practice, with a single position per server the distribution can be very uneven, so to get a near-ideal distribution we introduce virtual nodes: every physical node is assigned multiple positions on the hash ring.
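A sketch of virtual nodes (Python; 100 positions per server is an arbitrary illustrative choice): each physical server is hashed under many derived labels such as "A#0", "A#1", and so on, so its load is averaged over many small arcs of the ring:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

VNODES = 100  # ring positions per physical server (tunable)

def build_ring(nodes, vnodes=VNODES):
    # Each physical node appears vnodes times under derived labels.
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def lookup(ring, key):
    idx = bisect.bisect(ring, (h(key),)) % len(ring)
    return ring[idx][1]

keys = [f"key-{i}" for i in range(30_000)]
ring = build_ring(["A", "B", "C"])

counts: dict[str, int] = {}
for k in keys:
    node = lookup(ring, k)
    counts[node] = counts.get(node, 0) + 1
print(counts)  # each server close to the ideal 10,000 keys
```

More virtual nodes give a smoother distribution at the cost of a larger ring and slightly slower lookups.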
Conclusion
Consistent hashing has wide applications: it is used for sharding, load balancing, and distributed caching. It enables horizontal scalability and minimizes disruption in a dynamic environment.
It’s used in Amazon’s Dynamo as a partitioning component. Further, open-source systems such as Apache Cassandra and Voldemort use it for data partitioning.