Idempotency: The Key to a Robust Distributed System

Prologue

Rohit Singh
Level Up Coding

I was working on a project to implement distributed coordination with transaction management for critical banking applications, and I was debating between sagas and idempotency, since data integrity is a critical factor for financial institutions. This led me to take a deep dive into which option was better. In the end, I decided to implement an idempotent system with state management, as it suits my use case of a retry and recovery strategy perfectly. So, we will see what idempotency is and how to implement it correctly. I will share my experience here.

Why is idempotency the key?

In modern distributed systems, availability is a key factor, which means you build in retries and ways to handle failures and recover. Doing so also means the system may try to process the same request again; if your application does not recognise this and treats it as a completely new request, it will produce undesirable results. In an application that processes payments or manages e-commerce orders for shipping, this could cause a big monetary loss and irreparable damage.

So, what should we do to guarantee that performing the operation multiple times will yield the same result as if it were executed only once?

Make the system idempotent!!!

What is idempotency?

Idempotency refers to the ability of a system or process to produce the same result even if the same operation is performed multiple times. Idempotency guarantees that executing the same operation multiple times doesn’t introduce unexpected side effects, safeguarding against accidental duplications or undesired changes. For example, an idempotent system ensures that duplicate requests do not result in duplicate processing: when a duplicate request arrives, the system either ignores it or returns the state produced the first time the request was processed.
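
To make the "return the state processed the first time" behaviour concrete, here is a minimal in-memory sketch. The class name IdempotentExecutor and its execute method are my own illustration, not part of any library, and a real system would keep the results in a durable data store rather than in a map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper: runs an operation at most once per idempotency key
// and returns the stored result for every later call with the same key.
final class IdempotentExecutor {
    // Assumption: results fit in memory; production code would persist them.
    private final ConcurrentHashMap<String, String> results = new ConcurrentHashMap<>();

    String execute(String idempotencyKey, Supplier<String> operation) {
        // computeIfAbsent invokes the operation only when the key is not present yet.
        // Note: a null result is not stored, so the operation should return non-null.
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

Calling execute twice with the same key runs the operation once; the second call simply gets the first result back.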

Wait a minute. I am confused now!!! Is doing duplicate checks the same as an idempotent operation??

Duplicate checks vs Idempotency

Duplicate checks are implemented to prevent unwanted state changes and ensure that we do not process the same event more than once, whereas idempotency allows us to process the same event again while guaranteeing that the result stays the same. To simplify the difference: with idempotency, we can handle duplicate events in an at-least-once messaging system, because processing the same event multiple times still produces the same effect. In other words, idempotency is by far the best way to deal with duplicate events.
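
The difference can be sketched in a few lines. This is only an illustration with made-up names (DuplicateVsIdempotent, creditOnce, markPaid): the first method guards a non-idempotent operation with a duplicate check, while the second is idempotent by construction and can simply be run again.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; an in-memory illustration only.
final class DuplicateVsIdempotent {
    private final Set<String> seenEventIds = new HashSet<>();
    private final Map<String, String> orderStatus = new HashMap<>();
    private long balance = 0;

    // Duplicate check: adding to a balance is not idempotent,
    // so a repeated event has to be detected and skipped.
    boolean creditOnce(String eventId, long amount) {
        if (!seenEventIds.add(eventId)) {
            return false; // already processed, do nothing
        }
        balance += amount;
        return true;
    }

    // Idempotent operation: setting an absolute status gives the same
    // outcome no matter how many times the event is processed.
    String markPaid(String orderId) {
        orderStatus.put(orderId, "PAID");
        return orderStatus.get(orderId);
    }

    long balance() {
        return balance;
    }
}
```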

That’s good so far. But how do we implement it, and what are the challenges one can face???

Idempotency implementation strategy

To build idempotency, the most important task is to find or create an idempotent key for each request. I have put together a simple strategy, followed by a list of standard keys, that can be used to build robust idempotency.

Strategy

  1. Find a unique correlation identifier for each request that we can rely on and store in the data store to check against each incoming request.
  2. If there is no unique identifier, invent one by hashing the payload to create checksums and using it for idempotency.
  3. If the payload contains UUIDs or timestamps, such as a created timestamp, which might change on each retry, exclude those fields when creating the checksum if possible. If you cannot exclude them, replace them with fixed (hardcoded) values before hashing.
  4. In the case of POST APIs, add the x-idempotency-key request header and ask consumers to provide a unique identifier and an optional expiration period.
  5. In the case of an interconnected service mesh, where services collaboratively execute tasks, use a state change model along with duplicate checks to ensure the idempotency of the system.
  6. Define the validity of idempotency keys in terms of expiration, for example, how long your datastore is going to store this key for looking back.
  7. A generic model, as shown in the example, can be used for creating a composite idempotent key.
public class IdempotentKey {
    private String key;    // idempotency key
    private long ttl;      // time to live
    private String result; // response
}
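
Points 2 and 3 of the strategy can be sketched roughly as follows. This is a minimal sketch under assumptions: the payload is treated as a flat map of strings, the field names createdAt and requestId are hypothetical examples of volatile fields, and SHA-256 is my choice of digest. The simple key=value canonical form also assumes values contain no newlines.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class PayloadChecksum {
    // Hypothetical names of fields that change on every retry.
    private static final Set<String> VOLATILE_FIELDS = Set.of("createdAt", "requestId");

    static String idempotencyKey(Map<String, String> payload) {
        // Sort by field name so the same payload always hashes the same way.
        TreeMap<String, String> stable = new TreeMap<>(payload);
        stable.keySet().removeAll(VOLATILE_FIELDS);

        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, String> field : stable.entrySet()) {
            canonical.append(field.getKey()).append('=').append(field.getValue()).append('\n');
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(ex);
        }
    }
}
```

Two retries of the same order that differ only in createdAt produce the same key, while a change to any other field produces a different one.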

Idempotency keys

Standard keys that can be used to build idempotent keys:

  • UUID v4. Use standard java.util.UUID to create unique identifiers.
  • Payload Hashing: Create a hash (digest) of the request payload as the idempotency key. Guarantees that requests with the same payload generate the same key.
  • Key Elements: Include unique parameters like user ID, transaction details, and timestamps in the idempotency key. Creates a composite key for precise identification.
  • Tokens: Issue and require clients to include tokens in requests, serving as both idempotency keys and security measures.
  • Timestamps: Use timestamps or a combination for time-based idempotency keys, ensuring operations are idempotent within defined time windows.
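
A few of these key styles can be sketched as small helper methods. The class and method names below are hypothetical, and the ':' separator and the bucket arithmetic are just one possible way to build composite and time-windowed keys.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;

// Hypothetical helpers illustrating three of the key styles listed above.
final class IdempotencyKeys {
    // UUID v4: random, so the client must generate it once and reuse it on retries.
    static String random() {
        return UUID.randomUUID().toString();
    }

    // Key elements: combine stable identifiers into one composite key.
    static String composite(String userId, String transactionId) {
        return userId + ":" + transactionId;
    }

    // Timestamps: requests in the same time window map to the same key.
    static String timeWindowed(String userId, String operation, Instant at, Duration window) {
        if (window.isZero() || window.isNegative()) {
            throw new IllegalArgumentException("window must be positive");
        }
        long bucket = at.toEpochMilli() / window.toMillis();
        return userId + ":" + operation + ":" + bucket;
    }
}
```

With a five-minute window, two requests from the same user for the same operation at 10:00:00 and 10:04:59 share a key, while a request at 10:05:00 gets a new one.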

Idempotent Model

What does this idempotency look like, and how do you handle different scenarios? Below is the high-level model of idempotency that I have developed.

  1. The client sends the payload “p12345” to the intake service for processing. The client can choose to send x-idempotency-key in the request header, which can be used as an idempotent key for processing. If the client does not send an idempotency key header, then we can create one by hashing the payload and storing the key in Mapper. (OrderId is generated by the system for each new request, if it is not there already.)
  2. Intake will carry out the duplicate check and see if the order is already in the system; if yes, it will return the current state of the order to the client; if no, it will persist the order with the current state — RCVD.
  3. Next, intake can run validations, if any are needed, and proceed to payment if they succeed. Load the current state of the order (as it could already have been PAID), and if it is RCVD, process the payment and update the state to PAID. If intake retries the payment for an existing order, because of network failures or any other reason, our state model comes to the rescue, and the retry will be rejected as the order is already PAID.
  4. After successful payment, intake will send the order for fulfilment. If the order state is PAID, the order is fulfilled and reaches its final state, FULFILLED. If a retry happens here, the state model will come to the rescue again.
  5. Do note that when multiple instances (pods) of the intake or payment service are running, you must hold a cluster-wide lock while updating the state.
  6. Another point is that in order to speed up the process of idempotency, one can use a caching layer to store mapper information, but in my case, I have not done that as the primary data store was good enough to handle high loads. (Avoid premature optimisation).
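
The flow in steps 1 to 4 can be sketched as a small state model. This is an in-memory approximation under assumptions, not my production code: OrderFlow and its method names are hypothetical, the maps stand in for the Mapper and the data store, and the atomic replace call plays the role of the cluster lock inside a single JVM only. Across several pods you would need a real distributed lock or a conditional update in the data store.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class OrderFlow {
    enum State { RCVD, PAID, FULFILLED }

    // Mapper: idempotency key -> orderId (in-memory stand-in for the data store).
    private final ConcurrentHashMap<String, String> mapper = new ConcurrentHashMap<>();
    // Current state of each order.
    private final ConcurrentHashMap<String, State> orders = new ConcurrentHashMap<>();
    private final AtomicLong orderSeq = new AtomicLong();
    // Counts real charges so we can see that retries do not charge twice.
    final AtomicInteger paymentsCharged = new AtomicInteger();

    // Steps 1-2: a duplicate request returns the existing order;
    // a new request is persisted in state RCVD.
    String intake(String idempotencyKey) {
        return mapper.computeIfAbsent(idempotencyKey, key -> {
            String orderId = "order-" + orderSeq.incrementAndGet();
            orders.put(orderId, State.RCVD);
            return orderId;
        });
    }

    State stateOf(String orderId) {
        return orders.get(orderId);
    }

    // Step 3: only an order in RCVD can move to PAID; retries are rejected.
    boolean pay(String orderId) {
        if (!orders.replace(orderId, State.RCVD, State.PAID)) {
            return false; // already PAID or FULFILLED, or unknown order
        }
        // Stand-in for the payment call. Simplification: a real system must
        // also handle a charge that fails after the state has been updated.
        paymentsCharged.incrementAndGet();
        return true;
    }

    // Step 4: only an order in PAID can move to FULFILLED.
    boolean fulfil(String orderId) {
        return orders.replace(orderId, State.PAID, State.FULFILLED);
    }
}
```

Sending the same key to intake twice returns the same orderId, and calling pay or fulfil a second time returns false without repeating the side effect.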

This is a use case for the demonstration of idempotency. For production, you need to think about your use case and how you can retrofit this model.

Conclusion

Idempotency is crucial for building resilient distributed systems at scale without impacting data integrity. Retry Strategy with Idempotency is one of the finest alternatives to distributed transactions, which are more complex and harder to manage as we scale.

Passionate technology professional, enthusiastic developer, and a devoted reader