Deep Dive into Remote Procedural Calls (RPC)

The goal of using RPC is to make communication between the client and server seem like a regular procedural call.

Melodies Sim
Level Up Coding

--

I was first introduced to Remote Procedural Call (RPC) while working on a software engineering entry task back in December. I soon learned that RPC is an essential component employed in many of my company’s distributed infrastructures. As a new intern, I didn’t fully understand the benefits RPC brings to the table. I was just contented to finish up my entry task and start working on my internship project.

If you are in the same boat as me, let me try to explain what RPC is. RPC is a layer of abstraction that enables easy-to-program communication between a client and server. With RPC, messy details of network protocols are encapsulated and hidden away. It is no secret that RPC holds a major role in distributed systems today.

In writing this short article, I hope to provide some intuitions behind RPC, something I wished I could appreciate more when I was implementing my entry task back then. I will also discuss some edge cases in the event of network and server failures, and general techniques that can be used to deal with such failures.

This article is the second post of the Deep Dive into Distributed System Series, which seeks to dig deeper into common concepts or mechanics used in distributed systems. Check out the first post here.

Role of RPC in Client-Server Architectures

To understand how RPC plays its role as a key piece of distributed system machinery, we have to examine the goals of RPC and the context where RPC is needed. RPC is used mainly in the client-server model as a layer of abstraction to simplify the inter-communication between clients and servers.

In many distributed systems, the client-server model is used to partition workloads between providers of a service called servers and service requesters called clients [1]. Often clients and servers communicate over a computer network on separate hardware, but both client and server may reside in the same machine [1]. A client usually initiates communication services with servers, which await incoming requests. For example in a web application, clients refer to the front-end which the users directly interact with, and servers refer to the backend that processes requests and returns the result to the client processes.

The goal of RPC is to make communication between the client and server seem like a regular procedural call. From the client’s point of view, RPC abstracts away the details of making network calls, and of parsing requests and responses, making it seems that the client is just calling a regular function to process the request. This simplifies things for the client — the client can focus on the main logic of the app and leave the details to the RPC library.

Structure of how RPC is used in client-server communication.

As seen in the Figure above, a client will call stub functions (that seem like another regular function), which contains calls to the RPC library. The RPC library marshals the requests and sends them over the network to the server. The request is received and unmarshalled by the RPC library before getting dispatched to the respective handler function in the server. After the handler functions generate the desired result, the RPC library marshalls the result into a response which is sent back to the client for unmarshalling. The client will receive the result as a return value of the stub function that was initially called.

Edge Cases: How to Deal with Failures

Distributed systems typically involve tens to thousands of machines. Failures are inevitable. We have to deal with failures to provide a robust and available service. Some of the most common forms of failure include lost packets, broken network communication, slow or crashed servers.

What does a failure look like to the Client RPC library?

However, it is rarely straightforward to detect what failures have occurred. Let’s consider how a failure will look like from the client RPC’s perspective. Here’s we list a few possible scenarios that could happen in event of failures:

  • The client never sees a response from the server.
  • The client does *not* know if the server saw the request!
  • Maybe the server never saw the request.
  • Maybe the server executed but crashed just before sending a reply.
  • Maybe the server executed, but the network died just before delivering the reply.

The figure below depicts three failure scenarios where the client does not receive a response and is unable to distinguish between the three failures.

All three scenarios above are indistinguishable from the client’s perspective — the client did not receive a response in any of the three cases.

Handling Failures in RPC

The simplest way to handle failures is using a “Best-Effort” approach where the client will retry several times before giving up. For example, let’s consider a client executes a Get(key) command in a simple key-value store (server):

  1. Get(key) will wait for a response for a pre-defined timeout period.
  2. If no response is received after the timeout, re-send the request.
  3. Repeat this a few times.
  4. Give up and return an error to the client.

However, as some of you can tell, the best-effort approach is not the best (pun intended). It can sometimes even cause serious problems in some applications. Consider the case when the same client tries to execute 2 different Put(key, value) commands:

  • Put(5, 10)
  • Put(5, 20) — run shortly after Put(5,10)

What will Get(5) yield? Depending on the network conditions and packet delay, even though Put(5,20) is executed by the client after Put(5,10), it is still possible to yield a value of 10 from Get(5). Can you see why?

For operations that are not idempotent (i.e. applying the operations more than once will result in a different result), the best-effort approach can also lead to unexpected results. For example, addition is not an idempotent operation because if Add(1) is executed twice to a variable of initial value 0, we will get 2, instead of 1 when Add(1) is executed only once. In contrast, an idempotent operation would be Put(5, 5) because executing it twice will result in the same result as executing it only once. Suppose that a client retries to execute an Add(1) command after a timeout, however, the first Add(1) command arrives at the server after the timeout. The state will be 2, instead of the expected 1.

Depending on the business requirements of the service, these behaviors may not be acceptable. A better way to handle failure is the “at most once” scheme. The idea is that the server will detect duplicate requests using unique IDs and return the previous reply instead of re-running the handler function. This means that non-idempotent operations will be executed at most once, avoiding the problem described above.

How about an “exactly-once” scheme where we can guarantee that every request sent is executed only once eventually? This is the ideal case but it is more complicated to implement in practice. The intuition behind such a scheme would be to use unbounded retries with duplicate detection. A fault-tolerant system is also required to ensure that no requests are lost due to server crashes.

References

[1] Client-server model — Wikipedia https://en.wikipedia.org/wiki/Client%E2%80%93server_model

[2] MIT 6.824 RPC and Threads Note https://pdos.csail.mit.edu/6.824/notes/l-rpc.txt

--

--

Passionate in distributed systems, C++ things. Follow me on twitter @melodiessim