A Deep Dive into the TCP Connection Establishment Process
With lots of links to the Linux kernel TCP stack
While working on the Redis Internals series, I decided to take a quick look at the TCP implementation in the latest commit on the Linux kernel master branch. Who knew it would turn into a week-long detour? But boy was it worth it!
Let’s get right into it.
Overview: the Three-Way Handshake and All of Its Friends
Here, I start with the basics and then narrow my focus to connection establishment specifically.
What is a TCP socket?
It’s just a programming construct used in the Linux kernel source code: an abstraction over how Linux communicates with other hosts across a network.
Client-server communication: a client’s view
When an application wants to exchange data with a server, it first has to establish a connection. As a result, the Linux kernel creates a socket containing all the information needed to send requests in a reliable way and returns its identifier, a descriptor, to the application. The application can then send a request using this socket. When it does so, the underlying Linux TCP stack temporarily stores the request in the socket; it’s actually sent over the network a few moments later. This process is referred to as “writing to a socket”.
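To make this concrete, here is a minimal client-side sketch in C. Error handling is omitted, and the 127.0.0.1:8080 address is just a placeholder.

```c
/* Minimal client sketch: connect() triggers the three-way handshake,
 * write() hands the request to the TCP stack, which buffers it in the
 * socket and sends it shortly after ("writing to a socket"). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);     /* the descriptor handed back to the application */

    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8080);                   /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) == 0) {
        const char req[] = "PING\r\n";
        write(fd, req, sizeof(req) - 1);          /* buffered by the kernel, sent a moment later */
    }

    close(fd);
    return 0;
}
```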
Client-server communication: a server’s view
A server must create at least one listening socket to be able to accept client connections. When a client sends a TCP connection request, the server’s listening socket receives it, and the Linux kernel creates a new socket specifically for further data exchange with this client. This newly created socket is called a connection socket. The server application needs its identifier, or descriptor, to be able to read this particular client’s requests and write responses to it. To obtain it, the server accepts the connection request by issuing a specific command. Once that’s done, data exchange becomes possible.
When the server receives a request, the server-side Linux TCP stack stores it in the socket until the application reads it. That’s why the process of receiving data is often called “reading from a socket”.
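And a matching minimal server-side sketch, again with error handling omitted and an arbitrary port:

```c
/* Minimal server sketch: listen() turns the socket into a listening one,
 * accept() dequeues an already-established connection socket, and
 * read() pulls data the kernel has buffered ("reading from a socket"). */
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);      /* listening socket */

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                    /* placeholder port */

    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);                               /* 128 is the backlog hint for the accept queue */

    int cfd = accept(lfd, NULL, NULL);              /* connection socket for one client */

    char buf[4096];
    ssize_t n = read(cfd, buf, sizeof(buf));        /* read the client's request */
    if (n > 0)
        write(cfd, buf, (size_t)n);                 /* write a (trivial echo) response */

    close(cfd);
    close(lfd);
    return 0;
}
```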
Connection establishment
The process of connection creation in TCP is called the three-way handshake. It has three stages. First, a client sends a SYN packet. The server receives it and creates a data structure typically referred to as a connection request, or mini socket. It holds the information the future connection socket will need, so it has to be stored until the connection is fully established. After this, the server sends a confirmation that the SYN packet was processed; it’s sent in a SYN-ACK packet. When the client receives it, it finally sends the last packet, an ACK. The server’s Linux kernel then creates a connected socket and waits for the server application to obtain, or accept, it.
TCP segments, sequence numbers, and acknowledgements
The TCP stack receives a stream of bytes and splits it into chunks known as TCP segments. Their size is agreed upon during connection establishment and is known as the MSS: maximum segment size. The name is misleading: it denotes the maximum size of the TCP segments’ payload that the current connection will use, that is, it excludes the TCP and IP header sizes. What’s important here is that the sending side can suggest its own value in the SYN packet, but ultimately it’s up to the receiving side which value to select. Generally, it doesn’t choose a value greater than what the client proposed. This value is then communicated back to the sending side in the SYN-ACK packet.
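An application can peek at the value in effect via the TCP_MAXSEG socket option. A rough sketch, assuming `fd` is an already-connected TCP socket:

```c
/* Sketch: inspect the MSS in effect for a connection.
 * Assumes `fd` is an already-connected TCP socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

static void print_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);

    /* TCP_MAXSEG reports the maximum payload size per segment for this connection;
     * setting it before connect() only caps what the kernel will advertise. */
    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("MSS: %d bytes of payload per segment\n", mss);
}
```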
Each TCP segment contains the sequence number of the first byte of its payload within the current connection; it’s called simply the sequence number, for short. It doesn’t start at zero: if it did, an attacker could guess it and send counterfeit packets. That’s why it’s chosen randomly by both sides during the three-way handshake. Each side generates its own sequence number, sends it to its counterpart, and waits for an acknowledgment:
When a host receives a sequence number generated by the other host, it simply increments it and sends it back. Thus, sequence number validation is trivial for the original sender.
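A tiny worked example with made-up initial sequence numbers:

```c
/* Toy illustration of the sequence/acknowledgement arithmetic;
 * the initial sequence numbers are made up. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t client_isn = 1000;   /* client's randomly chosen initial sequence number */
    uint32_t server_isn = 5000;   /* server's randomly chosen initial sequence number */

    /* SYN:     client -> server   seq = client_isn                           */
    /* SYN-ACK: server -> client   seq = server_isn, ack = client_isn + 1     */
    /* ACK:     client -> server   seq = client_isn + 1, ack = server_isn + 1 */
    printf("SYN-ACK acknowledges %u, final ACK acknowledges %u\n",
           client_isn + 1, server_isn + 1);

    /* First data segment: 100 bytes of payload starting at client_isn + 1.
     * The server acknowledges the next byte it expects to receive. */
    uint32_t payload_len = 100;
    printf("after the data arrives, the server acks %u\n",
           client_isn + 1 + payload_len);
    return 0;
}
```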
After the three-way handshake has completed successfully, data exchange looks like this:
Queues
When a SYN packet arrives, the server-side Linux kernel creates a connection request and saves it in a structure historically referred to as a queue; elsewhere you may see it called a “half-open connection queue”, an “incomplete connection queue”, a “SYN queue”, a “connection request queue”, or a “request socket queue”. The last one owes its name to the structure which represents a connection request, a request_sock:
After the ACK packet arrives, the connection request is removed from the connection request queue and a connection socket is created. It’s put in another queue, which is known by a variety of names: accept queue, backlog queue, pending connection queue, or connection backlog. When a server application “accepts” a connection (using some command that boils down to an accept system call), it simply dequeues an existing socket from the accept queue.
You’ll see a more detailed explanation of this process below.
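One practical consequence of the accept queue is that the kernel completes the handshake on the application’s behalf: a client’s connect() can succeed before the server ever calls accept(). A rough sketch that demonstrates this on the loopback interface (the port is arbitrary and error handling is omitted):

```c
/* Sketch: connect() succeeds even though accept() is never called, because
 * the kernel finishes the three-way handshake on its own and parks the new
 * connection socket in the accept queue. Loopback and port 9090 are arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9090);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 16);                               /* listening socket; note: no accept() below */

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        printf("handshake completed; the server-side socket sits in the accept queue\n");

    close(cfd);
    close(lfd);
    return 0;
}
```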
SYN-flood DDoS attack
What happens if lots of clients send a SYN packet but never send an ACK? In older Linux versions, half-open connections were added to the accept queue. This caused problems for legitimate connections, since the accept queue is home to connection sockets that have not been accepted yet.
In newer Linux versions, there is separate storage for connection requests. But server resources can get exhausted anyway: the Linux kernel has to store all those connection requests until an ACK packet arrives, which, when this attack is carried out, never happens.
But there is a workaround.
Syncookies
Actually, almost all of the data a connection request contains can be obtained at the ACK-processing step. The exception is the MSS value sent by the initiating host: it sends it only in the SYN packet. Frankly, I don’t think any fundamental principle would be violated if the initiating host also sent the MSS in the ACK packet. To me, it looks more like a TCP design flaw.
However, TCP does the following. When the receiving host processes a SYN packet, it extracts the MSS value, clamps it to a user-defined value if one is set, and encodes it in the sequence number which is sent back to the sending host in the SYN-ACK packet. The sending host acknowledges the SYN-ACK packet just as it always does: it increments the receiving host’s sequence number and sends it in the ACK packet. The receiving host extracts the MSS value from it and calculates the final result.
This mechanism can be totally disabled, enabled only when the connection request queue is full, or always used regardless of the queue’s state.
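The switch is the net.ipv4.tcp_syncookies sysctl, also exposed as /proc/sys/net/ipv4/tcp_syncookies: 0 disables syncookies, 1 (the default) enables them only when the queue is full, and 2 forces them on unconditionally. A minimal sketch that reads the current mode:

```c
/* Sketch: read the current syncookies mode.
 * 0 = disabled, 1 = used only when the request queue overflows (default), 2 = always used. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_syncookies", "r");
    if (f) {
        int mode = 0;
        if (fscanf(f, "%d", &mode) == 1)
            printf("net.ipv4.tcp_syncookies = %d\n", mode);
        fclose(f);
    }
    return 0;
}
```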
Now, let’s dive into the source code.
The Three-Way Handshake with no Syncookies
First, here is an image again of what a three-way handshake looks like when syncookies are not used:
Let’s consider it step by step.
Receiving host receives a SYN packet
When a client initiates the three-way handshake, it sends a SYN packet to the server. There, the TCP layer’s entry point is the tcp_v4_rcv procedure. When the packet is received, a listening socket is looked up. If it’s found, its state is TCP_LISTEN. Through the tcp_v4_do_rcv() helper procedure, we reach tcp_rcv_state_process(). This procedure is effectively a state machine applied to the socket that processes the current packet: it first checks the socket’s state, then runs some logic based on it, and advances the state accordingly. There, I fall into the clause indicating that this is the SYN packet processing stage. Then, through the tcp_v4_conn_request() procedure, I find myself in tcp_conn_request(). I’m interested in the case when syncookies are not used but normal operation is possible. This boils down to a single case: syncookies are either disabled or have the default value of 1, and the request socket queue is not full (given that the listening socket is not in a TIME_WAIT state).
As a side note, when tcp_syncookies is set to 2, syncookies are always used, even when the request socket queue is not full.
A request socket is created. As I mentioned above, a request socket is not a full-blown connection socket used for data exchange; instead, it’s a connection request, or a mini socket, as it’s typically referred to in the Linux source code. Its ireq_state is TCP_NEW_SYN_RECV, a listener socket is attached to it, and the SYN packet is saved. Finally, the request_sock is cast to a sock struct and added to a “queue”. Well, actually it’s not a queue. It’s a mechanism consisting of two parts:
- a hash table, usually referred to as the ehash, which is used for socket lookups;
- and, however crazy it sounds, an accept queue that tracks its length through the qlen field. To make things more confusing, the accept queue is represented by a struct named request_sock_queue.
This “queue” is often referred to as a half-open connection queue, or a SYN queue, as I mentioned above. Here we can see its length being incremented when a new connection request is added.
And finally, a SYN-ACK is sent.
So, for now, no full-blown socket is created. Only a mini socket is, which is a request_sock struct disguised as a normal socket represented by a sock structure.
Receiving host receives an ACK packet
As during the SYN packet processing step, a socket is looked up. In the __inet_lookup procedure, __inet_lookup_established is issued first. It looks up a socket in the ehash. And there is one indeed: the temporary request_sock cast to a sock struct, which was added during the SYN packet processing stage. Its state is still TCP_NEW_SYN_RECV, so we land in the sk->sk_state == TCP_NEW_SYN_RECV clause. Then, tcp_check_req, invoked there, creates a child socket by cloning the listening socket. This newly created socket is in the TCP_SYN_RECV state. It’s added to the ehash, so __inet_lookup_skb will find it when the client starts exchanging data. Right after this normal socket is added, the mini socket is removed from the ehash; it’s not needed there anymore.
Then, the socket is added to the tail of the accept queue and its length is incremented. From an implementation point of view, it’s not actually a socket. Instead, it’s the same request_sock created at the SYN stage, but it now stores the child socket created earlier. It’s a funny asymmetry: the request queue conceptually holds connection requests, although technically it contains sockets (not to mention it’s not a queue whatsoever), while the accept queue conceptually contains sockets, although technically it consists of connection requests. The length of the accept queue is tracked through the sk_ack_backlog field of the listening socket.
Finally, tcp_child_process is called. It calls the already familiar tcp_rcv_state_process, but this time for the newly created socket. Its state is TCP_SYN_RECV, and it’s set to TCP_ESTABLISHED. It’s worth noting that each socket arriving in the accept queue is in the TCP_SYN_RECV state; it becomes established a few moments later.
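From user space this transition is invisible: by the time accept() returns a descriptor, the socket is already established. If you want to double-check, the TCP_INFO socket option exposes the state. A small sketch, assuming `cfd` was just returned by accept():

```c
/* Sketch: confirm that a socket handed out by accept() is already established.
 * Assumes `cfd` is a descriptor freshly returned by accept(). */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

static void check_state(int cfd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(cfd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
        printf("tcpi_state = %u (%s)\n", (unsigned)info.tcpi_state,
               info.tcpi_state == 1 ? "TCP_ESTABLISHED" : "something else");
}
```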
The Three-Way Handshake with Syncookies
I won’t go into the same level of detail as in the previous section; much of the implementation remains the same. I’ll only highlight the differences between the scenario where syncookies are used and the one where they’re not.
First, here is a flow:
Receiving host receives a SYN packet
As in the previous step, tcp_rcv_state_process is invoked through the tcp_v4_rcv entry point and the tcp_v4_do_rcv helper. There, in the tcp_conn_request procedure, we read the sysctl_tcp_syncookies setting. It’s 1 by default. The syncookies mechanism is used if it’s set to 2, or if the connection queue is full, provided the listening socket is not in a TIME_WAIT state, namely, its tcp_tw_isn field is empty.
Then, a syncookie is generated. For TCP over IPv4, the cookie_v4_init_sequence procedure does that. The syncookie is a sum of the following values:
- two hashes of the addresses and ports of the source and the destination;
- the Linux internal representation of the current time;
- a clamped MSS value and the sequence number sent by the initiating host.
Then it’s all sent to the client in the seq field of the SYN-ACK packet.
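To get a feel for how an MSS table index and a timestamp can survive a round trip inside the sequence number, here is a toy sketch. It is not the kernel’s cookie_v4_init_sequence: the hash is a stand-in for the keyed hash the kernel uses, the bit layout is simplified, and the numbers in main() are made up.

```c
/* Toy syncookie sketch -- NOT the kernel's cookie_v4_init_sequence. It only shows
 * how a tiny MSS table index plus a timestamp can be folded into the sequence
 * number and recovered from "ack - 1" later. The hash is a stand-in, and the
 * timestamp is assumed to fit in 8 bits (the kernel handles wrap-around properly). */
#include <stdint.h>
#include <stdio.h>

/* Candidate MSS values; the peer's MSS is rounded down to one of these, so only
 * an index in the range 0..3 has to survive inside the cookie. */
static const uint16_t msstab[] = { 536, 1300, 1440, 1460 };

/* Stand-in for the keyed hash of the 4-tuple that the kernel uses. */
static uint32_t toy_hash(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, uint32_t salt)
{
    uint32_t h = saddr ^ (daddr * 2654435761u) ^ ((uint32_t)sport << 16) ^ dport ^ salt;
    h ^= h >> 13; h *= 2246822519u; h ^= h >> 16;
    return h;
}

/* Encode: choose the SYN-ACK's initial sequence number so it carries the index. */
static uint32_t cookie_encode(uint32_t saddr, uint32_t daddr, uint16_t sport,
                              uint16_t dport, uint32_t client_seq,
                              uint32_t minute, uint32_t mss_index)
{
    uint32_t h1 = toy_hash(saddr, daddr, sport, dport, 0);
    uint32_t h2 = toy_hash(saddr, daddr, sport, dport, minute);
    /* high 8 bits: timestamp; low 24 bits: (h2 + index), tied to the 4-tuple */
    return h1 + client_seq + (minute << 24) + ((h2 + mss_index) & 0xffffff);
}

/* Decode: subtract everything that can be recomputed and check what is left. */
static int cookie_decode(uint32_t saddr, uint32_t daddr, uint16_t sport,
                         uint16_t dport, uint32_t client_seq,
                         uint32_t now_minute, uint32_t cookie)
{
    uint32_t h1 = toy_hash(saddr, daddr, sport, dport, 0);
    uint32_t rest = cookie - h1 - client_seq;   /* (minute << 24) + masked (h2 + index) */
    uint32_t minute = rest >> 24;

    if (now_minute - minute >= 2)               /* cookie older than ~2 minutes: reject */
        return -1;

    uint32_t h2 = toy_hash(saddr, daddr, sport, dport, minute);
    uint32_t idx = (rest - h2) & 0xffffff;
    return idx < 4 ? (int)idx : -1;             /* msstab index, or -1 if invalid */
}

int main(void)
{
    /* made-up 4-tuple, client ISN, timestamp (in minutes), and MSS index */
    uint32_t cookie = cookie_encode(0x0a000001, 0x0a000002, 40000, 80, 12345, 7, 3);
    int idx = cookie_decode(0x0a000001, 0x0a000002, 40000, 80, 12345, 7, cookie);

    if (idx >= 0)
        printf("cookie valid, recovered MSS: %d\n", (int)msstab[idx]);
    return 0;
}
```

The point is that the server stores nothing: everything it needs later is either recomputable from the ACK’s headers or encoded in the acknowledged sequence number.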
Although a request socket is created as part of the implementation, it’s not put in the connection request queue. Instead, it’s disposed of.
Receiving host receives an ACK packet
As always, everything starts with the tcp_v4_rcv procedure. Since there is no connection request in the ehash, only the listening socket can be found. It’s in a TCP_LISTEN state, so we find ourselves in tcp_v4_do_rcv. There, the syncookies machinery is started. First, the syncookie value is validated:
- It’s checked that the syncookie is less than 120 seconds old;
- Then, the encoded MSS table index is extracted. If it falls in the range from 0 to 3, the syncookie is considered valid and the corresponding MSS value is returned.
If everything is OK, a socket is created. More precisely, it’s done by the tcp_v4_syn_recv_sock procedure, the same one used when syncookies are disabled. This socket ends up in the accept queue in the same way, too. For now, it’s still in the TCP_SYN_RECV state.
The three-way handshake completes similarly to the previous section, with the tcp_child_process procedure invoked for the child socket.
In closing
So that’s about it. This topic is truly extensive; I have highlighted only the details that were of particular interest to me. Hopefully, there will be more posts about the intricacies of the TCP implementation.
UPD: the original version of this post contained a mistake in the syncookies explanation. Thanks to the noble and benevolent Hacker News souls who pointed it out, in a kind and gentle manner that we all lack in the darker places of the Internet.