Network timeouts

November 17, 2019

Many changes in distributed systems can cause network-related timeouts. TCP/IP provides plenty of feedback for different kinds of network errors. If we use this feedback properly, we can get rid of many network timeouts.

For a simple HTTP request, 3 typical network timeouts could happen:

DNS lookup timeout
TCP connection timeout
Timeout after TCP connection is established

All network timeouts indicate loss of error feedback information. If the error feedback information propagates back to an application, the application can take immediate actions without waiting for the timeout to happen.

In this post, we first examine different error feedback signals, then look into each of the above 3 types of timeouts.

We need to pay special attention to the 3rd type of timeout because it is the most likely to be overlooked. For example, gRPC with its default configuration does not handle this type of timeout properly: when the error happens, it could take gRPC more than 13 minutes to reset a connection.

1. Network error feedback

Many network problems have feedback signals in the form of TCP control flags or ICMP messages.

1.1. TCP `RST`

A host replies with an RST packet when it receives an invalid TCP packet, for example when a SYN arrives on a closed port or an ACK arrives without a proper handshake. In most real-life cases, an RST is received when the remote host is up but the remote server process is down.

Depending on the local TCP connection state, syscalls like connect() or send()/recv() return different error values when the connection receives RST messages, as shown in the following kernel code snippet:

	switch (sk->sk_state) {
	case TCP_SYN_SENT:
		sk->sk_err = ECONNREFUSED;
		break;
	case TCP_CLOSE_WAIT:
		sk->sk_err = EPIPE;
		break;
	case TCP_CLOSE:
		return;
	default:
		sk->sk_err = ECONNRESET;
	}

The connect() syscall returns ECONNREFUSED. The send() and recv() syscalls could return ECONNRESET or EPIPE.
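Both outcomes can be reproduced on the loopback interface. The following Python sketch is illustrative only: the helper names are invented for this example, and it assumes Linux behavior, where a port that is bound but not listening answers a SYN with RST, and where close() with a zero SO_LINGER timeout sends RST instead of FIN.

```python
import errno
import socket
import struct


def connect_to_closed_port() -> int:
    """Return the errno from connect() when nothing listens on the port."""
    # Bind without listen(): the port is reserved, but a SYN is answered with RST.
    placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    placeholder.bind(("127.0.0.1", 0))
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    try:
        client.connect(placeholder.getsockname())
    except OSError as exc:
        return exc.errno
    finally:
        client.close()
        placeholder.close()
    return 0


def recv_after_peer_reset() -> int:
    """Return the errno from recv() after the peer aborts the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    server_side, _ = listener.accept()
    # l_onoff=1, l_linger=0: close() discards the connection and sends RST.
    server_side.setsockopt(
        socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
    )
    server_side.close()
    try:
        client.recv(1)
    except OSError as exc:
        return exc.errno
    finally:
        client.close()
        listener.close()
    return 0


if __name__ == "__main__":
    print(errno.errorcode[connect_to_closed_port()])
    print(errno.errorcode[recv_after_peer_reset()])
```

The first helper exercises the TCP_SYN_SENT branch of the kernel snippet, the second the default branch.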

1.2. TCP `FIN`

FIN messages do not indicate errors. A TCP endpoint sends FIN messages when it closes the connection.

When a TCP endpoint receives a FIN message from its remote peer, the TCP connection enters the CLOSE_WAIT state, also known as the "half-closed" state. Once any buffered data has been read, every recv() syscall returns 0, which indicates EOF, and the endpoint should close the connection.

If the endpoint does not close the connection, it can still call send(). The send() may succeed, and data packets are sent to the remote peer. What happens next depends on whether the remote host is up:

If the remote host is up, it replies with an RST message, causing the sender's subsequent send() syscalls to return EPIPE.
If the remote host is down, the sender may receive no feedback (assuming no ICMP). The sender then retransmits packets and eventually times out. The retransmission timeout is described in section 3.3.
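The first case (remote host up) can be observed on the loopback interface. This is a rough Python sketch under the assumption of Linux semantics; the function name is hypothetical, and the retry loop only exists because the RST may arrive slightly after the first send() returns.

```python
import errno
import socket
import time


def send_after_peer_close():
    """Return (saw_eof, errno of the failing send) after the peer closes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    server_side, _ = listener.accept()

    server_side.close()              # the peer sends FIN
    saw_eof = client.recv(1) == b""  # the client is now in CLOSE_WAIT

    send_errno = None
    try:
        for _ in range(50):
            client.send(b"x")        # the first send succeeds; the peer answers RST
            time.sleep(0.01)
    except OSError as exc:
        send_errno = exc.errno
    finally:
        client.close()
        listener.close()
    return saw_eof, send_errno


if __name__ == "__main__":
    eof, err = send_after_peer_close()
    print(eof, errno.errorcode.get(err))
```

Python ignores SIGPIPE by default, so the failing send() surfaces as an exception carrying EPIPE rather than terminating the process.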

1.3. ICMP `DEST_UNREACH`

ICMP DEST_UNREACH messages cover many types of network problems. They can be generated by the remote endpoint as well as by any intermediate router. The following table summarizes how the kernel handles some DEST_UNREACH messages and the return values of the related syscalls.

ICMP code	UDP sendto	UDP connected send	TCP connect	TCP send/recv
NET_UNREACH	no error	no error	ENETUNREACH	retry
HOST_UNREACH	no error	no error	EHOSTUNREACH	retry
PORT_UNREACH	no error	ECONNREFUSED	ECONNREFUSED	retry
PKT_FILTERED	no error	EHOSTUNREACH	EHOSTUNREACH	retry

The NET_UNREACH message is normally generated by a router when it cannot find a route to the destination.

The HOST_UNREACH message is generated by a router when it cannot resolve the next hop MAC address via ARP. Normally 3 ARP requests are sent in 3 seconds. If no ARP response is received, the router returns HOST_UNREACH.

The PORT_UNREACH message is generated by the remote endpoint for UDP when the port is not open. For TCP, RST is generated instead of PORT_UNREACH when the port is not open.

The PKT_FILTERED error means "Communication Administratively Prohibited". It is generated "if a router cannot forward a packet due to administrative filtering".

These ICMP messages can also be generated with the ip route or iptables commands. For example, both of the following commands generate PKT_FILTERED messages for a destination.

# ip route add prohibit <dest>
# iptables -I FORWARD -d <dest> -j REJECT --reject-with icmp-admin-prohibited

For UDP, unconnected sockets ignore all ICMP errors. Connected sockets forward "hard" ICMP errors to applications. The NET_UNREACH and HOST_UNREACH errors are transient (soft) errors and, according to RFC 1122 section 3.2.2.1, are not forwarded to applications. The other 2 errors are hard errors and are therefore forwarded to applications.

For TCP, the connect() syscall reports all ICMP errors. After a connection is established, TCP no longer reports ICMP errors to applications. Instead, TCP keeps retrying until the retransmission times out (>13 mins). If all retries fail, the send() or recv() syscall returns the same error that connect() would return. TCP does not immediately notify applications of ICMP errors in order to counter ICMP attacks, which are described in RFC 5927.

For both UDP and TCP, if the socket option IP_RECVERR is enabled, then all ICMP errors are propagated to the user application.
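The UDP rules can be checked with a PORT_UNREACH error on the loopback interface. The sketch below is an illustration under stated assumptions rather than a reference implementation: the helper names are hypothetical, the behavior is that of Linux, and the fallback value 11 for IP_RECVERR is the constant from <linux/in.h>, used only if the Python build does not export it.

```python
import errno
import socket

# Assumption: 11 is the Linux value of IP_RECVERR; used only as a fallback.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)


def unused_udp_port() -> int:
    """Pick a UDP port that was free a moment ago (small race, fine for a demo)."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()
    return port


def probe_closed_udp_port(connected: bool, recverr: bool, wait: float = 0.5):
    """Send to a closed UDP port; return the errno seen by recv(), or None."""
    target = ("127.0.0.1", unused_udp_port())
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(wait)
    try:
        if recverr:
            sock.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)
        if connected:
            sock.connect(target)
            sock.send(b"ping")
        else:
            sock.sendto(b"ping", target)
        sock.recv(1)                 # the ICMP error, if reported, surfaces here
    except socket.timeout:
        return None                  # the kernel consumed the ICMP error silently
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()
    return 0


if __name__ == "__main__":
    for connected, recverr in [(False, False), (True, False), (False, True)]:
        result = probe_closed_udp_port(connected, recverr)
        label = errno.errorcode.get(result, "no error reported")
        print(f"connected={connected} IP_RECVERR={recverr}: {label}")
```

Only the unconnected socket without IP_RECVERR is expected to report nothing; the other two combinations should see ECONNREFUSED.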

2. Information loss

As shown in the previous section, information could get lost in the kernel TCP stack. Unless the socket option IP_RECVERR is enabled, many ICMP messages are consumed by the kernel without notifying applications.

Another cause of information loss is packet drops. IP networks are lossy, and any packet can be dropped. Additionally, firewalls may be configured to drop ICMP messages for security reasons.

When NAT is used, ICMP messages are translated and returned to the original sender transparently.

When tunneling is used, ICMP messages may get dropped depending on the tunnel implementation. For example, according to RFC 2003, an IP-in-IP tunnel should return ICMP messages to the original sender. However, Linux does not convert and return ICMP messages generated in the tunnel transit network.

There are also network errors that do not generate any feedback signal at all. For example, when SNAT cannot allocate a new port for a connection due to conflicts, the SYN packet is silently dropped. More details of this error can be found here.

3. Timeouts

In general, applications could (and should) always enforce timeouts by canceling an operation after waiting for some time without getting a response. If an application does not enforce timeouts, the lower layers, i.e. glibc and kernel, have some default timeouts. The following table summarizes the default durations of different types of timeouts:

type	duration
DNS lookup timeout	10 seconds * number of nameservers
TCP connection timeout	127 seconds
TCP idle timeout w/o keepalive	infinity
TCP idle timeout with keepalive	> 2 hours
TCP retransmission timeout	13 to 30 minutes
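As a minimal sketch of application-enforced timeouts, the following Python helper bounds both the connection attempt and each subsequent socket operation. The function name and parameters are hypothetical; note that settimeout() limits each individual call, not the whole exchange.

```python
import socket


def request_with_timeouts(address, payload: bytes,
                          connect_timeout: float, io_timeout: float) -> bytes:
    """Send payload and read one reply, giving up instead of waiting forever."""
    # connect() fails after connect_timeout instead of the kernel default.
    with socket.create_connection(address, timeout=connect_timeout) as sock:
        # Each later send/recv call raises socket.timeout after io_timeout.
        sock.settimeout(io_timeout)
        sock.sendall(payload)
        return sock.recv(4096)


if __name__ == "__main__":
    # A listener that accepts connections in its backlog but never answers.
    silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    try:
        request_with_timeouts(silent.getsockname(), b"ping", 1.0, 0.2)
    except socket.timeout:
        print("gave up after the I/O timeout")
    finally:
        silent.close()
```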

3.1. DNS lookup timeout

DNS queries are normally sent by the glibc function res_send(), which sends them in UDP packets to the nameservers. res_send() enforces timeout limits on the DNS queries. The timeout duration is approximately 10 seconds multiplied by the number of nameservers defined in the file /etc/resolv.conf. This blog post has some related details.

The glibc function res_send() uses a connected UDP socket with the socket option IP_RECVERR enabled, so a related ICMP error message makes a DNS query abort immediately. To avoid DNS lookup timeouts, we should therefore make sure that ICMP DEST_UNREACH messages are generated for invalid DNS packets.
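When the ICMP feedback is lost anyway, an application can still bound how long it waits for a lookup. The following Python sketch is one possible approach, not something glibc provides: the function name is hypothetical, and the lookup runs in a worker thread because socket.getaddrinfo() takes no timeout argument.

```python
import concurrent.futures
import socket


def resolve_with_deadline(host: str, port: int, timeout: float):
    """Resolve host, raising concurrent.futures.TimeoutError after timeout seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
    try:
        return future.result(timeout=timeout)
    finally:
        # Do not block on a stuck lookup; the worker thread keeps running in
        # the background until the resolver itself gives up.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    for entry in resolve_with_deadline("localhost", 80, timeout=2.0):
        print(entry[4])
```

The caller regains control at the deadline, but the underlying query is not cancelled, so this limits waiting time rather than resolver work.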

3.2. TCP connection timeout

An application calls connect() syscall to start a TCP 3-way handshake. The kernel TCP stack sends SYN packets and waits for a SYN-ACK packet from the remote peer to complete the handshake. If no SYN-ACK packet is received, the TCP stack retransmits SYN packets with exponential back-off. After a few retries, connect() gives up and returns ETIMEDOUT.

The number of SYN retries can be configured with the sysctl parameter net.ipv4.tcp_syn_retries or the socket option TCP_SYNCNT. The default is 6 retries, so including the original one, 7 SYN packets are sent before the connection attempt is aborted. The first retry interval is the initial retransmission timeout (RTO), which is 1 second, and each subsequent interval doubles. The 7 waits therefore add up to 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds.
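The arithmetic behind the 127 seconds can be written out as a small Python sketch. The function names are hypothetical, and the model simply doubles the wait after every SYN, which is a simplification that holds for the default retry count.

```python
import socket


def syn_schedule(syn_retries: int = 6, initial_rto: float = 1.0):
    """Wait after each SYN (the original plus the retries), doubling each time."""
    return [initial_rto * 2 ** attempt for attempt in range(syn_retries + 1)]


def connect_timeout(syn_retries: int = 6, initial_rto: float = 1.0) -> float:
    """Total time before connect() gives up with ETIMEDOUT in this model."""
    return sum(syn_schedule(syn_retries, initial_rto))


def limit_syn_retries(sock: socket.socket, syn_retries: int) -> None:
    """Per-socket override of net.ipv4.tcp_syn_retries (Linux only)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT, syn_retries)


if __name__ == "__main__":
    print(syn_schedule())        # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
    print(connect_timeout())     # 127.0
    print(connect_timeout(3))    # 15.0
```

Lowering the retry count to 3 in this model shortens the worst case from 127 seconds to 15 seconds.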

The connection timeout happens when the SYN, SYN-ACK or ICMP error messages are dropped in the network.

The remote peer could also drop SYN packets if the socket backlog queue is full. The number of dropped SYN packets can be shown with the following command:

$ netstat -s | grep -i listen
    21 times the listen queue of a socket overflowed
    21 SYNs to LISTEN sockets dropped

3.3. TCP established timeout

When a TCP connection is idle, i.e. there is no outstanding data to transmit, the kernel TCP stack does nothing by default. If an application only receives data from a TCP socket, the recv() call could block forever without noticing network connection errors.

If the socket option SO_KEEPALIVE is enabled, the kernel TCP stack sends "keepalive" probes (ACK messages) when the connection is idle. However, it takes more than 2 hours for the keepalive probes to detect a dead connection. More details of the keepalive probes can be found at the TCP man page.
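The "more than 2 hours" follows from the Linux defaults documented in the TCP man page: 7200 seconds of idle time, then 9 probes sent 75 seconds apart. The Python sketch below shows that arithmetic and how a single socket could use shorter values; the function names and the values 60, 10 and 5 are illustrative assumptions, not recommendations from the original text.

```python
import socket


def keepalive_detection_seconds(idle: int = 7200, interval: int = 75,
                                probes: int = 9) -> int:
    """Time until a dead idle connection is reported, given keepalive settings."""
    return idle + interval * probes


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, probes: int = 5) -> None:
    """Turn on keepalive and override the system-wide defaults for this socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    print(keepalive_detection_seconds())            # 7875
    print(keepalive_detection_seconds(60, 10, 5))   # 110
```

With the defaults, detection takes 7875 seconds, about 2 hours and 11 minutes; with the illustrative per-socket values it takes 110 seconds.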

When there are outstanding packets to send, a host retransmits packets if no ACK is received. The retransmission happens regardless of whether the host receives any ICMP error messages, as described in section 1.3.

The default number of retransmissions is 15, which can be changed via the sysctl parameter net.ipv4.tcp_retries2. The whole retransmission process takes from 13 to 30 minutes depending on the RTO.
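To see why the duration is measured in minutes, the back-off can be modeled in a few lines of Python. This is a simplified model under stated assumptions: the RTO starts at the Linux minimum of 200 ms, doubles after every unanswered transmission, and is capped at 120 seconds. The function name is hypothetical, and a connection with a larger measured RTO takes longer.

```python
def retransmission_timeout(retries: int = 15, rto_min: float = 0.2,
                           rto_max: float = 120.0) -> float:
    """Sum the waits after the original transmission and after each retry."""
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)  # exponential back-off with a ceiling
    return total


if __name__ == "__main__":
    seconds = retransmission_timeout()
    print(f"{seconds:.1f} seconds, about {seconds / 60:.1f} minutes")
```

Under these assumptions the total is 924.6 seconds, roughly 15.4 minutes, which matches the hypothetical figure given for tcp_retries2 in the Linux kernel sysctl documentation.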

In practice, these timeout errors happen in the following cases:

FIN packets from a host are dropped, and then the host goes down.
A host kernel panics: all TCP connections from the host are gone and no FIN packet is sent.
In a Kubernetes cluster using the Calico CNI, when a node is deleted with kubectl delete node, packets from the node are immediately dropped by other nodes due to iptables filters. For example, this problem causes this bug.
A misconfigured firewall drops ICMP FRAG_NEEDED messages, causing timeouts during TCP path MTU discovery.

The following table shows the syscall (recv(), send(), etc.) return values for these timeout errors:

abort reason	received ICMP	errno
keepalive timeout	yes	ETIMEDOUT
keepalive timeout	no	ETIMEDOUT
retransmission timeout	yes	converted ICMP error
retransmission timeout	no	ETIMEDOUT

All these timeout errors are counted in the SNMP MIB counter TCPABORTONTIMEOUT, which can be shown with the following command:

$ netstat -s | grep timeout
    3 connections aborted due to timeout
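On Linux the same counter is exposed in /proc/net/netstat under the name TCPAbortOnTimeout, which makes it easy to monitor from a program. The parser below is a small Python sketch; the function name is hypothetical, and the sample text in the usage section is a shortened, made-up excerpt of the real file format (a header line of names followed by a line of values).

```python
from typing import Optional


def read_tcp_ext_counter(name: str, text: str) -> Optional[int]:
    """Look up one TcpExt counter in the contents of /proc/net/netstat."""
    lines = text.splitlines()
    # The file alternates between a line of names and a line of values.
    for header, values in zip(lines[::2], lines[1::2]):
        if not header.startswith("TcpExt:"):
            continue
        counters = dict(zip(header.split()[1:], values.split()[1:]))
        if name in counters:
            return int(counters[name])
    return None


if __name__ == "__main__":
    sample = (
        "TcpExt: ListenOverflows ListenDrops TCPAbortOnTimeout\n"
        "TcpExt: 21 21 3\n"
        "IpExt: InNoRoutes\n"
        "IpExt: 0\n"
    )
    print(read_tcp_ext_counter("TCPAbortOnTimeout", sample))  # 3

    # On a Linux host, read the live value instead:
    # with open("/proc/net/netstat") as f:
    #     print(read_tcp_ext_counter("TCPAbortOnTimeout", f.read()))
```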

4. Summary

Different network timeouts have different durations and return values. ICMP messages provide error feedback signals for network problems. However, many ICMP messages are not forwarded to applications.

We could reduce network timeout errors by doing the following:

Use ip route or iptables to generate ICMP DEST_UNREACH messages for non-existing hosts.
Make firewall filters send error feedback instead of dropping packets silently.
Do not drop ICMP HOST_UNREACH messages.
Enable socket option IP_RECVERR to receive and handle ICMP errors in applications.