Network timeouts
Many changes in distributed systems can cause network-related timeouts. TCP/IP provides plenty of feedback for different kinds of network errors. If we use this feedback properly, we can avoid many network timeouts.
For a simple HTTP request, 3 typical network timeouts could happen:
- DNS lookup timeout
- TCP connection timeout
- Timeout after TCP connection is established
All network timeouts indicate loss of error feedback information. If the error feedback information propagates back to an application, the application can take immediate actions without waiting for the timeout to happen.
In this post, we first examine different error feedback signals, then look into each of the above 3 types of timeouts.
We need to pay special attention to the 3rd type of timeout because it is the one most likely to be overlooked. For example, gRPC does not handle this type of timeout error properly with its default configuration. When the error happens, it could take gRPC more than 13 minutes to reset a connection.
1. Network error feedback
Many network problems have feedback signals in the form of TCP control flags or ICMP messages.
1.1. TCP RST
A host replies with a `RST` packet when it receives an invalid TCP packet, for example when a `SYN` is received on a closed port, or when an `ACK` is received without a proper handshake. In most real-life cases, `RST` packets are received when the remote host is up, but the remote server process is down.
Depending on the local TCP connection state, syscalls like `connect()` or `send()`/`recv()` return different error values when the connection receives a `RST` message, as shown in the following kernel code snippet:
    switch (sk->sk_state) {
    case TCP_SYN_SENT:
        sk->sk_err = ECONNREFUSED;
        break;
    case TCP_CLOSE_WAIT:
        sk->sk_err = EPIPE;
        break;
    case TCP_CLOSE:
        return;
    default:
        sk->sk_err = ECONNRESET;
    }
The `connect()` syscall returns `ECONNREFUSED`. The `send()` and `recv()` syscalls could return `ECONNRESET` or `EPIPE`.
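The `connect()` case is easy to reproduce on a single machine. The following is a minimal sketch, assuming a loopback interface is available; the helper names `connect_errno` and `refused_demo` are made up for illustration. It reserves a port by binding a socket without calling `listen()`, so the kernel should answer the `SYN` with a `RST`.

```python
import errno
import socket


def connect_errno(host, port, timeout=2.0):
    """Attempt a TCP connect and return the symbolic errno, or "OK"."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
        except socket.timeout:
            # No SYN-ACK, RST or ICMP error arrived before our own deadline.
            return "TIMEOUT"
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return "OK"


def refused_demo():
    """Connect to a port that is bound but not listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as reserved:
        reserved.bind(("127.0.0.1", 0))  # port 0: let the kernel pick one
        port = reserved.getsockname()[1]
        # Nobody called listen(), so the SYN is answered with RST.
        return connect_errno("127.0.0.1", port)


if __name__ == "__main__":
    print(refused_demo())
```

Because the error feedback arrives right away, the call fails long before the 2-second deadline passed to `settimeout()`.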
1.2. TCP FIN
`FIN` messages do not indicate errors. A TCP endpoint sends a `FIN` message when it closes the connection.
When a TCP endpoint receives a `FIN` message from its remote peer, the TCP connection enters the `CLOSE_WAIT` state, also known as the "half-closed" state. All subsequent `recv()` syscalls return 0, which indicates EOF, and the endpoint should close the connection.
If the endpoint does not close the connection, it can still call `send()`. The `send()` call may succeed, and data packets are sent to the remote peer. What happens next depends on whether the remote host is up:
- If the remote host is up, it replies with a `RST` message, causing the sender's subsequent `send()` syscalls to return `EPIPE`.
- If the remote host is down, the sender may receive no feedback (assuming no ICMP). The sender then retransmits packets and eventually times out. The retransmission timeout is described in section 3.3.
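The "remote host is up" case can be observed over loopback. This is a rough sketch under the assumption that the peer process has closed its socket while its host keeps running; `half_close_demo` is a hypothetical helper name, and the short sleep only gives the `RST` time to arrive.

```python
import errno
import socket
import time


def half_close_demo():
    """Return (saw_eof, send_errno) for a client whose peer has closed."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        client = socket.create_connection(listener.getsockname(), timeout=2.0)
        server, _ = listener.accept()

    with client:
        server.close()                      # the peer sends FIN
        saw_eof = client.recv(1024) == b""  # recv() returns 0 bytes: EOF
        send_errno = None
        try:
            for _ in range(5):
                client.send(b"x")           # first send succeeds, peer answers RST
                time.sleep(0.05)            # give the RST time to arrive
        except OSError as exc:
            send_errno = errno.errorcode.get(exc.errno, str(exc.errno))
    return saw_eof, send_errno


if __name__ == "__main__":
    print(half_close_demo())
```

An application that treats the empty `recv()` result as EOF and closes the socket never reaches the failing `send()`.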
1.3. ICMP DEST_UNREACH
ICMP `DEST_UNREACH` messages cover many types of network problems. They could be generated by the remote endpoint as well as by any intermediate router. The following table summarizes how the kernel handles some `DEST_UNREACH` messages and the return values of the related syscalls.
ICMP code | UDP sendto | UDP connected send | TCP connect | TCP send/recv |
---|---|---|---|---|
NET_UNREACH | no error | no error | ENETUNREACH | retry |
HOST_UNREACH | no error | no error | EHOSTUNREACH | retry |
PORT_UNREACH | no error | ECONNREFUSED | ECONNREFUSED | retry |
PKT_FILTERED | no error | EHOSTUNREACH | EHOSTUNREACH | retry |
The `NET_UNREACH` message is normally generated by a router when it cannot find a route to the destination.
The `HOST_UNREACH` message is generated by a router when it cannot resolve the next-hop MAC address via ARP. Normally 3 ARP requests are sent in 3 seconds. If no ARP response is received, the router returns `HOST_UNREACH`.
The `PORT_UNREACH` message is generated by the remote endpoint for UDP when the port is not open. For TCP, a `RST` is generated instead of `PORT_UNREACH` when the port is not open.
The `PKT_FILTERED` error is "Communication Administratively Prohibited". It is generated "if a router cannot forward a packet due to administrative filtering".
These ICMP messages can also be generated with the `ip route` command or the `iptables` command. For example, both of the following commands generate `PKT_FILTERED` messages for a destination:
# ip route add prohibit <dest>
# iptables -I FORWARD -d <dest> -j REJECT --reject-with icmp-admin-prohibited
For UDP, unconnected sockets ignore all ICMP errors. Connected sockets forward "hard" ICMP errors to applications. The `NET_UNREACH` and `HOST_UNREACH` errors are transient (soft) errors and are not forwarded to applications, according to RFC 1122 section 3.2.2.1. The other 2 errors are hard errors and are therefore forwarded to applications.
For TCP, the `connect()` syscall reports all ICMP errors. After a connection is established, TCP no longer reports ICMP errors to applications. Instead, TCP keeps retrying until the retransmission times out (>13 mins). If all retries fail, the `send()` or `recv()` syscall returns the same error that `connect()` returns for that ICMP message. TCP does not immediately notify applications of ICMP errors in order to counteract ICMP attacks. Details of the ICMP attacks are described in RFC 5927.
For both UDP and TCP, if the socket option `IP_RECVERR` is enabled, then all ICMP errors are propagated to the user application.
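The UDP columns of the table can be checked with a small probe against a closed loopback port, which should trigger `PORT_UNREACH`. This is a sketch, not a reference implementation: `udp_probe` and `unused_udp_port` are hypothetical helpers, and the fallback value 11 for `IP_RECVERR` is the Linux constant, used because not every Python version exposes it.

```python
import errno
import socket
import sys

# Assumption: 11 is the Linux value of IP_RECVERR.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)


def udp_probe(host, port, connected, recverr=False, timeout=0.5):
    """Send one datagram and report how the kernel signals the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        if recverr and sys.platform.startswith("linux"):
            sock.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)
        try:
            if connected:
                sock.connect((host, port))
                sock.send(b"ping")
            else:
                sock.sendto(b"ping", (host, port))
            sock.recv(1024)  # a pending ICMP error surfaces here
        except socket.timeout:
            return "TIMEOUT"
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return "OK"


def unused_udp_port():
    """Pick a loopback UDP port that is free once this function returns."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


if __name__ == "__main__":
    port = unused_udp_port()
    print(udp_probe("127.0.0.1", port, connected=True))
    print(udp_probe("127.0.0.1", port, connected=False))
```

The connected socket fails fast, while the unconnected one only notices the problem through our own half-second deadline. On Linux, passing `recverr=True` is expected to make the unconnected socket report the error as well.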
2. Information loss
As shown in the previous section, information could get lost in the kernel TCP stack. Unless the socket option `IP_RECVERR` is enabled, many ICMP messages are consumed by the kernel without notifying applications.
Another cause of information loss is packet drops. IP networks are lossy, and any packet can be dropped. Additionally, firewalls may be configured to drop ICMP messages because of security concerns.
When NAT is used, ICMP messages are translated and returned to the original sender transparently.
When tunneling is used, ICMP messages may get dropped depending on the tunnel implementation. For example, IP-IP tunnel should return ICMP messages to the original sender according to RFC 2003. However, Linux does not convert and return ICMP messages generated in the tunnel transit network.
There are also network errors that do not generate any feedback signal at all. For example, when SNAT cannot allocate a new port for a connection due to conflicts, the `SYN` packet is silently dropped. More details of this error can be found here.
3. Timeouts
In general, applications could (and should) always enforce timeouts by canceling an operation after waiting for some time without getting a response. If an application does not enforce timeouts, the lower layers, i.e. glibc and kernel, have some default timeouts. The following table summarizes the default durations of different types of timeouts:
type | duration |
---|---|
DNS lookup timeout | 10 seconds * number of nameservers |
TCP connection timeout | 127 seconds |
TCP idle timeout w/o keepalive | infinity |
TCP idle timeout with keepalive | > 2 hours |
TCP retransmission timeout | 13 to 30 minutes |
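An application-level deadline is usually a one-line change. The sketch below assumes a blocking socket and uses a local `socket.socketpair()` as a stand-in for a silent remote peer; `recv_with_deadline` is a hypothetical helper name.

```python
import socket


def recv_with_deadline(sock, nbytes, seconds):
    """Return received bytes, or None if nothing arrives within `seconds`."""
    sock.settimeout(seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The application decides what to do next, instead of waiting
        # for the much longer kernel defaults listed above.
        return None


if __name__ == "__main__":
    local, peer = socket.socketpair()
    print(recv_with_deadline(local, 16, 0.1))  # peer is silent
    peer.sendall(b"pong")
    print(recv_with_deadline(local, 16, 0.1))  # data is available
    local.close()
    peer.close()
```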
3.1. DNS lookup timeout
DNS queries are normally sent by the glibc function `res_send()`, which sends queries in UDP packets to nameservers. The `res_send()` function implements timeout limits for the DNS queries. The timeout duration is approximately 10 seconds multiplied by the number of nameservers defined in the file `/etc/resolv.conf`. This blog post has some related details.
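The 10-second figure follows from the resolver defaults documented in resolv.conf(5): a 5-second timeout and 2 attempts. As a rough sketch, the estimate can be derived from the contents of a `resolv.conf` file. The function name `estimate_dns_timeout` is hypothetical, and the product used here is only the approximation quoted above, not glibc's exact retry schedule.

```python
def estimate_dns_timeout(resolv_conf_text):
    """Approximate worst-case lookup time in seconds for a resolv.conf."""
    timeout, attempts, nameservers = 5, 2, 0  # defaults per resolv.conf(5)
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":       # skip blanks and comments
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers += 1
        elif fields[0] == "options":
            for option in fields[1:]:
                name, _, value = option.partition(":")
                if name == "timeout" and value.isdigit():
                    timeout = min(int(value), 30)   # capped at 30
                elif name == "attempts" and value.isdigit():
                    attempts = min(int(value), 5)   # capped at 5
    # At most 3 nameservers are used; none listed means the local machine.
    nameservers = max(1, min(nameservers, 3))
    return timeout * attempts * nameservers


if __name__ == "__main__":
    sample = "nameserver 10.0.0.2\nnameserver 10.0.0.3\n"
    print(estimate_dns_timeout(sample))  # 20
```

The sample addresses are placeholders. Lowering `timeout` or `attempts` in the `options` line shortens the worst case accordingly.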
The glibc function `res_send()` uses a connected UDP socket with the socket option `IP_RECVERR` enabled. A related ICMP error message makes a DNS query abort immediately. So, to avoid DNS lookup timeouts, we should make sure that ICMP `DEST_UNREACH` messages are generated for invalid DNS packets.
3.2. TCP connection timeout
An application calls the `connect()` syscall to start a TCP 3-way handshake. The kernel TCP stack sends `SYN` packets and waits for a `SYN-ACK` packet from the remote peer to complete the handshake. If no `SYN-ACK` packet is received, the TCP stack retransmits `SYN` packets with exponential back-off. After a few retries, `connect()` gives up and returns `ETIMEDOUT`.
The number of SYN retries can be configured with the sysctl parameter `net.ipv4.tcp_syn_retries` or the socket option `TCP_SYNCNT`. The default number of retries is 6. Including the original one, 7 `SYN` packets are sent before the connection is aborted. The first retry interval is the initial retransmission timeout (rto), which is 1 second. With exponential back-off, the 7 attempts take 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds to complete.
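The 127-second figure is a plain geometric sum, which makes it easy to see what a smaller retry count buys. A minimal sketch follows; `connect_timeout_seconds` and `limit_syn_retries` are hypothetical helper names, and the `TCP_SYNCNT` call is guarded because the option is Linux-specific.

```python
import socket


def connect_timeout_seconds(syn_retries=6, initial_rto=1):
    """Total wait before connect() gives up, in seconds."""
    # One wait after the original SYN plus one after each retry,
    # doubling every time: 1 + 2 + 4 + ...
    return sum(initial_rto << attempt for attempt in range(syn_retries + 1))


def limit_syn_retries(sock, syn_retries):
    """Apply a per-socket SYN retry limit where TCP_SYNCNT is available."""
    if hasattr(socket, "TCP_SYNCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT, syn_retries)


if __name__ == "__main__":
    print(connect_timeout_seconds())   # 127
    print(connect_timeout_seconds(3))  # 15
```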
The connection timeout happens when the `SYN`, `SYN-ACK` or ICMP error messages are dropped in the network.
The remote peer could also drop `SYN` packets if the socket backlog queue is full. The number of dropped `SYN` packets can be shown with the following command:
$ netstat -s | grep -i listen
21 times the listen queue of a socket overflowed
21 SYNs to LISTEN sockets dropped
3.3. TCP established timeout
When a TCP connection is idle, i.e. there is no outstanding data to transmit, the kernel TCP stack does nothing by default. If an application only receives data from a TCP socket, the `recv()` call could block forever without noticing network connection errors.
If the socket option `SO_KEEPALIVE` is enabled, the kernel TCP stack sends "keepalive" probes (`ACK` messages) when the connection is idle. However, with the default settings it takes more than 2 hours for the keepalive probes to detect a dead connection. More details of the keepalive probes can be found in the TCP man page.
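The "more than 2 hours" comes from the defaults documented in the TCP man page: 7200 seconds of idle time, then 9 probes sent 75 seconds apart. The sketch below shows the arithmetic and how a socket might opt into shorter values; the helper names are hypothetical, and the `TCP_KEEP*` options are guarded because they are not available on every platform.

```python
import socket


def keepalive_detection_seconds(idle=7200, interval=75, probes=9):
    """Time for keepalive to declare an idle connection dead."""
    return idle + interval * probes


def enable_keepalive(sock, idle, interval, probes):
    """Turn on keepalive and, where supported, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    print(keepalive_detection_seconds())           # 7875
    print(keepalive_detection_seconds(60, 10, 3))  # 90
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        enable_keepalive(sock, idle=60, interval=10, probes=3)
```

Note that keepalive probes are only sent on an idle connection, so they do not shorten the retransmission timeout discussed next.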
When there are outstanding packets to send, a host retransmits packets if no `ACK` is received. The retransmission happens regardless of whether the host receives any ICMP error messages, as shown in section 1.3.
The default number of retransmissions is 15, which can be changed via the sysctl parameter `net.ipv4.tcp_retries2`. The whole retransmission process takes from 13 to 30 minutes depending on the rto.
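The 13 to 30 minute range can be reproduced with a simplified back-off model: 15 waits, each double the previous one, capped at the kernel's maximum rto of 120 seconds, with the first wait somewhere between the minimum rto of 200 ms and that maximum. This is only a sketch of the arithmetic; the kernel's real accounting is time-based and differs in detail, and `retransmission_timeout_ms` is a hypothetical name.

```python
TCP_RTO_MIN_MS = 200       # smallest rto the kernel uses
TCP_RTO_MAX_MS = 120_000   # back-off never exceeds 120 seconds


def retransmission_timeout_ms(initial_rto_ms, retries=15):
    """Sum of `retries` exponentially growing waits, capped at the max rto."""
    total, rto = 0, initial_rto_ms
    for _ in range(retries):
        rto = min(rto, TCP_RTO_MAX_MS)
        total += rto
        rto *= 2
    return total


if __name__ == "__main__":
    fastest = retransmission_timeout_ms(TCP_RTO_MIN_MS)
    slowest = retransmission_timeout_ms(TCP_RTO_MAX_MS)
    print(round(fastest / 60_000, 1))  # 13.4
    print(round(slowest / 60_000, 1))  # 30.0
```

Under this model a low-latency path sits near the 13-minute end, which matches the gRPC delay mentioned at the start of the post.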
In practice, these timeout errors happen in the following cases:
- `FIN` packets from a host are dropped, then the host goes down.
- A host kernel panics, all TCP connections from the host are gone and no `FIN` packet is sent.
- In a Kubernetes cluster using the Calico CNI, when a node is deleted with `kubectl delete node`, packets from the node are immediately dropped by other nodes due to iptables filters. For example, this problem causes this bug.
- A misconfigured firewall drops ICMP `FRAG_NEEDED` messages, causing timeouts during TCP path MTU discovery.
The following table shows the return values of syscalls (`recv()`, `send()`, etc.) for these timeout errors:
abort reason | received ICMP | errno |
---|---|---|
keepalive timeout | yes | ETIMEDOUT |
keepalive timeout | no | ETIMEDOUT |
retransmission timeout | yes | converted ICMP error |
retransmission timeout | no | ETIMEDOUT |
All these timeout errors are counted in the SNMP MIB counter `TCPABORTONTIMEOUT` and can be shown with the following command:
$ netstat -s | grep timeout
3 connections aborted due to timeout
4. Summary
Different network timeouts have different durations and return values. ICMP messages provide error feedback signals for network problems. However, many ICMP messages are not forwarded to applications.
We could reduce network timeout errors by doing the following:
- Use `ip route` or `iptables` to generate ICMP `DEST_UNREACH` messages for non-existing hosts.
- Make firewall filters send error feedback instead of dropping packets silently.
- Do not drop ICMP `HOST_UNREACH` messages.
- Enable the socket option `IP_RECVERR` to receive and handle ICMP errors in applications.