TCP parameter SO_LINGER: explanation and tests

Java Socket SO_LINGER setter source code

```java
/**
 * Enable/disable {@link SocketOptions#SO_LINGER SO_LINGER} with the
 * specified linger time in seconds. The maximum timeout value is platform
 * specific.
 *
 * The setting only affects socket close.
 *
 * @param on     whether or not to linger on.
 * @param linger how long to linger for, if on is true.
 * @exception SocketException if there is an error
 * in the underlying protocol, such as a TCP error.
 * @exception IllegalArgumentException if the linger value is negative.
 * @since JDK1.1
 * @see #getSoLinger()
 */
public void setSoLinger(boolean on, int linger) throws SocketException {
    if (isClosed())
        throw new SocketException("Socket is closed");
    if (!on) {
        getImpl().setOption(SocketOptions.SO_LINGER, new Boolean(on));
    } else {
        if (linger < 0) {
            throw new IllegalArgumentException("invalid value for SO_LINGER");
        }
        if (linger > 65535)
            linger = 65535;
        getImpl().setOption(SocketOptions.SO_LINGER, new Integer(linger));
    }
}
```

Socket SO_LINGER parameter settings explained

The SO_LINGER option controls how a socket behaves when it is closed. Leaving the OS-level implementation aside and looking only at the Java code above, the method's two parameters, `on` and `linger`, give three meaningful combinations:

on = false

This is the default behavior. When `on` is `false`, the `linger` value is irrelevant: when the socket is actively closed, the calling thread returns immediately without blocking, any data remaining in the send buffer continues to be delivered to the peer, the normal FIN-ACK exchange takes place, and the connection finally enters the TIME_WAIT state.

on = true, linger > 0

The thread calling close blocks until one of two things happens:

  1. the remaining data continues to be sent and the normal close handshake is performed; or
  2. the linger timeout expires, in which case the remaining data is discarded and the FIN-ACK exchange is performed.

on = true, linger = 0

This is the so-called "hard close", and the most debated of the three usages: any remaining data is discarded immediately, no FIN-ACK exchange takes place, and an RST is generated instead, causing the peer to see a SocketException with the message `connection reset`.
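To make the three combinations concrete, here is a minimal Java sketch (the host and port are placeholders, not from any example in this article; in practice you would pick only one of the three calls):

```java
import java.io.IOException;
import java.net.Socket;

public class SoLingerModes {
    public static void main(String[] args) throws IOException {
        // Placeholder host/port for illustration only.
        try (Socket socket = new Socket("localhost", 8888)) {
            // 1. on = false (the default): close() returns immediately,
            //    buffered data is still delivered, and a normal FIN is sent.
            socket.setSoLinger(false, 0);

            // 2. on = true, linger > 0: close() blocks for up to 10 seconds
            //    while remaining data is sent, or until the timeout expires.
            socket.setSoLinger(true, 10);

            // 3. on = true, linger = 0: the "hard close" - buffered data is
            //    discarded and an RST is sent instead of a FIN.
            socket.setSoLinger(true, 0);
        } // try-with-resources closes the socket here, with the last setting
    }
}
```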

Downloading the sock test program

The demonstrations below use the sock program from UNIX Network Programming, Volume 1, Second Edition: Networking APIs: Sockets and XTI, Prentice Hall, 1998, ISBN 0-13-490012-X, where the sock source code can be downloaded. I have also put a copy of the source, along with an executable built on Linux, on GitHub; on other platforms you can build it by following the README, or download a prebuilt sock binary directly from sourceforge.net.

Normal connection close: the four-way handshake

Sending an RST to abort a connection

The normal way to terminate a connection is for one side of the TCP connection to send a FIN. This is sometimes called an orderly release, since the FIN is sent only after all queued data has been delivered, and normally no data is lost. It is also possible to release a connection mid-stream by sending a reset segment instead of a FIN; this is sometimes called an abortive release.

Aborting a connection offers the application two advantages:

  1. any queued data is discarded and the reset segment is sent immediately;
  2. the receiver of the RST can tell that the other end aborted instead of closing normally. The API the application uses must provide a way to generate the abort instead of a normal close.

The sockets API provides this abortive-close capability through the linger-on-close option (SO_LINGER): enabling the option with a linger time of 0 causes the close to send an RST instead of the normal FIN. The original text reads:

TCP/IP Illustrated - Aborting a Connection

We saw in Section 18.2 that the normal way to terminate a connection is for one side to send a FIN. This is sometimes called an orderly release since the FIN is sent after all previously queued data has been sent, and there is normally no loss of data. But it's also possible to abort a connection by sending a reset instead of a FIN. This is sometimes called an abortive release.

Aborting a connection provides two features to the application:

  1. any queued data is thrown away and the reset is sent immediately, and
  2. the receiver of the RST can tell that the other end did an abort instead of a normal close. The API being used by the application must provide a way to generate the abort instead of a normal close.

We can watch this abort sequence happen using our sock program. The sockets API provides this capability by using the "linger on close" socket option (SO_LINGER). We specify the -L option with a linger time of 0. This causes the abort to be sent when the connection is closed, instead of the normal FIN.

Example: aborting a connection with RST

The following example uses an RST instead of a FIN to abort a connection. In the screenshot, the middle command starts the server with the sock program, and the bottom command runs the client, where we type `Hello World` and then the end-of-file character (Ctrl-D):
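Based on the book's description, the two invocations look roughly like this (host name and port number are illustrative assumptions, not the values from the original capture):

```bash
# server side: listen on an arbitrary port
sock -s 6666

# client side: -L0 sets SO_LINGER with a linger time of 0, so the EOF
# typed after "Hello World" triggers an RST instead of a FIN
sock -L0 <server-host> 6666
```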

Packets 1 through 3 in the capture above show the normal connection establishment. Packet 5 carries the line of data we typed (12 characters plus the Unix newline), and packet 6 acknowledges the received data.

Packet 7 corresponds to the end-of-file character (Ctrl-D) typed to terminate the client. Because we specified an abortive close rather than a normal close (the -L0 option on the command line), in the final packet 8 the client's TCP sends an RST instead of the usual FIN. The RST segment carries a sequence number and an acknowledgment number. Note that an RST elicits no response from the other end: it is not acknowledged at all. The receiver of the RST aborts the connection and notifies the application layer that the connection was reset.

Discussion on stackoverflow.com about TCP `linger=true, timeout=0`

The typical reason for setting the SO_LINGER timeout to 0 is to avoid large numbers of connections sitting in the TIME_WAIT state; too many of them can eventually prevent a server from opening new connections.

When a TCP connection is closed, the side that initiates the close (the active close) ends up sitting in TIME_WAIT for several minutes. If the protocol is one where the server initiates the close, and very large numbers of short-lived connections are involved, the server is particularly susceptible to this buildup of TIME_WAIT connections.

Using `linger=true, timeout=0` to sidestep the TIME_WAIT buildup is not a good idea. TIME_WAIT exists for a reason (to ensure that packets from an old connection do not interfere with a new one). If possible, it is better to redesign the protocol so that the client initiates the connection close.

To understand why the TIME_WAIT state is our friend, read Section 2.7 of UNIX Network Programming, Third Edition, by Stevens et al. The original text follows:

UNIX Network Programming - 2.7 TIME_WAIT State

Undoubtedly, one of the most misunderstood aspects of TCP with regard to network programming is its TIME_WAIT state. We can see in Figure 2.4 that the end that performs the active close goes through this state. The duration that this endpoint remains in this state is twice the maximum segment lifetime (MSL), sometimes called 2MSL.

Every implementation of TCP must choose a value for the MSL. The recommended value in RFC 1122 [Braden 1989] is 2 minutes, although Berkeley-derived implementations have traditionally used a value of 30 seconds instead. This means the duration of the TIME_WAIT state is between 1 and 4 minutes. The MSL is the maximum amount of time that any given IP datagram can live in a network. We know this time is bounded because every datagram contains an 8-bit hop limit (the IPv4 TTL field in Figure A.1 and the IPv6 hop limit field in Figure A.2) with a maximum value of 255. Although this is a hop limit and not a true time limit, the assumption is made that a packet with the maximum hop limit of 255 cannot exist in a network for more than MSL seconds.

The way in which a packet gets "lost" in a network is usually the result of routing anomalies. A router crashes or a link between two routers goes down and it takes the routing protocols seconds or minutes to stabilize and find an alternate path. During that time period, routing loops can occur (router A sends packets to router B, and B sends them back to A) and packets can get caught in these loops. In the meantime, assuming the lost packet is a TCP segment, the sending TCP times out and retransmits the packet, and the retransmitted packet gets to the final destination by some alternate path. But sometime later (up to MSL seconds after the lost packet started on its journey), the routing loop is corrected and the packet that was lost in the loop is sent to the final destination. This original packet is called a lost duplicate or a wandering duplicate. TCP must handle these duplicates.

There are two reasons for the TIME_WAIT state:

  1. To implement TCP's full-duplex connection termination reliably
  2. To allow old duplicate segments to expire in the network

The first reason can be explained by looking at Figure 2.5 and assuming that the final ACK is lost. The server will resend its final FIN, so the client must maintain state information, allowing it to resend the final ACK. If it did not maintain this information, it would respond with an RST (a different type of TCP segment), which would be interpreted by the server as an error. If TCP is performing all the work necessary to terminate both directions of data flow cleanly for a connection (its full-duplex close), then it must correctly handle the loss of any of these four segments.

This example also shows why the end that performs the active close is the end that remains in the TIME_WAIT state: because that end is the one that might have to retransmit the final ACK.

To understand the second reason for the TIME_WAIT state, assume we have a TCP connection between 12.106.32.254 port 1500 and 206.168.112.219 port 21. This connection is closed and then sometime later, we establish another connection between the same IP addresses and ports: 12.106.32.254 port 1500 and 206.168.112.219 port 21. This latter connection is called an incarnation of the previous connection since the IP addresses and ports are the same. TCP must prevent old duplicates from a connection from reappearing at some later time and being misinterpreted as belonging to a new incarnation of the same connection. To do this, TCP will not initiate a new incarnation of a connection that is currently in the TIME_WAIT state. Since the duration of the TIME_WAIT state is twice the MSL, this allows MSL seconds for a packet in one direction to be lost, and another MSL seconds for the reply to be lost. By enforcing this rule, we are guaranteed that when we successfully establish a TCP connection, all old duplicates from previous incarnations of the connection have expired in the network.

There is an exception to this rule. Berkeley-derived implementations will initiate a new incarnation of a connection that is currently in the TIME_WAIT state if the arriving SYN has a sequence number that is "greater than" the ending sequence number from the previous incarnation. Pages 958–959 of TCPv2 talk about this in more detail. This requires the server to perform the active close, since the TIME_WAIT state must exist on the end that receives the next SYN. This capability is used by the rsh command. RFC 1185 [Jacobson, Braden, and Zhang 1990] talks about some pitfalls in doing this.

UNIX Network Programming - 7.4 SO_LINGER Socket Option

This option specifies how the close function operates for a connection-oriented protocol (e.g., for TCP and SCTP, but not for UDP). By default, close returns immediately, but if there is any data still remaining in the socket send buffer, the system will try to deliver the data to the peer.

The SO_LINGER socket option lets us change this default. This option requires the following structure to be passed between the user process and the kernel. It is defined by including <sys/socket.h>.

```c
struct linger {
    int l_onoff;   /* 0=off, nonzero=on */
    int l_linger;  /* linger time, POSIX specifies units as seconds */
};
```

Calling setsockopt leads to one of the following three scenarios, depending on the values of the two structure members:

  1. If l_onoff is 0, the option is turned off. The value of l_linger is ignored and the previously discussed TCP default applies: close returns immediately.
  2. If l_onoff is nonzero and l_linger is zero, TCP aborts the connection when it is closed (pp. 1019–1020 of TCPv2). That is, TCP discards any data still remaining in the socket send buffer and sends an RST to the peer, not the normal four-packet connection termination sequence (Section 2.6). We will show an example of this in Figure 16.21. This avoids TCP's TIME_WAIT state, but in doing so, leaves open the possibility of another incarnation of this connection being created within 2MSL seconds (Section 2.7) and having old duplicate segments from the just-terminated connection being incorrectly delivered to the new incarnation. SCTP will also do an abortive close of the socket by sending an ABORT chunk to the peer (see Section 9.2 of [Stewart and Xie 2001]) when l_onoff is nonzero and l_linger is zero. Occasional USENET postings advocate the use of this feature just to avoid the TIME_WAIT state and to be able to restart a listening server even if connections are still in use with the server's well-known port. This should NOT be done and could lead to data corruption, as detailed in RFC 1337 [Braden 1992]. Instead, the SO_REUSEADDR socket option should always be used in the server before the call to bind, as we will describe shortly. The TIME_WAIT state is our friend and is there to help us (i.e., to let old duplicate segments expire in the network). Instead of trying to avoid the state, we should understand it (Section 2.7). There are certain circumstances which warrant using this feature to send an abortive close. One example is an RS-232 terminal server, which might hang forever in CLOSE_WAIT trying to deliver data to a stuck terminal port, but would properly reset the stuck port if it got an RST to discard the pending data.
  3. If l_onoff is nonzero and l_linger is nonzero, then the kernel will linger when the socket is closed (p. 472 of TCPv2). That is, if there is any data still remaining in the socket send buffer, the process is put to sleep until either: (i) all the data is sent and acknowledged by the peer TCP, or (ii) the linger time expires. If the socket has been set to nonblocking (Chapter 16), it will not wait for the close to complete, even if the linger time is nonzero. When using this feature of the SO_LINGER option, it is important for the application to check the return value from close, because if the linger time expires before the remaining data is sent and acknowledged, close returns EWOULDBLOCK and any remaining data in the send buffer is discarded.

We now need to see exactly when a close on a socket returns given the various scenarios we looked at. We assume that the client writes data to the socket and then calls close. Figure 7.7 shows the default situation.

The SO_LINGER socket option gives us more control over when close returns and also lets us force an RST to be sent instead of TCP's four-packet connection termination sequence. We must be careful sending RSTs, because this avoids TCP's TIME_WAIT state. Much of the time, this socket option does not provide the information that we need, in which case, an application-level ACK is required.
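To make that last remark concrete, here is a minimal sketch of an application-level acknowledgment in Java. Everything here (host, port, and the single ACK byte) is an assumption for illustration; the point is that the client confirms delivery explicitly rather than trusting close() or SO_LINGER:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class AppLevelAckClient {
    public static void main(String[] args) throws Exception {
        // Host, port, and the ACK byte value are illustrative assumptions.
        try (Socket socket = new Socket("localhost", 8888)) {
            OutputStream out = socket.getOutputStream();
            InputStream in = socket.getInputStream();

            out.write("important payload".getBytes());
            out.flush();

            // Wait for the peer application's explicit ACK byte. Unlike a
            // lingering close(), this confirms the peer application actually
            // processed the data, not merely that its TCP stack received it.
            int ack = in.read();
            if (ack != 0x06) { // hypothetical ACK value agreed by both sides
                throw new IllegalStateException("peer did not acknowledge");
            }
        } // safe to close now
    }
}
```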

Purposes for the TIME-WAIT state

The most known one is to prevent delayed segments from one connection being accepted by a later connection relying on the same quadruplet (source address, source port, destination address, destination port). The sequence number also needs to be in a certain range to be accepted. This narrows the problem a bit, but it still exists, especially on fast connections with large receive windows. RFC 1337 explains in detail what happens when the TIME-WAIT state is deficient. Here is an example of what could be avoided if the TIME-WAIT state wasn't shortened:

The other purpose is to ensure the remote end has closed the connection. When the last ACK is lost, the remote end stays in the LAST-ACK state. Without the TIME-WAIT state, a connection could be reopened while the remote end still thinks the previous connection is valid. When it receives a SYN segment (and the sequence number matches), it will answer with a RST as it is not expecting such a segment. The new connection will be aborted with an error:

TCP state transition diagram

When to use SO_LINGER with timeout 0

Again, according to "UNIX Network Programming" third edition, setting SO_LINGER with timeout 0 prior to calling close() will cause the normal termination sequence not to be initiated.

Instead, the peer setting this option and calling close() will send a RST (connection reset) which indicates an error condition and this is how it will be perceived at the other end. You will typically see errors like Connection reset by peer.

Therefore, in the normal situation it is a really bad idea to set SO_LINGER with timeout 0 prior to calling close() – from now on called abortive close – in a server application.

However, certain situations warrant doing so anyway:

  1. If a client of your server application misbehaves (times out, returns invalid data, etc.), an abortive close makes sense to avoid being stuck in CLOSE_WAIT or ending up in the TIME_WAIT state.
  2. If you must restart your server application which currently has thousands of client connections you might consider setting this socket option to avoid thousands of server sockets in TIME_WAIT (when calling close() from the server end) as this might prevent the server from getting available ports for new client connections after being restarted.
  3. On page 202 in the aforementioned book it specifically says:
    There are certain circumstances which warrant using this feature to send an abortive close. One example is an RS-232 terminal server, which might hang forever in CLOSE_WAIT trying to deliver data to a stuck terminal port, but would properly reset the stuck port if it got an RST to discard the pending data.
  4. I would recommend this long article which I believe gives a very good answer to your question.

As mentioned previously, the TIME_WAIT state is intended to allow any datagrams lingering from a closed connection to be discarded. During this period, the waiting TCP usually has little to do; it merely holds the state until the 2MSL timer expires.

Linux TCP SO_LINGER related kernel source

Excerpts from https://github.com/torvalds/linux/blob/v5.0/net/ipv4/tcp.c follow. The source comment about the abort call reads:

tcp.c
```c
 *  Salvatore Sanfilippo    :   Support SO_LINGER with linger == 1 and
 *                              lingertime == 0 (RFC 793 ABORT Call)
```

The comment block describing the TCP states:

```c
 * Description of States:
 *
 *  TCP_SYN_SENT        sent a connection request, waiting for ack
 *
 *  TCP_SYN_RECV        received a connection request, sent ack,
 *                      waiting for final ack in three-way handshake.
 *
 *  TCP_ESTABLISHED     connection established
 *
 *  TCP_FIN_WAIT1       our side has shutdown, waiting to complete
 *                      transmission of remaining buffered data
 *
 *  TCP_FIN_WAIT2       all buffered data sent, waiting for remote
 *                      to shutdown
 *
 *  TCP_CLOSING         both sides have shutdown but we still have
 *                      data we have to finish sending
 *
 *  TCP_TIME_WAIT       timeout to catch resent junk before entering
 *                      closed, can only be entered from FIN_WAIT2
 *                      or CLOSING.  Required because the other end
 *                      may not have gotten our last ACK causing it
 *                      to retransmit the data packet (which we ignore)
 *
 *  TCP_CLOSE_WAIT      remote side has shutdown and is waiting for
 *                      us to finish writing our data and to shutdown
 *                      (we have to close() to move on to LAST_ACK)
 *
 *  TCP_LAST_ACK        out side has shutdown after remote has
 *                      shutdown.  There may still be data in our
 *                      buffer that we have to finish sending
 *
 *  TCP_CLOSE           socket is finished
```

The source of `tcp_close(struct sock *sk, long timeout)` is as follows:

```c
void tcp_close(struct sock *sk, long timeout)
{
    struct sk_buff *skb;
    int data_was_unread = 0;
    int state;

    lock_sock(sk);
    sk->sk_shutdown = SHUTDOWN_MASK;

    if (sk->sk_state == TCP_LISTEN) {
        tcp_set_state(sk, TCP_CLOSE);

        /* Special case. */
        inet_csk_listen_stop(sk);

        goto adjudge_to_death;
    }

    /*  We need to flush the recv. buffs.  We do this only on the
     *  descriptor close, not protocol-sourced closes, because the
     *  reader process may not have drained the data yet!
     */
    while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
        u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq;

        if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
            len--;
        data_was_unread += len;
        __kfree_skb(skb);
    }

    sk_mem_reclaim(sk);

    /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
    if (sk->sk_state == TCP_CLOSE)
        goto adjudge_to_death;

    /* As outlined in RFC 2525, section 2.17, we send a RST here because
     * data was lost. To witness the awful effects of the old behavior of
     * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
     * GET in an FTP client, suspend the process, wait for the client to
     * advertise a zero window, then kill -9 the FTP client, wheee...
     * Note: timeout is always zero in such a case.
     */
    if (unlikely(tcp_sk(sk)->repair)) {
        sk->sk_prot->disconnect(sk, 0);
    } else if (data_was_unread) {
        /* Unread data was tossed, zap the connection. */
        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
        tcp_set_state(sk, TCP_CLOSE);
        tcp_send_active_reset(sk, sk->sk_allocation);
    } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
        /* Check zero linger _after_ checking for unread data. */
        sk->sk_prot->disconnect(sk, 0);
        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
    } else if (tcp_close_state(sk)) {
        /* We FIN if the application ate all the data before
         * zapping the connection.
         */

        /* RED-PEN. Formally speaking, we have broken TCP state
         * machine. State transitions:
         *
         * TCP_ESTABLISHED -> TCP_FIN_WAIT1
         * TCP_SYN_RECV -> TCP_FIN_WAIT1 (forget it, it's impossible)
         * TCP_CLOSE_WAIT -> TCP_LAST_ACK
         *
         * are legal only when FIN has been sent (i.e. in window),
         * rather than queued out of window. Purists blame.
         *
         * F.e. "RFC state" is ESTABLISHED,
         * if Linux state is FIN-WAIT-1, but FIN is still not sent.
         *
         * The visible declinations are that sometimes
         * we enter time-wait state, when it is not required really
         * (harmless), do not send active resets, when they are
         * required by specs (TCP_ESTABLISHED, TCP_CLOSE_WAIT, when
         * they look as CLOSING or LAST_ACK for Linux)
         * Probably, I missed some more holelets.
         *                      --ANK
         * XXX (TFO) - To start off we don't support SYN+ACK+FIN
         * in a single packet! (May consider it later but will
         * probably need API support or TCP_CORK SYN-ACK until
         * data is written and socket is closed.)
         */
        tcp_send_fin(sk);
    }

    sk_stream_wait_close(sk, timeout);

adjudge_to_death:
    state = sk->sk_state;
    sock_hold(sk);
    sock_orphan(sk);

    local_bh_disable();
    bh_lock_sock(sk);
    /* remove backlog if any, without releasing ownership. */
    __release_sock(sk);

    percpu_counter_inc(sk->sk_prot->orphan_count);

    /* Have we already been destroyed by a softirq or backlog? */
    if (state != TCP_CLOSE && sk->sk_state == TCP_CLOSE)
        goto out;

    /*  This is a (useful) BSD violating of the RFC. There is a
     *  problem with TCP as specified in that the other end could
     *  keep a socket open forever with no application left this end.
     *  We use a 1 minute timeout (about the same as BSD) then kill
     *  our end. If they send after that then tough - BUT: long enough
     *  that we won't make the old 4*rto = almost no time - whoops
     *  reset mistake.
     *
     *  Nope, it was not mistake. It is really desired behaviour
     *  f.e. on http servers, when such sockets are useless, but
     *  consume significant resources. Let's do it with special
     *  linger2 option.                 --ANK
     */
    if (sk->sk_state == TCP_FIN_WAIT2) {
        struct tcp_sock *tp = tcp_sk(sk);
        if (tp->linger2 < 0) {
            tcp_set_state(sk, TCP_CLOSE);
            tcp_send_active_reset(sk, GFP_ATOMIC);
            __NET_INC_STATS(sock_net(sk),
                    LINUX_MIB_TCPABORTONLINGER);
        } else {
            const int tmo = tcp_fin_time(sk);

            if (tmo > TCP_TIMEWAIT_LEN) {
                inet_csk_reset_keepalive_timer(sk,
                        tmo - TCP_TIMEWAIT_LEN);
            } else {
                tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
                goto out;
            }
        }
    }
    if (sk->sk_state != TCP_CLOSE) {
        sk_mem_reclaim(sk);
        if (tcp_check_oom(sk, 0)) {
            tcp_set_state(sk, TCP_CLOSE);
            tcp_send_active_reset(sk, GFP_ATOMIC);
            __NET_INC_STATS(sock_net(sk),
                    LINUX_MIB_TCPABORTONMEMORY);
        } else if (!check_net(sock_net(sk))) {
            /* Not possible to send reset; just close */
            tcp_set_state(sk, TCP_CLOSE);
        }
    }

    if (sk->sk_state == TCP_CLOSE) {
        struct request_sock *req = tcp_sk(sk)->fastopen_rsk;

        /* We could get here with a non-NULL req if the socket is
         * aborted (e.g., closed with unread data) before 3WHS
         * finishes.
         */
        if (req)
            reqsk_fastopen_remove(sk, req, false);
        inet_csk_destroy_sock(sk);
    }
    /* Otherwise, socket is reprieved until protocol close. */

out:
    bh_unlock_sock(sk);
    local_bh_enable();
    release_sock(sk);
    sock_put(sk);
}
```

In the v5.0 source linked above, line 2371 (the `sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime` branch) handles the case where the linger struct's `l_onoff` is 1 and `l_linger` is 0. Here `sk->sk_prot->disconnect(sk, 0)` resolves to `tcp_disconnect()`, which drops all received data and tears the connection down directly: it sends an RST and empties the relevant receive queues. Lines 2375 through 2405 implement the normal termination path, the four-way handshake. `tcp_close_state()` is called first to switch state and decide whether a FIN needs to be sent at all (for example, in TCP_SYN_SENT the connection is not yet fully established, so no FIN is needed); if a FIN is required, `tcp_send_fin()` sends it.
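As an aside, the `data_was_unread` branch above means Linux sends an RST on close whenever data is left unread in the receive buffer, even without SO_LINGER. A small Java sketch that should trigger that branch (the port and sleep duration are arbitrary choices, and the final read is expected to fail with "Connection reset" on Linux):

```java
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWithUnreadData {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9999);
             Socket client = new Socket("localhost", 9999)) {
            Socket accepted = server.accept();

            client.getOutputStream().write("unread".getBytes());
            client.getOutputStream().flush();
            Thread.sleep(200); // let the data land in the server's receive buffer

            // The server never reads, so this close() hits the
            // data_was_unread branch and the kernel sends an RST.
            accepted.close();

            // This read should now fail with a "Connection reset" exception.
            client.getInputStream().read();
        }
    }
}
```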

Tuning the TIME_WAIT state

Tuning advice like the following is common online:

```bash
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout  # only for the active-close endpoint
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
# don't run the command below
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
```

About tcp_tw_recycle

These settings are problematic, `tcp_tw_recycle` in particular; the option was removed entirely as of Linux 4.12. For more detail, see Coping with the TCP TIME-WAIT state on busy Linux servers.

About the tcp_tw_reuse parameter

The TIME-WAIT state exists to prevent stale, delayed segments from being accepted by an unrelated connection. Under certain conditions, however, packets belonging to a newly established TCP connection could still be mishandled by an old connection with the same four-tuple (one still in TIME-WAIT and being recycled).

RFC 1323 defines a set of TCP extensions to improve performance over high-bandwidth paths. Among other things, it defines a new TCP option carrying two four-byte timestamp fields: the first is the sending TCP's current clock timestamp, and the second is the most recent timestamp received from the remote host.

With net.ipv4.tcp_tw_reuse enabled, Linux will reuse an existing connection in TIME-WAIT for a new outgoing connection if the new timestamp is strictly greater than the most recent timestamp recorded for the previous connection: a connection in TIME-WAIT can be reused after as little as one second.

Safety of tcp_tw_reuse

If the FIN from the remote end arrives in time and the local ACK goes out, the local end enters TIME-WAIT. Once a new connection replaces the old TIME-WAIT one, the new connection's SYN is ignored by the remote end (thanks to the timestamps) and is not answered with an RST (contrast this with the earlier case of a new connection reusing the same four-tuple: without timestamps the remote would reply with an RST and close the connection); instead the remote answers by retransmitting its FIN segment, as shown in the figure below. That FIN is answered with an RST (because the local side is in SYN-SENT), which lets the remote end break out of the LAST-ACK state. The initial SYN, having received no answer, is retransmitted after one second, and the connection then completes with no apparent error beyond the slight delay.

Enabling net.ipv4.tcp_tw_reuse on the client side (including on servers where a service acts as a client, e.g. nginx as a reverse proxy) is a reasonably safe way to deal with TIME-WAIT buildup.

MSL settings

MSL (Maximum Segment Lifetime) is the maximum time a segment may live in the network. The associated TIME_WAIT timers commonly default to 60 seconds on Linux and 120 seconds on Windows; some systems use 30 seconds, and 30 seconds is also the commonly recommended value on Windows.

Viewing and setting on Linux

```bash
cat /proc/sys/net/ipv4/tcp_fin_timeout
# setting
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
```

Note that `tcp_fin_timeout` actually controls how long an orphaned connection may stay in FIN_WAIT_2; the TIME_WAIT duration itself is the compile-time constant `TCP_TIMEWAIT_LEN` (60 seconds, visible in the kernel source above) and cannot be changed via sysctl.

Setting on Windows

Under the registry key `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters`, set a value named `TcpTimedWaitDelay` of type `REG_DWORD`. The default is hexadecimal `0x78`, i.e. 120 seconds; the recommended value is 30 seconds.
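A one-line sketch of that change (an assumed invocation, run from an elevated command prompt; `/f` overwrites any existing value):

```
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
```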

Java server/client example

The Java code below demonstrates setting the SO_LINGER option; the key call is `socket.setSoLinger(true, 0)`. Start the server first, then the client:

SocketTcpSoLingerServer.java

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

public class SocketTcpSoLingerServer {

    private static final Logger LOGGER = LoggerFactory.getLogger(SocketTcpSoLingerServer.class);

    private static final int PORT = 8888;

    public static void main(String[] args) throws IOException {
        ServerSocket serverSocket = new ServerSocket();
        serverSocket.bind(new InetSocketAddress(PORT));
        LOGGER.info("server startup at {}", PORT);
        byte[] data = new byte[32];
        while (true) {
            Socket socket = serverSocket.accept();
            LOGGER.info("1. socket so linger : {}", socket.getSoLinger());
            socket.setSoLinger(true, 0);
            // socket.setSoLinger(true, 100);
            LOGGER.info("2. socket so linger : {}", socket.getSoLinger());
            try (InputStream in = socket.getInputStream();
                 OutputStream out = socket.getOutputStream()) {
                int i = 1;
                while (true) {
                    try {
                        int read = in.read(data);
                        if (read > 0) {
                            String line = new String(data, 0, read);
                            // Delay the reply so the client sends its FIN first,
                            // i.e. calls socket.shutdownOutput()
                            TimeUnit.MILLISECONDS.sleep(1500L);
                            LOGGER.info("{} : {}", i++, line);
                            out.write(line.getBytes());
                        } else if (read == -1) {
                            LOGGER.info("close socket");
                            socket.close();
                            break;
                        }
                    } catch (IOException e) {
                        LOGGER.error("close socket for error", e);
                        socket.close();
                        break;
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }
}
```

SocketTcpSoLingerClient.java

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.util.Timer;
import java.util.TimerTask;

public class SocketTcpSoLingerClient {

    private static final Logger LOGGER = LoggerFactory.getLogger(SocketTcpSoLingerClient.class);

    private static final int PORT = 8888;

    public static void main(String[] args) throws IOException {
        Socket socket = new Socket("localhost", PORT);
        try (InputStream socketInputStream = socket.getInputStream();
             OutputStream socketOutputStream = socket.getOutputStream()) {
            String head = "hello ";
            String body = "world";
            socketOutputStream.write(head.getBytes());
            socketOutputStream.write(body.getBytes());
            // If the client never shuts down its output, neither side ever
            // receives a FIN, so the connection stays open indefinitely
            boolean shutdownOutput = true;
            // boolean shutdownOutput = false;
            if (shutdownOutput) {
                // When socket.shutdownOutput() returns, "hello" and "world"
                // have not necessarily reached the peer yet
                socket.shutdownOutput();
            } else {
                // Use a timer to delay closing the connection:
                // wait 3 seconds so the server finishes all its output and
                // both sides close together
                final Timer timer = new Timer();
                timer.schedule(new TimerTask() {
                    @Override
                    public void run() {
                        try {
                            LOGGER.info("socket shutdown output in timer : {}", timer);
                            socket.shutdownOutput();
                            timer.cancel();
                        } catch (IOException e) {
                            LOGGER.error("socket shutdown error", e);
                        }
                    }
                }, 3000L);
            }
            LOGGER.info("socket shutdown output");
            int i = 1;
            byte[] data = new byte[32];
            while (true) {
                int read = socketInputStream.read(data);
                if (read > 0) {
                    String line = new String(data, 0, read);
                    LOGGER.info("{} : {}", i++, line);
                } else if (read == -1) {
                    LOGGER.info("socket closed");
                    socket.close();
                    break;
                }
            }
        }
    }
}
```

Packet capture

With the Java programs above running, the capture below shows that after the client sends its FIN (line 9) the connection enters a half-closed state: the server keeps sending data to the client, the client ACKs the remaining data, and the server finally answers not with a normal FIN but with an RST, completing the close:

Summary: using the SO_LINGER parameter

Use `on=true, linger=0` with care. Forcing the connection closed with an RST instead of a FIN means the actively closing side never enters TIME_WAIT, which reduces the number of lingering connections and can raise concurrent-connection capacity. But with this abortive close you also lose everything the TIME_WAIT state is there to guarantee. It is a trade-off: avoid it where you can, and instead design the application-layer protocol so that excessive TIME_WAIT connections do not accumulate in the first place.

References

  1. Linux TCP source code
  2. Chapter 2. The Transport Layer: TCP, UDP, and SCTP
  3. Coping with the TCP TIME-WAIT state on busy Linux servers
  4. 4.5 listen Function
  5. TCP Connection Establishment and Termination
  6. TCP/IP Illustrated, Chapter 18: TCP Connection Establishment and Termination
  7. Settings that can be Modified to Improve Network Performance
  8. When to use SO_LINGER with timeout 0
  9. Socket options series: SO_LINGER
  10. Adjust the MaxUserPort and TcpTimedWaitDelay settings
  11. tcp: remove tcp_tw_recycle
  12. Do not enable net.ipv4.tcp_tw_recycle on Linux
  13. tcp connection open and close