Configuring the Linux Network Subsystem

Calculating the TCP Window for the Bandwidth Delay Product

TCP performance depends on several factors. The two most important are link bandwidth (the rate at which data can be transmitted over the network) and round-trip time (RTT, the delay between sending a segment and receiving its acknowledgment from the peer). Together, these two values define the so-called Bandwidth Delay Product (BDP).

Once you know the network bandwidth and RTT, you can calculate the BDP. The BDP gives an easy way to calculate the theoretically optimal TCP socket buffer size. If the buffer is too small, the TCP window can’t fully open, which limits performance. If the buffer is oversized, valuable RAM is wasted. The optimal value lets you use all the available bandwidth while minimizing resource usage.

BDP calculation can be made this way:

BDP = link_bandwidth × RTT

For example, if your application communicates over a 100 Mbps network with an RTT of 50 ms, the BDP is:

100 Mbps × 0.050 sec ÷ 8 = 0.625 MB = 625 KB

The BDP is expressed in bytes (KB, MB, GB), while bandwidth is expressed in bits per second (Kbps, Mbps, Gbps). To convert bits to bytes, divide by 8, as in the example above.

So, your TCP window should be set to the BDP (625 KB). However, the default TCP window in Linux 2.6 is 110 KB, which limits the connection throughput to about 18 Mbps:

bandwidth = window size ÷ RTT

110 KB ÷ 0.050 × 8 = 17.6 Mbps
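The two formulas above can be checked with a short Python sketch. The function names are illustrative, not part of any library, and the sketch assumes decimal units (1 KB = 1000 bytes), as the worked examples do.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth Delay Product in bytes: bandwidth x RTT, converted from bits."""
    # bits/s * ms / 1000 gives the bits in flight; dividing by 8 gives bytes
    return bandwidth_bps * rtt_ms // (1000 * 8)


def max_throughput_bps(window_bytes: int, rtt_ms: int) -> int:
    """Upper bound on throughput in bits/s for a given window: window / RTT."""
    return window_bytes * 8 * 1000 // rtt_ms


if __name__ == "__main__":
    print(bdp_bytes(100_000_000, 50))       # 625000 bytes = 625 KB
    print(max_throughput_bps(110_000, 50))  # 17600000 bit/s = 17.6 Mbps
```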

TCP/IP Stack Configuration for Linux

A standard Linux installation tries to be optimized for a wide range of applications. This means that the standard distribution may not be optimal for your environment.

GNU/Linux provides a wide range of configurable kernel parameters that you can use to tune the operating system at run time for your specific applications. Let’s look at some of the most important options that affect socket performance.

Configurable kernel parameters are exposed in the /proc virtual file system (under /proc/sys). Each file there represents one or more parameters, which can be read with the cat utility or changed with the echo command. You can also use the sysctl utility to configure them.
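As a rough illustration of how parameter names relate to files, the following Python sketch maps a dotted sysctl name to its file under /proc/sys and reads or writes it. The helper names are hypothetical, and the simple dot-to-slash mapping assumes that no name component itself contains a dot. Writing requires root privileges and does not persist across reboots.

```python
from pathlib import Path

PROC_SYS = Path("/proc/sys")


def sysctl_path(name: str) -> Path:
    """Map a dotted name such as net.ipv4.tcp_rmem to its /proc/sys file."""
    return PROC_SYS.joinpath(*name.split("."))


def read_sysctl(name: str) -> str:
    """Return the current value, like `cat` on the file."""
    return sysctl_path(name).read_text().strip()


def write_sysctl(name: str, value: str) -> None:
    """Set a new value, like `echo value > file` (needs root)."""
    sysctl_path(name).write_text(value + "\n")


if __name__ == "__main__":
    target = sysctl_path("net.ipv4.tcp_rmem")
    print(target)
    if target.exists():  # only present on Linux
        print(read_sysctl("net.ipv4.tcp_rmem"))
```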

net.core.rmem_default, byte

Default value: 110592

Defines the default receive window size; affects only protocols other than TCP. For TCP, it is redefined in tcp_rmem.

The size should be increased for larger BDP values.

Recommendation: set the value equal to the minimum BDP per client (i.e. based on the bandwidth that should be allocated to every client at any load).

net.core.rmem_max, byte 

Default value: 110592

Defines the maximum size of the receive window; affects all protocols, including TCP. For TCP, the maximum value in tcp_rmem cannot exceed the value of this parameter.

The size should be increased for larger BDP values.

Recommendation: set the value equal to the minimum BDP per client (i.e. based on the bandwidth that should be allocated to every client at any load).

net.core.wmem_default, byte 

Default value: 110592

Defines the default transmit window size; affects only protocols other than TCP. For TCP, it is redefined in tcp_wmem.

The size should be increased for larger BDP values.

Recommendation: according to IBM documentation, always set the value to 64 KB (65536); otherwise, the transmit (TX) ring buffer may overflow.

net.core.wmem_max, byte 

Default value: 110592

Defines the maximum size of the transmit window; affects all protocols, including TCP. For TCP, the maximum value in tcp_wmem cannot exceed the value of this parameter.

The size should be increased for larger BDP values.

Recommendation: set the value to the maximum host BDP (i.e. based on the planned host bandwidth).

net.ipv4.tcp_window_scaling, boolean (1/0)

Default value: 1

Enables window scaling, as defined in RFC1323. It should be enabled to support windows larger than 64 KB.
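The 64 KB limit comes from the 16-bit window field in the TCP header; window scaling multiplies the advertised value by a power of two. As a minimal sketch (the function name is illustrative), the smallest scale shift needed for a given window can be found like this, assuming the RFC 1323 maximum shift of 14:

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit TCP window field
MAX_SHIFT = 14               # largest shift count allowed by RFC 1323


def window_scale_shift(window_bytes: int) -> int:
    """Smallest shift such that 65535 << shift covers the requested window."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window too large even with maximum scaling")


if __name__ == "__main__":
    # The 625 KB BDP from the earlier example needs a shift of 4 (factor 16)
    print(window_scale_shift(625_000))
```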

net.ipv4.tcp_sack, boolean (1/0)

Default value: 1

Enables selective acknowledgment (SACK), which improves performance by selectively acknowledging segments received out of order (as a result, the sender retransmits only the missing segments).

May increase the CPU load.

ATTENTION: potentially increases susceptibility to DoS attacks.

net.ipv4.tcp_fack, boolean (1/0)

Default value: 1

Enables Forward Acknowledgment, which operates with Selective Acknowledgment (SACK) to reduce congestion.

Recommendation: must be enabled when tcp_sack is enabled.

net.ipv4.tcp_timestamps, boolean (1/0)

Default value: 1

Enables TCP timestamps (see RFC 1323), which allow RTT to be calculated more accurately than with the retransmission interval method.

Recommendation: should be enabled to improve performance.

ATTENTION: sometimes causes interoperability problems between two hosts. In several cases, an SSH connection to a server could not be established while tcp_timestamps was enabled on the client.

net.ipv4.tcp_mem, memory page 

Default value: “24576 32768 49152”

Defines the memory limits for the TCP stack and its memory usage behavior (total memory usage, regardless of the number of sockets).

The value is defined in memory pages (usually 4 KB).

Increase the value for large BDP (but remember, it is defined in memory pages, not in bytes).

The first value is the lower limit: below it, the TCP stack does not regulate its memory usage. Recommendation: set the value to BDP ÷ PageSize.

The second value is the memory pressure threshold: once usage exceeds it, the TCP stack starts to moderate its buffer usage. Recommendation: set the value to BDP × 2 ÷ PageSize.

The third value is the maximum limit. At this level, packets can be dropped to reduce memory usage. Recommendation: set the value to BDP × 4 ÷ PageSize.
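The three recommendations can be turned into numbers with a small Python sketch. The function name is illustrative, the 4 KB page size is the usual value mentioned above, and rounding up to whole pages is an assumption of this sketch rather than something the recommendations specify.

```python
from typing import Tuple


def tcp_mem_pages(bdp_bytes: int, page_size: int = 4096) -> Tuple[int, int, int]:
    """Return (low, pressure, max) for tcp_mem, in memory pages."""
    def pages(n_bytes: int) -> int:
        # Ceiling division: a partly used page still counts as a whole page
        return -(-n_bytes // page_size)

    return pages(bdp_bytes), pages(2 * bdp_bytes), pages(4 * bdp_bytes)


if __name__ == "__main__":
    low, pressure, maximum = tcp_mem_pages(625_000)
    print(f"net.ipv4.tcp_mem = {low} {pressure} {maximum}")
```

For the 625 KB BDP from the earlier example this yields 153, 306 and 611 pages. On a real host the BDP used here would be the aggregate for all connections, since tcp_mem limits the whole TCP stack.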

net.ipv4.tcp_wmem, byte

Default value: “4096 16384 131072”

Specifies the per-socket send buffer sizes used by automatic tuning.

This setting is important for content streaming servers, where each client should receive data at a rate no lower than the bit rate of the audio/video stream.

The first value is the minimum number of bytes allocated for the socket send buffer. Recommendation: set the value to 4 KB.

The second value is the default value (overrides wmem_default), to which the buffer size can grow at low system load. Recommendation: set the value to BDP per client (based on the minimum required speed of delivery to the client).

The third value is the maximum send buffer length (overrides wmem_max, but cannot exceed it). When the buffer size is set statically with the SO_SNDBUF socket option, this value is ignored. Recommendation: set the value to BDP per client (based on the desired speed of delivery to the client).
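As a sketch of how the three recommendations combine, the following Python function builds a tcp_wmem value from a minimum and a desired per-client delivery rate. The function names and the sample rates (4 Mbps minimum, 20 Mbps desired, 50 ms RTT) are assumptions chosen only for illustration.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth Delay Product in bytes."""
    return bandwidth_bps * rtt_ms // (1000 * 8)


def tcp_wmem_setting(min_rate_bps: int, desired_rate_bps: int,
                     rtt_ms: int, min_buf: int = 4096) -> str:
    """Build the 'min default max' string for net.ipv4.tcp_wmem (bytes)."""
    # Default: BDP per client at the minimum required delivery rate
    default = max(min_buf, bdp_bytes(min_rate_bps, rtt_ms))
    # Maximum: BDP per client at the desired delivery rate
    maximum = max(default, bdp_bytes(desired_rate_bps, rtt_ms))
    return f"{min_buf} {default} {maximum}"


if __name__ == "__main__":
    print(tcp_wmem_setting(4_000_000, 20_000_000, 50))  # 4096 25000 125000
```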

net.ipv4.tcp_rmem, byte

Default value: “4096 87380 174760”

Same as tcp_wmem, except that it refers to receive buffers for automatic tuning.

net.ipv4.tcp_low_latency, boolean (1/0)

Default value: 0

Makes the TCP/IP stack prefer low latency over higher throughput.

Recommendation: should be disabled.

net.ipv4.tcp_westwood, boolean (1/0)

Default value: 0

Activates a sender-side congestion control algorithm that maintains throughput estimates and tries to make full use of the available bandwidth.

This parameter is also useful for wireless interfaces, where packet loss is not necessarily caused by congestion.

Recommendation: must be enabled for WAN connections.

net.ipv4.tcp_bic, boolean (1/0)

Default value: 1

Activates Binary Increase Congestion control (BIC) for fast, long-distance networks.

Improves the use of links operating at gigabit speeds.

Recommendation: must be enabled for WAN connections.

net.ipv4.tcp_max_orphans, integer

Default value: 65536

Maximum number of TCP sockets allowed in the system that are not associated with any user file handle. Each orphaned connection consumes about 64 KB of unswappable memory.

When the threshold is reached, orphaned connections are immediately discarded with a warning.

Do not reduce the threshold value (rather, increase it in accordance with the system requirements, for example, after installing additional memory).

This threshold helps prevent simple DoS attacks.
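To see what the threshold means in terms of memory, the worst case can be estimated with a one-line calculation. This Python sketch uses an illustrative function name and assumes the figure of about 64 KB per orphaned connection quoted above.

```python
def orphan_memory_bytes(max_orphans: int, per_orphan_bytes: int = 64 * 1024) -> int:
    """Worst-case unswappable memory held by orphaned connections."""
    return max_orphans * per_orphan_bytes


if __name__ == "__main__":
    worst_case = orphan_memory_bytes(65536)
    print(worst_case / 2**30, "GiB")  # 4.0 GiB at the default threshold
```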

net.ipv4.tcp_fin_timeout, integer 

Default value: 60

Specifies how long a socket stays in the FIN-WAIT-2 state after it has been closed by the local side. The peer may never close its side of the connection, so the connection has to be closed on the local side’s own initiative once the timeout expires. The default timeout is 60 seconds.

The 2.2 series kernels typically used a value of 180 seconds, and you can keep that value, but be aware that on heavily loaded servers you risk wasting a lot of memory on half-closed dead connections. Sockets in the FIN-WAIT-2 state are less dangerous than those in FIN-WAIT-1, since they consume no more than 1.5 KB of memory, but they can exist longer.

net.ipv4.tcp_max_syn_backlog, integer 

Maximum number of stored connection requests for which no confirmation was received from the connecting client (i.e., the TCP SYN came from the client, the server answered SYN/ACK, and the next ACK was not received).

If overloads occur on the server, you can try to increase this value.

An overly large backlog may be the result of a SYN flood attack. In this case, enabling SYN cookies will help. If you are sure that no attack is in progress, increase the value as needed.

net.ipv4.tcp_synack_retries, integer 

Number of times a SYN/ACK packet is retransmitted for a passive TCP connection (a retransmission is made if the answering ACK is not received, which may be the result of a SYN flood attack).

The number of attempts should not exceed 255.

A value of 5 corresponds to approximately 180 seconds for connection attempts.
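The figure of roughly 180 seconds follows from exponential backoff of the retransmission timeout. The Python sketch below is a simplified model: it assumes an initial timeout of 3 seconds (typical for older kernels; newer kernels start at 1 second), doubles it after every retransmission, and ignores any upper cap on the timeout. The function name is illustrative.

```python
from typing import Tuple


def synack_timeline(retries: int, initial_rto_s: int = 3) -> Tuple[int, int]:
    """Return (time of last retransmission, time the attempt is abandoned)."""
    elapsed = 0
    rto = initial_rto_s
    for _ in range(retries):
        elapsed += rto  # wait one timeout, then retransmit the SYN/ACK
        rto *= 2        # exponential backoff
    # After the last retransmission, one more (doubled) timeout must expire
    return elapsed, elapsed + rto


if __name__ == "__main__":
    print(synack_timeline(5))                   # (93, 189) with a 3 s start
    print(synack_timeline(5, initial_rto_s=1))  # (31, 63) with a 1 s start
```

With five retries and a 3-second initial timeout, the model abandons the connection after 189 seconds, which is in line with the approximate value quoted above.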

net.ipv4.tcp_syncookies, boolean (1/0) 

Enables the SYN cookies mechanism when a SYN flood attack is detected.

net.ipv4.tcp_tw_reuse, boolean (1/0) 

Default value: 0

If the tcp_tw_reuse flag is set, sockets in the TIME_WAIT state can be reused before the timeout expires. The kernel will try to eliminate collisions based on the TCP sequence number.

Enabling tcp_timestamps rules out collisions, but it must be enabled on both sides (server and client).

Recommendation: must be activated on streaming servers.

net.ipv4.tcp_tw_recycle, boolean (1/0) 

Default value: 0

If the tcp_tw_recycle flag is set, the kernel decides whether a socket in the TIME_WAIT state can be reused based on the remote host’s timestamps. The kernel tracks the last timestamp values sent by remote hosts that “hold” connections in the TIME_WAIT state, and allows the socket to be reused if the timestamp increases correctly.

At the same time, if the timestamp changes incorrectly (for example, “rolls back”), the SYN packet is “silently” discarded and the connection is not established.
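The accept-or-drop decision described above can be pictured with a toy model in Python. This is only a sketch of the idea, not the kernel’s implementation: the class and method names are hypothetical, timestamps are plain integers, and wraparound and expiry of remembered values are ignored.

```python
from typing import Dict


class TimeWaitRecycler:
    """Toy model: remember the last timestamp per remote host."""

    def __init__(self) -> None:
        self._last_ts: Dict[str, int] = {}  # remote IP -> last timestamp seen

    def accept_syn(self, remote_ip: str, timestamp: int) -> bool:
        last = self._last_ts.get(remote_ip)
        if last is not None and timestamp < last:
            return False  # timestamp rolled back: SYN is silently dropped
        self._last_ts[remote_ip] = timestamp
        return True


if __name__ == "__main__":
    recycler = TimeWaitRecycler()
    print(recycler.accept_syn("203.0.113.7", 5000))  # True
    print(recycler.accept_syn("203.0.113.7", 6000))  # True, timestamp grew
    print(recycler.accept_syn("203.0.113.7", 1000))  # False, rolled back
```

The last call shows the failure mode: a SYN whose timestamp is lower than the remembered one, for instance from a different machine that appears under the same address, is dropped without any reply.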
