Learn where some of the network sysctl variables fit into the Linux kernel's network flow. Translations: 🇷🇺
BSD-3-CLAUSE License
Sometimes people look for sysctl cargo-cult values that bring high throughput and low latency with no trade-offs and that work on every occasion. That's not realistic, although we can say that newer kernel versions are very well tuned by default. In fact, you might hurt performance if you mess with the defaults.
This brief tutorial shows where some of the most used and quoted sysctl/network parameters are located in the Linux network flow. It was heavily inspired by the illustrated guide to the Linux networking stack and many of Marek Majkowski's posts.
Feel free to send corrections and suggestions! :)
Ingress - they're coming:

1. Packets arrive at the NIC
2. NIC verifies `MAC` (if not on promiscuous mode) and `FCS` and decides to drop or to continue
3. NIC DMAs packets into RAM, in a region previously prepared (mapped) by the driver
4. NIC enqueues references to the packets at the receive ring buffer queue `rx` until the `rx-usecs` timeout or `rx-frames` is reached
5. NIC raises a `hard IRQ`
6. CPU runs the `IRQ handler` that runs the driver's code
7. Driver schedules a NAPI, clears the `hard IRQ` and returns
8. Driver raises a `soft IRQ (NET_RX_SOFTIRQ)`
9. NAPI polls data from the receive ring buffer until the `netdev_budget_usecs` timeout, or `netdev_budget` and `dev_weight` packets are reached
10. Linux allocates memory for the `sk_buff` and fills in the metadata
11. Linux passes the skb to the kernel stack (`netif_receive_skb`)
12. It sets the network header, clones the `skb` to taps (i.e. tcpdump) and passes it to tc ingress
13. Packets are handled to a qdisc sized `netdev_max_backlog` with its algorithm defined by `default_qdisc`
14. It calls `ip_rcv` and packets are handed to IP
15. It calls netfilter (`PREROUTING`)
16. It looks at the routing table, to decide forwarding or local delivery
17. If it's local, it calls netfilter (`LOCAL_IN`)
18. It calls the L4 protocol (for instance `tcp_v4_rcv`)
19. It enqueues the packet to the receive buffer, sized by the `tcp_rmem` rules
20. If `tcp_moderate_rcvbuf` is enabled, the kernel will auto-tune the receive buffer

Egress - they're leaving:

1. Application sends a message (`sendmsg` or other)
2. It enqueues the skb to the socket write buffer of `tcp_wmem` size
3. Calls the L3 handler (in this case `ipv4` on `tcp_write_xmit` and `tcp_transmit_skb`)
4. L3 (`ip_queue_xmit`) does its work: builds the IP header and calls netfilter (`LOCAL_OUT`)
5. Calls the output route action
6. Calls netfilter (`POST_ROUTING`)
7. Fragments the packet (`ip_output`)
8. Calls the L2 send function (`dev_queue_xmit`)
9. Feeds the output (qdisc) queue of `txqueuelen` length with its algorithm `default_qdisc`
10. The driver code enqueues the packets at the `ring buffer tx`
11. The driver does a `soft IRQ (NET_TX_SOFTIRQ)` after the `tx-usecs` timeout or `tx-frames` is reached
12. Re-enables the `hard IRQ` to the NIC
13. After the transmission, the NIC raises a `hard IRQ` to signal its completion
14. The driver handles this IRQ (turns it off) and schedules (`soft IRQ`) the NAPI poll system

If you want to see the network tracing within Linux you can use perf:
docker run -it --rm --cap-add SYS_ADMIN --entrypoint bash ljishen/perf
apt-get update
apt-get install iputils-ping
# this is going to trace all events (not syscalls) to the subsystem net:* while performing the ping
perf trace --no-syscalls --event 'net:*' ping globo.com -c1 > /dev/null
- Check ring buffer sizes: `ethtool -g ethX`
- Change ring buffer sizes: `ethtool -G ethX rx value tx value`
- Monitor errors and drops: `ethtool -S ethX | grep -e "err" -e "drop" -e "over" -e "miss" -e "timeout" -e "reset" -e "restar" -e "collis" | grep -v "\: 0"`
- Check interrupt coalescing: `ethtool -c ethX`
- Change interrupt coalescing: `ethtool -C ethX rx-usecs value tx-usecs value`
- Monitor IRQs per CPU: `cat /proc/interrupts`
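As a rough sketch of reading `/proc/interrupts`, the awk below sums per-CPU IRQ counts for a NIC's queues. The interface name `eth0` and all counters are made-up sample data, not real output; on a real machine, feed `/proc/interrupts` itself to the same awk.

```shell
# "eth0" and the counters below are fabricated sample data; on a real
# box run the awk against /proc/interrupts directly.
sample='           CPU0       CPU1
  24:    1000000     200000   PCI-MSI  eth0-rx-0
  25:      50000    3000000   PCI-MSI  eth0-tx-0'

printf '%s\n' "$sample" | awk '
  NR == 1 { ncpu = NF; next }        # header row: count the CPU columns
  /eth0/ {                           # keep only this NIC'\''s queue IRQs
    printf "%s:", $NF
    for (i = 2; i <= ncpu + 1; i++) { printf " cpu%d=%d", i - 2, $i; total[i] += $i }
    print ""
  }
  END {
    printf "total:"
    for (i = 2; i <= ncpu + 1; i++) printf " cpu%d=%d", i - 2, total[i]
    print ""
  }'
```

A heavily skewed total (almost all IRQs landing on one CPU) is a hint to look at IRQ affinity (`/proc/irq/*/smp_affinity`) or irqbalance.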
`netdev_budget_usecs` - the maximum number of microseconds in one NAPI polling cycle. Polling will exit when either `netdev_budget_usecs` have elapsed during the poll cycle or the number of packets processed reaches `netdev_budget`.

- How to check: `sysctl net.core.netdev_budget_usecs`
- How to change: `sysctl -w net.core.netdev_budget_usecs value`
- How to monitor: `cat /proc/net/softnet_stat`; or a better tool. Watch `dropped` (# of packets that were dropped because `netdev_max_backlog` was exceeded) and `squeezed` (# of times ksoftirq ran out of `netdev_budget` or time slice with work remaining).
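The `softnet_stat` counters are hexadecimal, one row per CPU. A minimal sketch of decoding the first three columns (processed, dropped, time_squeeze); the row below is made-up sample data, and the meaning of the later columns varies by kernel version, so only the first three are decoded:

```shell
# Fabricated sample row; on a real box read /proc/net/softnet_stat.
# Column 1 = packets processed, 2 = dropped (netdev_max_backlog hit),
# 3 = time_squeeze (ran out of netdev_budget or time slice).
sample='00358fe3 00000002 0000001c 00000000 00000000 00000000 00000000 00000000 00000000 00000000'

cpu=0
printf '%s\n' "$sample" | while read -r processed dropped squeezed _rest; do
  # bash printf %d accepts 0x-prefixed hex constants
  printf 'cpu%d processed=%d dropped=%d squeezed=%d\n' \
    "$cpu" "0x$processed" "0x$dropped" "0x$squeezed"
  cpu=$((cpu + 1))
done
```

Non-zero `dropped` suggests raising `netdev_max_backlog`; a steadily growing `squeezed` suggests looking at `netdev_budget` / `netdev_budget_usecs`.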
`netdev_budget` is the maximum number of packets taken from all interfaces in one polling cycle (NAPI poll). In one polling cycle, interfaces which are registered to polling are probed in a round-robin manner. Also, a polling cycle may not exceed `netdev_budget_usecs` microseconds, even if `netdev_budget` has not been exhausted.

- How to check: `sysctl net.core.netdev_budget`
- How to change: `sysctl -w net.core.netdev_budget value`
- How to monitor: `cat /proc/net/softnet_stat`; or a better tool
`dev_weight` is the maximum number of packets the kernel can handle on a NAPI interrupt; it is a per-CPU variable. For drivers that support LRO or GRO_HW, a hardware-aggregated packet counts as one packet here.

- How to check: `sysctl net.core.dev_weight`
- How to change: `sysctl -w net.core.dev_weight value`
- How to monitor: `cat /proc/net/softnet_stat`; or a better tool
`netdev_max_backlog` is the maximum number of packets queued on the INPUT side (the ingress qdisc) when the interface receives packets faster than the kernel can process them.

- How to check: `sysctl net.core.netdev_max_backlog`
- How to change: `sysctl -w net.core.netdev_max_backlog value`
- How to monitor: `cat /proc/net/softnet_stat`; or a better tool
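Note that changes made with `sysctl -w` do not survive a reboot. A hedged sketch of persisting a value via a drop-in file (the number below is purely illustrative, not a recommendation; check the `dropped` column of `softnet_stat` before and after raising it):

```
# /etc/sysctl.d/90-net-backlog.conf -- illustrative value only
net.core.netdev_max_backlog = 2000
```

Apply it without rebooting via `sysctl --system`.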
`txqueuelen` is the maximum number of packets queued on the OUTPUT side.

- How to check: `ip link show dev ethX`
- How to change: `ip link set dev ethX txqueuelen N`
- How to monitor: `ip -s link`
`default_qdisc` is the default queuing discipline to use for network devices.

- How to check: `sysctl net.core.default_qdisc`
- How to change: `sysctl -w net.core.default_qdisc value`
- How to monitor: `tc -s qdisc ls dev ethX`
The policy that defines what counts as memory pressure is specified by `tcp_mem` and `tcp_moderate_rcvbuf`.
`tcp_rmem` - min (size used under memory pressure), default (initial size), max (maximum size) - size of the receive buffer used by TCP sockets.

- How to check: `sysctl net.ipv4.tcp_rmem`
- How to change: `sysctl -w net.ipv4.tcp_rmem="min default max"`; when changing the default value, remember to restart your user-space app (i.e. your web server, nginx, etc)
- How to monitor: `cat /proc/net/sockstat`
`tcp_wmem` - min (size used under memory pressure), default (initial size), max (maximum size) - size of the send buffer used by TCP sockets.

- How to check: `sysctl net.ipv4.tcp_wmem`
- How to change: `sysctl -w net.ipv4.tcp_wmem="min default max"`; when changing the default value, remember to restart your user-space app (i.e. your web server, nginx, etc)
- How to monitor: `cat /proc/net/sockstat`
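To pick a `max` for `tcp_rmem`/`tcp_wmem`, a common rule of thumb is the bandwidth-delay product (BDP) of the path: a window smaller than the BDP cannot keep the path full. A sketch, where the link rate and RTT are assumptions for illustration:

```shell
# Sketch: bandwidth-delay product as a rough upper bound for the "max"
# field of tcp_rmem/tcp_wmem. Rate and RTT below are assumed values.
rate_bits=$((1000 * 1000 * 1000))   # assumed 1 Gbit/s path
rtt_ms=100                          # assumed 100 ms round-trip time

# BDP in bytes = (rate in bytes per second) * (RTT in seconds)
bdp_bytes=$((rate_bits / 8 * rtt_ms / 1000))
echo "BDP = $bdp_bytes bytes"       # prints: BDP = 12500000 bytes
```

Remember the kernel auto-tunes the buffer between min and max, so the BDP only informs the upper bound, not the default.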
`tcp_moderate_rcvbuf` - if set, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer.

- How to check: `sysctl net.ipv4.tcp_moderate_rcvbuf`
- How to change: `sysctl -w net.ipv4.tcp_moderate_rcvbuf value`
- How to monitor: `cat /proc/net/sockstat`
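A sketch of pulling the TCP buffer memory out of a `/proc/net/sockstat`-style line; the `mem` field is counted in pages (typically 4096 bytes). The line below is made-up sample data; on a real box use `grep '^TCP:' /proc/net/sockstat`:

```shell
# Fabricated sample; real source: grep '^TCP:' /proc/net/sockstat
sample='TCP: inuse 14 orphan 0 tw 2 alloc 20 mem 3'

# Fields come in key/value pairs after "TCP:", so scan the even
# positions for the "mem" key and print the value that follows it.
pages=$(printf '%s\n' "$sample" |
  awk '{ for (i = 2; i < NF; i += 2) if ($i == "mem") print $(i + 1) }')
echo "TCP buffer memory: $((pages * 4096)) bytes"   # assumes 4 KiB pages
```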
Accept and SYN Queues are governed by net.core.somaxconn and net.ipv4.tcp_max_syn_backlog. Nowadays net.core.somaxconn caps both queue sizes.
- `sysctl net.core.somaxconn` - provides an upper limit on the value of the backlog parameter passed to the `listen()` function, known in userspace as `SOMAXCONN`. If you change this value, you should also change your application to a compatible value (i.e. nginx backlog).
- `cat /proc/sys/net/ipv4/tcp_fin_timeout` - the number of seconds to wait for a final FIN packet before the socket is forcibly closed. This is strictly a violation of the TCP specification but required to prevent denial-of-service attacks.
- `cat /proc/sys/net/ipv4/tcp_available_congestion_control` - shows the available congestion control choices that are registered.
- `cat /proc/sys/net/ipv4/tcp_congestion_control` - sets the congestion control algorithm to be used for new connections.
- `cat /proc/sys/net/ipv4/tcp_max_syn_backlog` - sets the maximum number of queued connection requests which have still not received an acknowledgment from the connecting client; if this number is exceeded, the kernel will begin dropping requests.
- `cat /proc/sys/net/ipv4/tcp_syncookies` - enables/disables SYN cookies, useful for protecting against SYN flood attacks.
- `cat /proc/sys/net/ipv4/tcp_slow_start_after_idle` - enables/disables TCP slow start after idle.

How to monitor:
- `netstat -atn | awk '/tcp/ {print $6}' | sort | uniq -c` - summary by state
- `ss -neopt state time-wait | wc -l` - counters by a specific state: `established`, `syn-sent`, `syn-recv`, `fin-wait-1`, `fin-wait-2`, `time-wait`, `closed`, `close-wait`, `last-ack`, `listening`, `closing`
- `netstat -st` - TCP stats summary
- `nstat -a` - human-friendly TCP stats summary
- `cat /proc/net/sockstat` - summarized socket stats
- `cat /proc/net/tcp` - detailed stats, see each field's meaning at the kernel docs
- `cat /proc/net/netstat` - `ListenOverflows` and `ListenDrops` are important fields to keep an eye on
- `cat /proc/net/netstat | awk '(f==0) { i=1; while ( i<=NF) {n[i] = $i; i++ }; f=1; next} (f==1){ i=2; while ( i<=NF){ printf "%s = %d\n", n[i], $i; i++}; f=0} ' | grep -v "= 0"` - a human readable `/proc/net/netstat`
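On the interaction between `somaxconn` and an application's `listen()` backlog mentioned above: the kernel silently caps the accept queue at `net.core.somaxconn`, so raising only the application side does nothing. A sketch with assumed values:

```shell
# Sketch: the accept queue a listening socket actually gets is
# min(listen() backlog, net.core.somaxconn). Both numbers below are
# assumptions; read the real cap with: sysctl -n net.core.somaxconn
somaxconn=4096      # assumed sysctl value
app_backlog=8192    # assumed application backlog (e.g. nginx listen backlog)

effective=$((app_backlog < somaxconn ? app_backlog : somaxconn))
echo "effective accept-queue limit: $effective"   # prints: 4096
```

When the effective limit is exceeded faster than the app calls `accept()`, `ListenOverflows`/`ListenDrops` in `/proc/net/netstat` start climbing.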
Source: https://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg