Performance Optimization

Disk Subsystem

When using storage on local disks or RAID arrays, pay attention to the I/O subsystem settings. Below are some recommendations that are applicable in most cases.

Use the deadline scheduler for all block devices responsible for storing content

You can install the scheduler for the sdX device with the following command:

echo 'deadline' > /sys/block/sdX/queue/scheduler

Keep in mind that these settings are reset after an OS restart and must be applied again. To make them persistent, you can add the commands to /etc/rc.local or create an init.d script.
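For example, a line like the following could be added to /etc/rc.local (the device name sdb is illustrative; replace it with the disk that stores your content):

echo deadline > /sys/block/sdb/queue/scheduler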

To configure the scheduler globally, add the elevator=deadline parameter to the kernel command line string GRUB_CMDLINE_LINUX in the file /etc/default/grub, then rebuild the grub configuration (grub.cfg). For example:

GRUB_TIMEOUT=5

GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"

GRUB_DEFAULT=saved

GRUB_DISABLE_SUBMENU=true

GRUB_TERMINAL_OUTPUT="console"

GRUB_CMDLINE_LINUX="vconsole.font=latarcyrheb-sun16 vconsole.keymap=us rd.lvm.lv=vgroot/root elevator=deadline crashkernel=auto rhgb quiet"

GRUB_DISABLE_RECOVERY="true"

You can rebuild the bootloader configuration with these commands:

  • on servers with BIOS: # grub2-mkconfig -o /boot/grub2/grub.cfg
  • on servers with UEFI (RHEL7): # grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
  • on servers with UEFI (CentOS7): # grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

Ensure that you have enough RAM

The amount of RAM should be sufficient to keep the "hot" data in the OS PageCache. The amount of memory currently used by the PageCache can be checked with the free command (the "buff/cache" or "cache" column, depending on the OS version):

# free -m

             total        used        free      shared  buff/cache   available

Mem:         257756        5999        2003      241580      249753        8852

Swap:             0           0           0

Zero or low values here (combined with a near-zero value in the "free" column) can be a sign of memory shortage and the associated performance degradation.

Avoid using swap. If it cannot be disabled, reduce the probability of using it:

sysctl vm.swappiness=0

echo "vm.swappiness=0" >> /etc/sysctl.conf

Select the appropriate readahead values

Read Ahead can increase the speed of data access by reading several blocks in a row instead of only the requested block. Reading is done into the PageCache (we do not consider the case of the O_DIRECT flag, because it is inappropriate for the expected load), so a request for the next block will not cause an I/O operation but will be served from RAM.

Setting the readahead value too small increases the number of IOPS for the same number of read requests. Setting it too large can evict useful data from the PageCache before it is used, replacing it with unnecessary blocks (blocks that will never be used, or that will themselves be evicted from the cache by another request before they are needed).

The readahead value can be applied to a block device. You can find out the current value for the sdX device as follows:

blockdev --getra /dev/sdX

Set the value for the sdX device:

blockdev --setra N /dev/sdX

where N is the number of sectors (512 bytes).
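For example, to set readahead to 2 MiB (4096 sectors) for the sdb device (the device name and the value are illustrative; choose the value based on your own testing):

blockdev --setra 4096 /dev/sdb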

Keep in mind that these settings are reset after an OS restart and must be applied again. To make them persistent, you can add the commands to /etc/rc.local or create an init.d script.

Use the optimal file system settings

Disable file access time updates (atime) and write barriers:

mount -o remount,noatime,nobarrier /your/mountpoint

Do not forget to make the appropriate changes to /etc/fstab.
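For example, an /etc/fstab entry might look like this (the device, mount point, and file system are illustrative):

/dev/sdb1  /storage  xfs  defaults,noatime,nobarrier  0 0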

If you are using the XFS file system on top of a RAID array, optimize the file system settings for the array you are using. In particular, set the sunit, swidth, agcount, and agsize values according to the number of disks and the stripe size of your array. More information can be found in the XFS FAQ.
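As an illustration, for a hypothetical RAID6 array of 10 disks (8 data disks) with a 256 KiB stripe unit, the file system could be created as follows (the values and the device name /dev/md0 are illustrative and must match your actual array geometry):

mkfs.xfs -d su=256k,sw=8 /dev/md0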

The XFS file system allows you to put the journal on a separate partition. This can increase performance (for example, if you use a solid-state drive (SSD) for the journal) or reduce it (if a single device holds the journals of several file systems but does not provide enough performance).

File deletion on XFS is slow, so we do not recommend using it with a "file-per-chunk" storage scheme. Use the ext4 file system or large segments instead.

Note that most of the parameters that optimize XFS performance can only be set at the time the file system is created and cannot be changed at mount time.

For more information about XFS, see man mkfs.xfs and man mount.

Use WriteBack for hardware RAID controllers with a backup battery (BBU)

WriteBack significantly increases write performance by combining several small write operations into a single one. It is especially important for RAID5/RAID6 arrays, where the checksum must also be recalculated and rewritten every time a single block is written.

DO NOT USE WRITE-BACK MODE WITHOUT A BACKUP BATTERY OR WITH A FAILED BATTERY! THIS MAY LEAD TO DATA LOSS!

Align the HDD partitions

Sometimes, when using disks with Advanced Format, you need to align the partitions so that the beginning of a partition coincides with the beginning of a physical sector; otherwise performance degradation is possible.

To check whether partitions are created correctly, you can use the parted utility:

root@smartmedia:~# parted /dev/sdb "align-check"

alignment type(min/opt) [optimal]/minimal?

Partition number? 1

1 aligned

To automatically align the partition at the creation stage, pass the -a optimal option to the parted utility:

root@smartmedia:~# parted -a optimal /dev/sdb

GNU Parted 3.2

Using /dev/sdb

Welcome to GNU Parted! Type 'help' to view a list of commands.

(parted)

For more details about parted, see man parted or https://www.gnu.org/software/parted/manual/html_node/index.html.

Monitor the hardware

The smartctl system utility and the smartd service allow you to monitor the state of block devices: the number of corrupted HDD blocks, SSD health, and so on.

smartctl can be used both for monitoring individual disks and, in most cases, for monitoring disks that are part of a hardware RAID array.
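For example, to view the full SMART report of a single disk, or of a disk behind an LSI/MegaRAID controller (the device names and the megaraid device number are illustrative):

smartctl -a /dev/sda

smartctl -a -d megaraid,0 /dev/sda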

Network Subsystem

Use the manufacturer’s recommended drivers and firmware for network adapters

Low performance of the network subsystem is often caused by the drivers and low-level firmware of network adapters (NICs). Use the recommended driver and firmware versions (as a rule, these are the latest publicly available versions).

Reduce the number of interrupts (IRQs) and distribute interrupts across multiple cores

Most modern network adapters (NICs) allow you to configure the interrupt frequency and the number of send (Tx) and receive (Rx) queues depending on the number of CPU cores. In some cases, the number of queues is set automatically, but for some adapters this value must be set manually when the NIC driver is loaded.
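For adapters supported by ethtool, the number of queues can usually be viewed and changed as follows (the interface name eth0 and the queue count are illustrative; not every NIC or driver supports these options):

ethtool -l eth0

ethtool -L eth0 combined 8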

Use the irqbalance service to distribute interrupts across CPU cores. Do not allow a large number of interrupts to be concentrated on one core.

In some cases, irqbalance redistributes interrupts too often, which can result in more NUMA memory accesses than necessary and in performance degradation. In this case, it may make sense to stop the irqbalance service and redistribute the interrupts manually.
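For example, to manually pin a specific interrupt to CPU core 2 (the IRQ number 90 and the CPU mask 4 are illustrative; take the actual IRQ numbers from /proc/interrupts):

echo 4 > /proc/irq/90/smp_affinity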

To see the current distribution of interrupts across CPU cores (the total number of interrupts processed by each core), run the following command:

cat /proc/interrupts

Unfortunately, Linux does not provide a way to reset the interrupt counters after the interrupts have been redistributed.

Use the hardware acceleration capabilities of network adapters

In most cases, TCP and UDP offloading reduces the CPU load spent on calculating TCP/UDP checksums and on segmentation and reassembly, and also reduces the number of interrupts, which lowers the CPU load even further.
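The current offloading settings can be viewed and changed with ethtool, for example (the interface name eth0 is illustrative; the set of supported features depends on the NIC and driver):

ethtool -k eth0

ethtool -K eth0 tso on gso on gro on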

Optimize the OS network stack

Linux provides a wide range of network subsystem settings via procfs or using the sysctl utility.
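As an illustration, socket buffer sizes are among the commonly tuned parameters (the values below are illustrative rather than recommendations; choose them based on your own testing):

sysctl -w net.core.rmem_max=16777216

sysctl -w net.core.wmem_max=16777216

sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"

sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"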

For more information about OS network stack settings, see here.

Choose the best congestion control algorithm for TCP

The Linux network stack can use various congestion control algorithms for TCP. The list of algorithms loaded into the kernel and the currently used algorithm can be obtained using the sysctl utility:

# sysctl net.ipv4.tcp_available_congestion_control

net.ipv4.tcp_available_congestion_control = cubic reno

# sysctl net.ipv4.tcp_congestion_control

net.ipv4.tcp_congestion_control = cubic

The list of available modules in the system can be obtained as follows:

# find /lib/modules/`uname -r`/kernel/net/ipv4 -name tcp_*

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_diag.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_hybla.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_westwood.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_illinois.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_lp.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_veno.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_highspeed.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_dctcp.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_scalable.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_vegas.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_yeah.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_bic.ko

/lib/modules/3.10.0-514.el7.x86_64/kernel/net/ipv4/tcp_htcp.ko

You can load a module into the system using the modprobe command:

# modprobe tcp_yeah

# lsmod | grep yeah

tcp_yeah               12635  0

tcp_vegas              13839  1 tcp_yeah

Depending on the characteristics of the data transmission channel (RTT, packet loss, throughput), different algorithms will show different results. For example, mechanisms that work well at low losses (for example, cubic) can give unsatisfactory results at high losses (> 3–5%) and large RTT (> 100ms), and vice versa.

According to our observations, the yeah algorithm is suitable for most circumstances, but we recommend conducting your own testing and choosing the algorithm that works best for your conditions.
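For example, to switch to the yeah algorithm and persist the setting across reboots (assuming the tcp_yeah module has been loaded as shown above):

sysctl net.ipv4.tcp_congestion_control=yeah

echo "net.ipv4.tcp_congestion_control=yeah" >> /etc/sysctl.conf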

HTTP Server (nginx)

Turn on the use of sendfile

Using the sendfile system call significantly reduces the number of memory copy operations, lowers the CPU load, and increases overall system performance.

When using HTTPS or data compression (gzip), sendfile is not used, even if the appropriate setting is present (because the data must be encrypted and/or compressed before sending to the socket).

To enable sendfile globally, in the http section of the nginx configuration file, specify:

http {

   sendfile on;

}

The sendfile system call allows you to initiate the data transfer from a file descriptor to a TCP socket (or other non-blocking socket). The call returns either after all the requested data has been sent or when the socket buffer becomes full and is no longer ready for transmission.

For very fast connections this can mean that a blocking sendfile call completes only after the whole file has been transferred. To avoid this, specify the maximum amount of data that nginx will try to send in one sendfile call:

http {

   sendfile_max_chunk 131072;

}

Decreasing sendfile_max_chunk can lead to too many read operations (this can be partially compensated by readahead, see above), while increasing it can block the main execution thread or a worker thread (see below) for too long while data is being sent.

Enable AIO threads on servers that perform blocking I/O

All read and write operations on block devices (disks, RAID arrays, SSDs, etc.) in Linux are blocking: the main execution thread goes into the D-state and waits for the operation to complete. In the case of nginx, where one thread serves a large number of client requests, this means that a single I/O-intensive request can block all the others.

To prevent the thread from being blocked by an I/O task, Linux supports asynchronous operations (AIO). However, using AIO requires disabling the PageCache (operations must be performed with the O_DIRECT flag) and has a number of other drawbacks.

To work around the limitations of system AIO, nginx (since version 1.7.11) supports the use of a thread pool for I/O operations. All blocking operations can be moved to a separate thread pool, which eliminates blocking of the main execution thread.

To enable the thread pool:

  1. Configure the thread pool in the main context of the configuration file, for example: thread_pool iopool threads=32 max_queue=65536; 
  2. Enable AIO for the desired location:

location /video/ {

   sendfile on;

   aio threads=iopool;

}
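Putting both steps together, the relevant parts of nginx.conf might look like this (the pool name iopool, the thread count, and the location are illustrative):

thread_pool iopool threads=32 max_queue=65536;

http {
   server {
      location /video/ {
         sendfile on;
         aio threads=iopool;
      }
   }
}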

It does not make sense to enable AIO threads for locations that read from tmpfs (provided swap is not used).

Whether it makes sense to use AIO threads for locations that proxy data (without caching) depends on the probability of that data being written to a "slow" device (disk/SSD), that is, on the data type and the total system load.

Select a sufficient number of workers

Typically, the optimal number of workers is equal to the number of CPU cores, or to the number of hardware threads if Hyper-Threading is enabled.

Increasing the number of workers makes sense only if the workers spend time waiting for I/O (the so-called "D-state"), but in that case it is usually better to use thread pools.

Optimize the nginx service performance

Nginx has a large number of configuration options that allow you to tune its interaction with the network stack, the OS, and memory for a specific workload.

A basic set of settings might look like this:

timer_resolution 1ms;

events {

   worker_connections 10240;

   use epoll;

}

http {

   keepalive_timeout 60;

   tcp_nodelay on;

   tcp_nopush on;

}
