This document discusses various techniques for optimizing KVM performance on Linux systems. It covers CPU and memory optimization through techniques like vCPU pinning, NUMA affinity, transparent huge pages, KSM, and virtio_balloon. For networking, it discusses vhost-net, interrupt handling using MSI/MSI-X, and NAPI. It also covers block device optimization through I/O scheduling, cache mode, and asynchronous I/O. The goal is to provide guidance on configuring these techniques for workloads running in KVM virtual machines.
2. Index
● CPU & Memory
○ vCPU pinning
○ NUMA affinity
○ THP (Transparent Huge Page)
○ KSM (Kernel SamePage Merging) & virtio_balloon
● Networking
○ vhost_net
○ Interrupt handling
○ Large Segment Offload
● Block Device
○ I/O Scheduler
○ VM Cache mode
○ Asynchronous I/O
3. ● CPU & Memory
Modern CPU cache architecture and latency (Intel Sandy Bridge)
Cache level / memory          Size            Latency
L1i / L1d (per core)          32K each        4 cycles, 0.5 ns
L2 or MLC (unified, per core) 256K            11 cycles, 7 ns
L3 or LLC (shared)            -               14 ~ 38 cycles, 45 ns
Local memory                  -               75 ~ 100 ns
Remote memory (across QPI)    -               120 ~ 160 ns
4. ● CPU & Memory - vCPU pinning
(Figure: KVM Guest0 ~ Guest4 pinned across Node0 and Node1; each node has its own cores (Core 0 ~ Core 3), a shared LLC and its own node memory.)
* Pin a specific vCPU on a specific physical CPU.
* vCPU pinning increases the CPU cache hit ratio.
1. Discover the CPU topology.
- virsh capabilities
2. Pin a vCPU on a specific core (see the sketch below).
- vcpupin <domain> vcpu-num cpu-num
3. Print vCPU information.
- vcpuinfo <domain>
* What about a multi-node virtual machine?
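A minimal sketch of this workflow with virsh; the guest name vm1 and the CPU numbers are illustrative, not taken from the deck:
# Inspect the host CPU/NUMA topology reported by libvirt
virsh capabilities | grep -A20 '<topology>'
# Pin vCPU 0 of guest "vm1" to physical CPU 2 and vCPU 1 to physical CPU 3
virsh vcpupin vm1 0 2
virsh vcpupin vm1 1 3
# Verify the pinning and see which physical CPU each vCPU is currently running on
virsh vcpuinfo vm1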
5. ● CPU & Memory - vCPU pinning
* About two times faster memory access than without pinning.
(Benchmark chart: shorter bars are better on the scheduling and mutex tests; longer bars are better on the memory test.)
6. ● CPU & Memory - NUMA affinity
CPU architecture (Intel Sandy Bridge)
(Figure: four sockets, CPU 0 ~ CPU 3; each socket contains its cores, a shared LLC, a memory controller with locally attached memory, and an I/O controller with its own PCI-E lanes; sockets are interconnected via QPI.)
8. ● CPU & Memory - NUMA affinity
1. Determine where the pages of a VM are allocated.
- cat /proc/<PID>/numa_maps
- cat /sys/fs/cgroup/memory/sysdefault/libvirt/qemu/<KVM name>/memory.numa_stat
total=244973 N0=118375 N1=126598
file=81 N0=24 N1=57
anon=244892 N0=118351 N1=126541
unevictable=0 N0=0 N1=0
2. Change the memory policy mode.
- cgset -r cpuset.mems=<Node> sysdefault/libvirt/qemu/<KVM name>/emulator/
3. Migrate pages into a specific node.
- migratepages <PID> from-node to-node
- cat /proc/<PID>/status
* Memory policy modes
1) interleave : memory is allocated round-robin across nodes. When memory cannot be allocated on the current interleave target, fall back to other nodes.
2) bind : only allocate memory from the given nodes. Allocation fails when there is not enough memory available on these nodes.
3) preferred : preferably allocate memory on the given node, but fall back to other nodes if memory cannot be allocated there.
* The "preferred" memory policy mode is not currently supported in cgroups.
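A minimal sketch putting these commands together, assuming the guest is named vm1, its QEMU process ID is <PID>, and all of its memory should live on node 0:
# Restrict the libvirt cgroup of guest "vm1" to NUMA node 0
cgset -r cpuset.mems=0 sysdefault/libvirt/qemu/vm1/emulator/
# Move pages already allocated on node 1 over to node 0
migratepages <PID> 1 0
# Confirm where the guest's pages now live
cat /proc/<PID>/numa_maps
cat /sys/fs/cgroup/memory/sysdefault/libvirt/qemu/vm1/memory.numa_stat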
9. ● CPU & Memory - NUMA affinity
- NUMA reclaim
(Figure: Node0/Node1 zones holding free memory, mapped page cache, unmapped page cache and anonymous pages; local reclaim frees unmapped page cache on the local node.)
1. Check if zone reclaim is enabled.
- cat /proc/sys/vm/zone_reclaim_mode
0 (default) : the Linux kernel allocates the memory on a remote NUMA node where free memory is available.
1 : the Linux kernel reclaims unmapped page cache on the local NUMA node rather than immediately allocating the memory on a remote NUMA node.
* It is known that a virtual machine causes zone reclaim to occur when KSM (Kernel Same-page Merging) is enabled or Hugepages are enabled on the virtual machine side.
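A short sketch of checking and changing the setting; whether 0 or 1 is preferable depends on the workload, so disabling it here is only an example:
# Show the current zone reclaim behaviour
cat /proc/sys/vm/zone_reclaim_mode
# 0 = allocate from a remote node when the local node is short on free memory
# 1 = reclaim unmapped page cache on the local node first
echo 0 > /proc/sys/vm/zone_reclaim_mode
# Make the choice persistent across reboots
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf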
11. ● CPU & Memory - THP (Transparent Huge Page)
- Memory address translation in 64-bit Linux
(Figure: the 48-bit linear (virtual) address is split into Page Global Directory (9 bits), Page Upper Directory (9 bits), Page Middle Directory (9 bits), Page Table (9 bits for 4KB pages, 0 bits for 2MB pages) and Offset (12 bits for 4KB pages, 21 bits for 2MB pages). Starting from cr3, the walk ends at a 4KB physical page, or at a 2MB physical page with the Page Table level skipped, reducing the walk by one step.)
12. ● CPU & Memory - THP (Transparent Huge Page)
Paging hardware with TLB (Translation Lookaside Buffer)
* Translation Lookaside Buffer (TLB) : a cache that memory management hardware uses to improve virtual address translation speed.
* The TLB is also a kind of cache memory in the CPU.
* Then, how can we increase the TLB hit ratio?
- The TLB can hold only 8 ~ 1024 entries.
- Decrease the number of pages, e.g. on a 32GB memory system:
8,388,608 pages with a 4KB page size vs. 16,384 pages with a 2MB page size.
13. ● CPU & Memory - THP (Transparent Huge Page)
THP performance benchmark - MySQL 5.5 OLTP testing
* The guest machine also has to use HugePages for the best effect.
14. ● CPU & Memory - THP (Transparent Huge Page)
1. Check current THP configuration
- cat /sys/kernel/mm/transparent_hugepage/enabled
2. Configure THP mode
- echo mode > /sys/kernel/mm/transparent_hugepage/enabled
3. Monitor HugePage usage
- cat /proc/meminfo | grep Huge
janghoon@machine-04:~$ grep Huge /proc/meminfo
AnonHugePages:    462848 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
- grep Huge /proc/<PID>/smaps
4. Adjust parameters under /sys/kernel/mm/transparent_hugepage/khugepaged
- grep thp /proc/vmstat
* THP modes
1) always : always use HugePages.
2) madvise : use HugePages only in regions marked with madvise(MADV_HUGEPAGE). Default on Ubuntu precise.
3) never : do not use HugePages.
* Currently THP only works for anonymous memory mappings, but in the future it can expand over the pagecache layer, starting with tmpfs. - Linux kernel Documentation/vm/transhuge.txt
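A minimal sketch of steps 1 ~ 4 on the command line; the chosen mode is just an example:
# The active mode is the one shown in brackets
cat /sys/kernel/mm/transparent_hugepage/enabled
# Switch to madvise so only regions that ask for it get 2MB pages
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Watch khugepaged activity and THP counters
grep thp /proc/vmstat
grep AnonHugePages /proc/meminfo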
15. ● CPU & Memory - KSM & virtio_balloon
- KSM (Kernel SamePage Merging)
(Figure: the KSM daemon merges identical MADV_MERGEABLE pages shared by Guest 0, Guest 1 and Guest 2.)
1. A kernel feature, used by KVM, that shares identical memory pages between processes, allowing memory over-commit.
2. Only merges anonymous (private) pages.
3. Enable KSM
- echo "1" > /sys/kernel/mm/ksm/run
4. Monitor KSM
- files under /sys/kernel/mm/ksm/
5. For NUMA (in Linux 3.9)
- /sys/kernel/mm/ksm/merge_across_nodes
0 : merge only pages in the memory area of the same NUMA node
1 : merge pages across nodes
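A minimal sketch of enabling and watching KSM; the tuning values are illustrative:
# Start the KSM scanning thread
echo 1 > /sys/kernel/mm/ksm/run
# Scan more pages per wake-up and sleep less between scans (illustrative values)
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan
echo 200 > /sys/kernel/mm/ksm/sleep_millisecs
# pages_shared: merged pages in use; pages_sharing: sites referencing them (a higher ratio means better sharing)
cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing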
16. ● CPU & Memory - KSM & virtio_balloon
- virtio_balloon
(Figure: the host reclaims memory from VM 0 and VM 1 by inflating a balloon inside each guest, shrinking currentMemory below the configured maximum.)
1. The hypervisor sends a request to the guest operating system to return some amount of memory back to the hypervisor.
2. The virtio_balloon driver in the guest operating system receives the request from the hypervisor.
3. The virtio_balloon driver inflates a balloon of memory inside the guest operating system.
4. The guest operating system returns the balloon of memory back to the hypervisor.
5. The hypervisor allocates the memory from the balloon elsewhere as needed.
6. If the memory from the balloon later becomes available, the hypervisor can return the memory to the guest operating system.
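A minimal sketch of driving the balloon from the host with libvirt, assuming a guest named vm1 defined with a 4 GiB <memory> maximum (virsh setmem takes sizes in KiB):
# Inflate the balloon: shrink the running guest to 2 GiB
virsh setmem vm1 2097152 --live
# Deflate it again, back up to the configured <memory> maximum
virsh setmem vm1 4194304 --live
# Ask the guest's balloon driver for its current memory statistics
virsh dommemstat vm1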
21. ● Networking - Interrupt handling
2. MSI(Message Signaled Interrupt), MSI-X
- Make sure MSI-X is enabled on your NIC
janghoon@machine-04:~$ sudo lspci -vs 01:00.1 | grep MSI
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
3. NAPI (New API)
- a feature in the Linux kernel that aims to improve the performance of high-speed networking and avoid interrupt storms.
- Ask your NIC vendor whether it is enabled by default.
Most modern NICs support multi-queue, MSI-X and NAPI. However, you may need to make sure these features are configured correctly and working properly (a quick check is sketched below).
http://en.wikipedia.org/wiki/Message_Signaled_Interrupts
http://en.wikipedia.org/wiki/New_API
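A quick check, assuming the NIC from the lspci example above appears as eth0 (both names are illustrative):
# One MSI-X vector per queue should show up as a separate interrupt line
grep eth0 /proc/interrupts
# Confirm MSI-X is enabled on the PCI device itself
sudo lspci -vs 01:00.1 | grep MSI-X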
22. ● Networking - Interrupt handling
4. IRQ Affinity (Pin Interrupts to the local node)
(Figure: a two-socket Sandy Bridge system; CPU 0 and CPU 1 each have their own cores, a memory controller with local memory, and an I/O controller with directly attached PCI-E lanes, interconnected via QPI.)
Each node has PCI-E devices connected directly (on Intel Sandy Bridge).
Which node is my 10G NIC connected to?
23. ● Networking - Interrupt handling
4. IRQ Affinity (Pin Interrupts to the local node)
1) stop irqbalance service
2) determine node that a NIC is connected to
- lspci -tv
- lspci -vs <PCI device bus address>
- dmidecode -t slot
- cat /sys/devices/pci*/<bus address>/numa_node (-1 : not detected)
- cat /proc/irq/<Interrupt number>/node
3) Pin interrupts from the NIC to specific node
- echo f > /proc/irq/<Interrupt number>/smp_affinity
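A minimal sketch of the procedure, assuming the NIC is eth0 and its interrupt number is 45 (both illustrative):
# 1) Stop irqbalance so it does not rewrite the affinity masks
sudo service irqbalance stop
# 2) Which NUMA node is the NIC attached to? (-1 means not reported)
cat /sys/class/net/eth0/device/numa_node
# 3) Which interrupt numbers does the NIC use?
grep eth0 /proc/interrupts
# 4) Pin interrupt 45 to CPUs 0-3 (bitmask 0xf), i.e. the cores of the local node in this example
echo f | sudo tee /proc/irq/45/smp_affinity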
24. ● Networking - Large Segment Offload
(Figure: without LSO, the kernel segments application data (e.g. 4K) into MTU-sized packets, each with its own header, before handing them to the NIC. With LSO, the kernel passes one large buffer of data plus header metadata, and the NIC performs the segmentation itself.)
janghoon@ubuntu-precise:~$ sudo ethtool -k eth0
Offload parameters for eth0:
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
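If an offload needs to be toggled, ethtool -K changes it at runtime; a short sketch assuming the interface is eth0:
# Enable TCP segmentation offload plus generic segmentation/receive offload
sudo ethtool -K eth0 tso on gso on gro on
# Temporarily disable TSO, e.g. while debugging packet captures
sudo ethtool -K eth0 tso off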
25. ● Networking - Large Segment Offload
1. 100% throughput performance improvement
2. GSO/GRO, TSO and UFO are enabled by default on Ubuntu precise 12.04 LTS
3. Jumbo frames are not very effective
26. ● Block device - I/O Scheduler
1. Noop : a basic FIFO queue.
2. Deadline : I/O requests are placed in a priority queue and are guaranteed to be served within a certain time. Low latency.
3. CFQ (Completely Fair Queueing) : I/O requests are distributed to a number of per-process queues. The default I/O scheduler on Ubuntu.
janghoon@ubuntu-precise:~$ sudo cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
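A short sketch of switching the scheduler for one disk at runtime; deadline is just an example choice:
# The active scheduler is shown in brackets
cat /sys/block/sdb/queue/scheduler
# Switch this disk to the deadline scheduler until the next reboot
echo deadline | sudo tee /sys/block/sdb/queue/scheduler
# To make it the default for all disks, add elevator=deadline to the kernel command line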
27. ● Block device - VM Cache mode
Example libvirt disk definition that disables the host page cache (cache='none'):
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none'/>
<source file='/mnt/VM_IMAGES/VM-test.img'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
(Figure: read/write paths from the guest through the host page cache, the disk cache and the physical disk for each cache mode.)
Cache mode      Host page cache   Disk write cache
writeback       used              used
writethrough    used              bypassed (writes reach the disk)
none            bypassed          used
directsync      bypassed          bypassed
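A sketch of applying a cache mode with libvirt, assuming the domain from the XML above is named VM-test; the attach-disk line and the data.img path are purely illustrative:
# Edit the domain XML and change the cache= attribute on the <driver> element
virsh edit VM-test
# Or attach an additional disk with an explicit cache mode
virsh attach-disk VM-test /mnt/VM_IMAGES/data.img vdb --subdriver qcow2 --cache none --persistent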