Swap-over-NBD without deadlocking
From: | Mel Gorman <mgorman@suse.de> | |
To: | Linux-MM <linux-mm@kvack.org>, Linux-Netdev <netdev@vger.kernel.org> | |
Subject: | [PATCH 00/13] Swap-over-NBD without deadlocking | |
Date: | Tue, 26 Apr 2011 08:36:41 +0100 | |
Message-ID: | <1303803414-5937-1-git-send-email-mgorman@suse.de> | |
Cc: | LKML <linux-kernel@vger.kernel.org>, David Miller <davem@davemloft.net>, Neil Brown <neilb@suse.de>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Mel Gorman <mgorman@suse.de> | |
Archive‑link: | Article |
Changelog since V1 o Rebase on top of mmotm o Use atomic_t for memalloc_socks (David Miller) o Remove use of sk_memalloc_socks in vmscan (Neil Brown) o Check throttle within prepare_to_wait (Neil Brown) o Add statistics on throttling instead of printk Swapping over NBD is something that is technically possible but not often advised. While there are number of guides on the internet on how to configure it and nbd-client supports a -swap switch to "prevent deadlocks", the fact of the matter is a machine using NBD for swap can be locked up within minutes if swap is used intensively. The problem is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Ziljstra developed a series of patches that supported swap over an NFS that some distributions are carrying in their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD (the relatively straight-forward case) and uses throttling instead of dynamically resized memory reserves so the series is not too unwieldy for review. Patch 1 serialises access to min_free_kbytes. It's not strictly needed by this series but as the series cares about watermarks in general, it's a harmless fix. It could be merged independently. Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeying memory. Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 6-9 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 10 is a micro-optimisation to avoid a function call in the common case. Patch 11 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 12 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 13 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. Here is the results from netperf using slab as an example NETPERF UDP netperf-udp udp-swapnbd vanilla-slab v1r17-slab 64 178.06 ( 0.00%)* 189.46 ( 6.02%) 1.02% 1.00% 128 355.06 ( 0.00%) 370.75 ( 4.23%) 256 662.47 ( 0.00%) 721.62 ( 8.20%) 1024 2229.39 ( 0.00%) 2567.04 (13.15%) 2048 3974.20 ( 0.00%) 4114.70 ( 3.41%) 3312 5619.89 ( 0.00%) 5800.09 ( 3.11%) 4096 6460.45 ( 0.00%) 6702.45 ( 3.61%) 8192 9580.24 ( 0.00%) 9927.97 ( 3.50%) 16384 13259.14 ( 0.00%) 13493.88 ( 1.74%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 2960.17 2540.14 Total Elapsed Time (seconds) 3554.10 3050.10 NETPERF TCP netperf-tcp tcp-swapnbd vanilla-slab v1r17-slab 64 1230.29 ( 0.00%) 1273.17 ( 3.37%) 128 2309.97 ( 0.00%) 2375.22 ( 2.75%) 256 3659.32 ( 0.00%) 3704.87 ( 1.23%) 1024 7267.80 ( 0.00%) 7251.02 (-0.23%) 2048 8358.26 ( 0.00%) 8204.74 (-1.87%) 3312 8631.07 ( 0.00%) 8637.62 ( 0.08%) 4096 8770.95 ( 0.00%) 8704.08 (-0.77%) 8192 9749.33 ( 0.00%) 9769.06 ( 0.20%) 16384 11151.71 ( 0.00%) 11135.32 (-0.15%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 1245.04 1619.89 Total Elapsed Time (seconds) 1250.66 1622.18 Here is the equivalent test for SLUB NETPERF UDP netperf-udp udp-swapnbd vanilla-slub v1r17-slub 64 180.83 ( 0.00%) 183.68 ( 1.55%) 128 357.29 ( 0.00%) 367.11 ( 2.67%) 256 679.64 ( 0.00%)* 724.03 ( 6.13%) 1.15% 1.00% 1024 2343.40 ( 0.00%)* 2610.63 (10.24%) 1.68% 1.00% 2048 3971.53 ( 0.00%) 4102.21 ( 3.19%)* 1.00% 1.40% 3312 5677.04 ( 0.00%) 5748.69 ( 1.25%) 4096 6436.75 ( 0.00%) 6549.41 ( 1.72%) 8192 9698.56 ( 0.00%) 9808.84 ( 1.12%) 16384 13337.06 ( 0.00%) 13404.38 ( 0.50%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 2880.15 2180.13 Total Elapsed Time (seconds) 3458.10 2618.09 NETPERF TCP netperf-tcp tcp-swapnbd vanilla-slub v1r17-slub 64 1256.79 ( 0.00%) 1287.32 ( 2.37%) 128 2308.71 ( 0.00%) 2371.09 ( 2.63%) 256 3672.03 ( 0.00%) 3771.05 ( 2.63%) 1024 7245.08 ( 0.00%) 7261.60 ( 0.23%) 2048 8315.17 ( 0.00%) 8244.14 (-0.86%) 3312 8611.43 ( 0.00%) 8616.90 ( 0.06%) 4096 8711.64 ( 0.00%) 8695.97 (-0.18%) 8192 9795.71 ( 0.00%) 9774.11 (-0.22%) 16384 11145.48 ( 0.00%) 11225.70 ( 0.71%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 1345.05 1425.06 Total Elapsed Time (seconds) 1350.61 1430.66 Time to completion varied a lot but this can happen with netperf as it tries to find results within a sufficiently high confidence. I wouldn't read too much into the performance gains of netperf-udp as it can sometimes be affected by code just shuffling around for whatever reason. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 16*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches, the machine locks up within minutes and runs to completion with them applied. Comments? drivers/block/nbd.c | 7 +- include/linux/gfp.h | 7 +- include/linux/mm_types.h | 8 ++ include/linux/mmzone.h | 1 + include/linux/sched.h | 7 ++ include/linux/skbuff.h | 19 +++- include/linux/slub_def.h | 1 + include/linux/vm_event_item.h | 1 + include/net/sock.h | 19 ++++ kernel/softirq.c | 3 + mm/page_alloc.c | 57 ++++++++-- mm/slab.c | 240 +++++++++++++++++++++++++++++++++++------ mm/slub.c | 35 +++++- mm/vmscan.c | 58 ++++++++++ mm/vmstat.c | 1 + net/core/dev.c | 52 ++++++++- net/core/filter.c | 8 ++ net/core/skbuff.c | 95 ++++++++++++++--- net/core/sock.c | 42 +++++++ net/ipv4/tcp.c | 3 +- net/ipv4/tcp_output.c | 13 ++- net/ipv6/tcp_ipv6.c | 12 ++- 22 files changed, 603 insertions(+), 86 deletions(-) -- 1.7.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/