[FG:InPlacePodVerticalScaling] Performance degradation in latency-sensitive services due to CPU affinity loss upon guaranteed QoS Pod scaling down #131309
Comments
/cc
/sig node
@esotsal, @kad, @ffromani, @dchen1107, @mrunalp, @swatisehgal. Hello everyone, I have created an issue and two PRs for this latency-sensitive service request.
I think treating this as a bug is fine; it should be solved in the context of the current in-place-VPA KEP.
Thank you for your reply. Is the in-place-VPA KEP you mentioned 1287-in-place-update-pod-resources?
Yes, this is the KEP I meant.
No, we should not. Pods should not try to game the kubelet's decisions.
To add: I understand the motivation to implement a LIFO type of allocation for scaling, but that's not the right solution. Scaled "worker" threads inside workloads will not necessarily follow LIFO order; some threads added earlier might end sooner than ones added later. Determining what is "idle" can only be done with detailed per-vCPU usage stats, and that is an even bigger slippery slope for CPU allocation algorithms.
@kad Thanks for the insights, much appreciated.
That unfortunately might not be the case in some applications. But do you think this could be a feature that is only enabled behind a feature gate, in case anyone needs this kind of behaviour?
As I mentioned, LIFO is not a generic solution; we can't rely on it, and it would create one more implicit, "small print in the documentation" assumption that would become new tribal knowledge some app would be built on. We can guarantee only the CPU cores allocated at start, which are considered "static" and not removable without a container restart. That applies to scaling down as well: for the CPU manager implementation we can't scale below what was initially allocated for guaranteed QoS without breaking previous functionality, again because of previous "tribal knowledge". :( Modern apps running on modern kernels should treat CPU cores as dynamic resources that might disappear when scaling up and then down. That means that instead of hard affinity to a particular core, threads should affinitize to a group of cores (if they want, within smaller domains, e.g. within cache clusters). This allows the kernel to migrate tasks if some of the cores become unavailable, or to do runtime optimizations. In the past few years, due to the increased number of cores in modern processors, kernel folks have done a lot of optimizations for task migration between cores in various cases: wake-up from sleep, "overusage/quota", etc. We don't need to interfere with those optimizations where it is not needed.
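A minimal sketch of that suggestion, assuming Linux and golang.org/x/sys/unix (the CPU ids and the cache-cluster grouping are illustrative only): the worker affinitizes to a group of cores rather than to a single core, so the kernel is free to migrate the thread within the group if one of the cores is removed from the container's cpuset.

```go
// Hedged sketch: pin a worker thread to a *group* of CPUs instead of one core,
// so the kernel may migrate it if a core disappears. CPU ids are illustrative.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func pinWorkerToGroup(cpus []int) error {
	// Keep this goroutine on one OS thread so the affinity applies to it alone.
	runtime.LockOSThread()

	var set unix.CpuSet
	set.Zero()
	for _, c := range cpus {
		set.Set(c)
	}
	// pid 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	// Example: affinitize to CPUs 2, 3, 12, 13 as a group rather than to one core.
	if err := pinWorkerToGroup([]int{2, 3, 12, 13}); err != nil {
		fmt.Println("sched_setaffinity failed:", err)
		return
	}
	fmt.Println("worker affinitized to CPU group")
	// ... latency-sensitive worker loop would run here ...
}
```

With a group affinity like this, losing one core from the pod's cpuset costs a migration within the group rather than an unconstrained migration to an arbitrary CPU.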
@kad, thanks for your comments, but I am not clear on why per-vCPU usage stats are a bigger slippery slope. Could you explain the reason in more detail?
Because it opens another can of worms: "for how long was it idle?", "is less than 1% usage considered idle?", "was it user or system CPU usage in the last measurement period?", and many other assumptions about the future behaviour of the application, which in most cases will not really be predictable from current or past observations. Workloads that are really HW performance sensitive (and those are becoming less and less restrictive due to increased HW performance) usually depend on specifics of particular hardware, e.g. in network cards it is the number of HW queues. Those are usually pre-allocated at start and do not really scale up or down. Another example is polling external sensors for IoT/industrial use, which is also quite static. For all other cases, migrating a thread between vCPUs on the current generation of hardware is cheap, and it is possible to use a less strict CPU affinity to more than one vCPU in the app and let the kernel do its job.
@kad Perhaps it is not necessary to confirm that the CPU is really idle, as long as its utilization rate is the lowest. In our case, after stopping the processes on a specific CPU, its utilization becomes lower than the others', so we could remove that one.
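For illustration only, here is a hedged sketch of what that "pick the least-utilized vCPU" bookkeeping could look like from userspace on Linux, by sampling /proc/stat twice; the sampling window and the "lowest utilization" criterion are exactly the assumptions under debate here, not anything the kubelet does today.

```go
// Hedged sketch: sample per-CPU counters from /proc/stat and report the CPU
// with the lowest busy ratio over an arbitrary one-second window.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPUTimes returns, per CPU id, the (busy, total) jiffies from /proc/stat.
func readCPUTimes() (map[int][2]uint64, error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	out := map[int][2]uint64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 || !strings.HasPrefix(fields[0], "cpu") || fields[0] == "cpu" {
			continue // skip non-CPU lines and the aggregate "cpu" line
		}
		id, err := strconv.Atoi(strings.TrimPrefix(fields[0], "cpu"))
		if err != nil {
			continue
		}
		var total, idle uint64
		for i, v := range fields[1:] {
			n, _ := strconv.ParseUint(v, 10, 64)
			total += n
			if i == 3 || i == 4 { // idle + iowait columns
				idle += n
			}
		}
		out[id] = [2]uint64{total - idle, total}
	}
	return out, sc.Err()
}

func main() {
	before, _ := readCPUTimes()
	time.Sleep(1 * time.Second) // the measurement window is an arbitrary choice
	after, _ := readCPUTimes()

	lowest, lowestBusy := -1, 2.0
	for id, a := range after {
		b := before[id]
		dTotal := float64(a[1] - b[1])
		if dTotal == 0 {
			continue
		}
		busy := float64(a[0]-b[0]) / dTotal
		if busy < lowestBusy {
			lowest, lowestBusy = id, busy
		}
	}
	fmt.Printf("least-utilized CPU in this window: cpu%d (%.1f%% busy)\n", lowest, lowestBusy*100)
}
```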
@kad
As I mentioned above, trying to guess the internal behaviour of apps from the orchestration layer is not really productive. Adding new assumptions (like LIFO) will lead to even more unpredictable behaviour in apps that start to rely on such assumed behaviours. The kernel is more efficient at handling thread migrations based on the quotas allowed for a process via cgroups. Thread migration within an exclusive subset of CPU cores is cheap and efficient on modern HW/kernels.
/triage accepted
/cc @pravk03
Hi! Just to add my two cents: recently I have been doing some tests with network packet manipulation and capture under different scaling scenarios. Even on pretty mid-range desktop systems, replaying general internet-browsing packet capture dumps, I was able to show packet loss when switching CPU cores. I wasn't looking deeply into L1/L2 cache invalidation, but rather at the overall performance dip when doing the switch. Losing microseconds to a few milliseconds may have no noticeable impact on most applications, but it has a linear impact on scenarios built around a data stream, like packet capture: the faster the stream, the bigger the loss. Now consider carrier-grade traffic rates, where for example you want to log every packet with the 'new' flag to keep a record of user activity: when the data rate is < 100 Mbps there is almost no loss, but I repeatedly saw a loss of a few hundred packets at rates > 1 Gbps (~80-100 packets). In scenarios like this it is considerable data loss. And that was using DPDK (I assume PF_RING would be the same); the loss was there, and increasing caches of course only works up to a certain point. So my point is that there are scenarios where pinning to a core / setting affinity is the only option to achieve high quality of service, and there are surely other applications/scenarios that need that feature set. I wholeheartedly agree with two other concerns that were raised here: that dropping in a portion of highly specific code risks being a huge pain to maintain without domain-specific knowledge, and the general architectural approach.
What happened?
For a latency-sensitive service in a guaranteed QoS Pod, each worker has core affinity with a CPU.

When the workload decreases, some workers are removed and those workers' affinity CPUs become idle.
When the Pod scales down, these idle CPUs are expected to be removed.
If the busy CPUs running the busy workers are removed instead, performance degrades for latency-sensitive services.
For example:
At the beginning, 2 CPUs (CPU 1 and CPU 11) are allocated to the Pod at Pod creation.
When the workload increases, the Pod scales up: 4 additional CPUs (CPU 2, 12, 3, 13) are allocated to the Pod, 4 more workers are added, and each worker's core affinity is set to one of those CPUs.
When the workload decreases, worker 4 and worker 5 are removed, and CPU 3 and CPU 13 become idle.

Ideal case:
When the Pod scales down, the idle CPU 3 and CPU 13 are expected to be removed, and the workers are not affected.
Non-ideal case:

When the Pod scales down, the busy CPU 2 and CPU 12 are removed; worker 2 and worker 3 are affected by CPU migration due to the loss of their CPU affinity (see the worker sketch below).
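To make the scenario above concrete, here is a minimal sketch (not the reporter's actual code; CPU ids mirror the example, using golang.org/x/sys/unix on Linux) of a worker model where each worker locks itself to an OS thread and sets hard affinity to exactly one CPU.

```go
// Hedged sketch of the per-worker hard-affinity model described in the example.
package main

import (
	"fmt"
	"runtime"
	"sync"

	"golang.org/x/sys/unix"
)

func worker(id, cpu int, wg *sync.WaitGroup) {
	defer wg.Done()
	runtime.LockOSThread() // keep this goroutine on one OS thread

	var set unix.CpuSet
	set.Zero()
	set.Set(cpu) // hard affinity to exactly one CPU
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		fmt.Printf("worker %d: affinity to CPU %d failed: %v\n", id, cpu, err)
		return
	}
	fmt.Printf("worker %d pinned to CPU %d\n", id, cpu)
	// ... latency-sensitive work loop; if this CPU is later removed from the
	// pod's cpuset, the kernel must migrate the thread and the pin is lost ...
}

func main() {
	cpus := []int{2, 12, 3, 13} // CPUs added on scale-up in the example
	var wg sync.WaitGroup
	for i, c := range cpus {
		wg.Add(1)
		go worker(i+2, c, &wg) // workers 2..5 pinned to CPUs 2, 12, 3, 13
	}
	wg.Wait()
}
```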
What did you expect to happen?
When the Pod scales down, the idle CPU 3 and CPU 13 are removed, and the busy CPUs are not removed.
So we need some method or option to let the container tell the kubelet which CPUs should be kept.
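Absent such an option today, one possible workaround (a hedged sketch, assuming cgroup v2 mounted at /sys/fs/cgroup inside the container) is for the container to read its effective cpuset after a resize and re-affinitize workers to whatever CPUs actually remain, along the lines suggested in the comments above.

```go
// Hedged sketch: discover which CPUs the container still has after a resize,
// assuming cgroup v2 with cpuset.cpus.effective visible at /sys/fs/cgroup.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUList parses a cpuset list such as "1,3-5,11" into CPU ids.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		if lo, hi, found := strings.Cut(part, "-"); found {
			a, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			b, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for c := a; c <= b; c++ {
				cpus = append(cpus, c)
			}
		} else {
			c, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	raw, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective")
	if err != nil {
		fmt.Println("could not read effective cpuset:", err)
		return
	}
	cpus, err := parseCPUList(string(raw))
	if err != nil {
		fmt.Println("could not parse cpuset:", err)
		return
	}
	fmt.Println("CPUs currently available to this container:", cpus)
	// A worker manager could compare this set against its pinned CPUs and
	// re-affinitize any worker whose CPU has disappeared.
}
```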
How can we reproduce it (as minimally and precisely as possible)?
Test based on #129719
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)