[FG:InPlacePodVerticalScaling] Performance degradation in latency-sensitive services due to CPU affinity loss upon guaranteed QoS Pod scaling down #131309
Comments
/cc
/sig node
@esotsal, @kad, @ffromani, @dchen1107, @mrunalp, @swatisehgal. Hello everyone, I have created an issue and two PRs for this latency-sensitive service request.
I think treating this as a bug is fine; it should be solved in the context of the current in-place-VPA KEP.
Thank you for your reply. Is the in-place-VPA KEP you mentioned 1287-in-place-update-pod-resources?
Yes, this is the KEP I meant.
No, we should not. Pods should not try to game the kubelet's decisions.
To add: I understand the motivation to implement a LIFO type of allocation for scaling, but that's not the right solution. Scaled "worker" threads inside workloads will not necessarily follow LIFO order; some threads added earlier might end sooner than ones added later. Determining what is "idle" can only be done with detailed per-vCPU usage stats, and that is an even bigger slippery slope for CPU allocation algorithms.
@kad Thanks for the insights, much appreciated.
That unfortunately might not be the case in some applications. But do you think this could be a feature that is only enabled behind a feature gate, in case anyone needs this kind of behaviour?
As I mentioned, LIFO is not a generic solution; we can't rely on it, and it would create one more implicit, "small print in the documentation" assumption that would become new tribal knowledge some app would be built on. We can guarantee only the CPU cores allocated at start, which are considered "static" and not removable without a container restart. That applies to scaling down as well: for the CPU manager implementation we can't scale below what was initially allocated for guaranteed QoS without breaking previous functionality, again because of previous "tribal knowledge". :( Modern apps running on modern kernels should treat CPU cores as dynamic resources that might disappear when scaling up and then down. That means that instead of hard affinity to a particular core, threads should affinitize to a group of cores (if they want, within smaller domains, e.g. within cache clusters). This allows the kernel to migrate tasks if some of the cores become unavailable, or to do runtime optimizations. In the past few years, due to the increased number of cores in modern processors, kernel folks have done a lot of optimizations for task migration between cores in various cases: wake-up from sleep, "overusage/quota", etc. We don't need to interfere with those optimizations where it is not needed.
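A minimal sketch of that suggestion, assuming Linux and golang.org/x/sys/unix (the CPU ids and the cache-cluster grouping are illustrative only): the worker affinitizes to a group of cores rather than to a single core, so the kernel is free to migrate the thread within the group if one of the cores is removed from the container's cpuset.

```go
// Hedged sketch: pin a worker thread to a *group* of CPUs instead of one core,
// so the kernel may migrate it if a core disappears. CPU ids are illustrative.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func pinWorkerToGroup(cpus []int) error {
	// Keep this goroutine on one OS thread so the affinity applies to it alone.
	runtime.LockOSThread()

	var set unix.CpuSet
	set.Zero()
	for _, c := range cpus {
		set.Set(c)
	}
	// pid 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	// Example: affinitize to CPUs 2, 3, 12, 13 as a group rather than to one core.
	if err := pinWorkerToGroup([]int{2, 3, 12, 13}); err != nil {
		fmt.Println("sched_setaffinity failed:", err)
		return
	}
	fmt.Println("worker affinitized to CPU group")
	// ... latency-sensitive worker loop would run here ...
}
```

With a group affinity like this, losing one core from the pod's cpuset costs a migration within the group rather than an unconstrained migration to an arbitrary CPU.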
@kad, thanks for your comments, but I am not clear on why per-vCPU usage stats are a bigger slippery slope. Could you explain the reason in more detail?
Because it opens another can of worms: "for how long was it idle?", "is less than 1% usage considered idle?", "was it user or system CPU usage in the last measurement period?", and many other assumptions about the future behaviour of the application, which in most cases will not really be predictable from current or past observations. Workloads that are really HW performance sensitive (and those are becoming less and less restrictive due to increased HW performance) usually depend on specifics of particular hardware, e.g. in network cards it is the number of HW queues. Those are usually pre-allocated at start and do not really scale up or down. Another example is polling external sensors for IoT/industrial use, which is also quite static. For all other cases, migrating a thread between vCPUs on the current generation of hardware is cheap, and it is possible to use a less strict CPU affinity to more than one vCPU in the app and let the kernel do its job.
@kad Perhaps it is not necessary to confirm that the CPU is really idle, as long as its utilization rate is the lowest. In our case, after stopping the processes on a specific CPU, its utilization becomes lower than the others', so we could remove that one.
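For illustration only, here is a hedged sketch of what that "pick the least-utilized vCPU" bookkeeping could look like from userspace on Linux, by sampling /proc/stat twice; the sampling window and the "lowest utilization" criterion are exactly the assumptions under debate here, not anything the kubelet does today.

```go
// Hedged sketch: sample per-CPU counters from /proc/stat and report the CPU
// with the lowest busy ratio over an arbitrary one-second window.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPUTimes returns, per CPU id, the (busy, total) jiffies from /proc/stat.
func readCPUTimes() (map[int][2]uint64, error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	out := map[int][2]uint64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 || !strings.HasPrefix(fields[0], "cpu") || fields[0] == "cpu" {
			continue // skip non-CPU lines and the aggregate "cpu" line
		}
		id, err := strconv.Atoi(strings.TrimPrefix(fields[0], "cpu"))
		if err != nil {
			continue
		}
		var total, idle uint64
		for i, v := range fields[1:] {
			n, _ := strconv.ParseUint(v, 10, 64)
			total += n
			if i == 3 || i == 4 { // idle + iowait columns
				idle += n
			}
		}
		out[id] = [2]uint64{total - idle, total}
	}
	return out, sc.Err()
}

func main() {
	before, _ := readCPUTimes()
	time.Sleep(1 * time.Second) // the measurement window is an arbitrary choice
	after, _ := readCPUTimes()

	lowest, lowestBusy := -1, 2.0
	for id, a := range after {
		b := before[id]
		dTotal := float64(a[1] - b[1])
		if dTotal == 0 {
			continue
		}
		busy := float64(a[0]-b[0]) / dTotal
		if busy < lowestBusy {
			lowest, lowestBusy = id, busy
		}
	}
	fmt.Printf("least-utilized CPU in this window: cpu%d (%.1f%% busy)\n", lowest, lowestBusy*100)
}
```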
@kad
As I mentioned above, trying to guess the internal behaviour of apps from the orchestration layer is not really productive. Adding new assumptions (like LIFO) will lead to even more unpredictable behaviour in apps that start to rely on such assumed behaviours. The kernel is more efficient at handling thread migrations based on the quotas allowed for a process via cgroups. Thread migration within an exclusive subset of CPU cores is cheap and efficient on modern HW/kernels.
/triage accepted
/cc @pravk03
Hi! Just to add my two cents: recently I have been doing some tests with network packet manipulation and capture under different scaling scenarios. Even on pretty mid-range desktop systems, replaying general internet-browsing packet capture dumps, I was able to show packet loss when switching CPU cores. I wasn't looking deeply into L1/L2 cache invalidation, but rather at the overall performance dip when doing the switch. Losing microseconds to a few milliseconds may have no noticeable impact on most applications, but it has a linear impact on scenarios built around a data stream, like packet capture: the faster the stream, the bigger the loss. Now consider carrier-grade traffic rates, where for example you want to log every packet with the 'new' flag to keep a record of user activity: when the data rate is < 100 Mbps there is almost no loss, but I repeatedly saw a loss of a few hundred packets at rates > 1 Gbps (~80-100 packets). In scenarios like this it is considerable data loss. And that was using DPDK (I assume PF_RING would be the same); the loss was there, and increasing caches of course only works up to a certain point. So my point is that there are scenarios where pinning to a core / setting affinity is the only option to achieve high quality of service, and there are surely other applications/scenarios that need that feature set. I wholeheartedly agree with two other concerns that were raised here: that dropping in a portion of highly specific code risks being a huge pain to maintain without domain-specific knowledge, and the general architectural approach.
What happened?
For a latency-sensitive service in a guaranteed QoS Pod, each worker has core affinity with a CPU.

When the workload decreases, some workers are removed and those workers' affinity CPUs become idle.
When the Pod scales down, these idle CPUs are expected to be removed.
If the busy CPUs running the busy workers are removed instead, performance degrades for latency-sensitive services.
For example:
At the beginning, 2 CPUs (CPU 1 and CPU 11) are allocated to the Pod at Pod creation.
When the workload increases, the Pod scales up: 4 additional CPUs (CPU 2, 12, 3, 13) are allocated to the Pod, 4 more workers are added, and each worker's core affinity is set to one of those CPUs.
When the workload decreases, worker 4 and worker 5 are removed, and CPU 3 and CPU 13 become idle.

Ideal case:
When the Pod scales down, the idle CPU 3 and CPU 13 are expected to be removed, and the workers are not affected.
Non-ideal case:

When the Pod scales down, the busy CPU 2 and CPU 12 are removed; worker 2 and worker 3 are affected by CPU migration due to the loss of their CPU affinity (see the worker sketch below).
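To make the scenario above concrete, here is a minimal sketch (not the reporter's actual code; CPU ids mirror the example, using golang.org/x/sys/unix on Linux) of a worker model where each worker locks itself to an OS thread and sets hard affinity to exactly one CPU.

```go
// Hedged sketch of the per-worker hard-affinity model described in the example.
package main

import (
	"fmt"
	"runtime"
	"sync"

	"golang.org/x/sys/unix"
)

func worker(id, cpu int, wg *sync.WaitGroup) {
	defer wg.Done()
	runtime.LockOSThread() // keep this goroutine on one OS thread

	var set unix.CpuSet
	set.Zero()
	set.Set(cpu) // hard affinity to exactly one CPU
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		fmt.Printf("worker %d: affinity to CPU %d failed: %v\n", id, cpu, err)
		return
	}
	fmt.Printf("worker %d pinned to CPU %d\n", id, cpu)
	// ... latency-sensitive work loop; if this CPU is later removed from the
	// pod's cpuset, the kernel must migrate the thread and the pin is lost ...
}

func main() {
	cpus := []int{2, 12, 3, 13} // CPUs added on scale-up in the example
	var wg sync.WaitGroup
	for i, c := range cpus {
		wg.Add(1)
		go worker(i+2, c, &wg) // workers 2..5 pinned to CPUs 2, 12, 3, 13
	}
	wg.Wait()
}
```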
What did you expect to happen?
When the Pod scales down, the idle CPU 3 and CPU 13 are removed, and the busy CPUs are not removed.
So we need some method or option to let the container tell the kubelet which CPUs should be kept.
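Absent such an option today, one possible workaround (a hedged sketch, assuming cgroup v2 mounted at /sys/fs/cgroup inside the container) is for the container to read its effective cpuset after a resize and re-affinitize workers to whatever CPUs actually remain, along the lines suggested in the comments above.

```go
// Hedged sketch: discover which CPUs the container still has after a resize,
// assuming cgroup v2 with cpuset.cpus.effective visible at /sys/fs/cgroup.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUList parses a cpuset list such as "1,3-5,11" into CPU ids.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		if lo, hi, found := strings.Cut(part, "-"); found {
			a, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			b, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for c := a; c <= b; c++ {
				cpus = append(cpus, c)
			}
		} else {
			c, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	raw, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective")
	if err != nil {
		fmt.Println("could not read effective cpuset:", err)
		return
	}
	cpus, err := parseCPUList(string(raw))
	if err != nil {
		fmt.Println("could not parse cpuset:", err)
		return
	}
	fmt.Println("CPUs currently available to this container:", cpus)
	// A worker manager could compare this set against its pinned CPUs and
	// re-affinitize any worker whose CPU has disappeared.
}
```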
How can we reproduce it (as minimally and precisely as possible)?
Test based on #129719
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)