Best-effort topology mgr policy doesn't give best-effort CPU NUMA alignment #106270
Comments
/sig node
/cc @klueska
/triage accepted
as per @fromanirh this is a legitimate request, but more like a feature
/remove-kind bug
yup. As mentioned in the PR linked to this issue, I'll have a deep look ASAP.
What happened?
Use case:
A DPDK application uses VLAN trunking on SR-IOV NICs and requires dedicated SR-IOV NICs. For cost reasons there is only one SR-IOV NIC per server, but to exploit the CPU resources optimally, the application needs to run one single-NUMA pod per CPU socket. For these pods, CPUs and hugepages must be allocated from the same NUMA node, while the SR-IOV device may be allocated from the NIC on the remote NUMA node if necessary.
Problem description:
K8s bare-metal node with CPU topology:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
The single SR-IOV NIC is on NUMA 0.
Kubelet is configured with
• CPU manager policy "static"
• Topology manager policy "best-effort"
• reserved_cpus: 0,1,40,41
The application creates two Guaranteed QoS DPDK pods requesting 32 CPUs each. The remaining 6 CPUs per NUMA node are meant to be used by best-effort and burstable QoS pods.
The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.
Unfortunately, this is not what happens: the CPU manager assigns CPUs 2-32,42-72 on NUMA node 0 to the first pod, and the remaining CPUs 34-38,74-78 on NUMA node 0 plus CPUs 3-25,43-65 on NUMA node 1 to the second pod, thus breaking the DPDK application, which requires single-NUMA CPU allocation.
What did you expect to happen?
The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.
How can we reproduce it (as minimally and precisely as possible)?
See above. Create two Guaranteed QoS pods with integer CPU requests and an SR-IOV device request from an SR-IOV network device pool that is available on only one NUMA node, such that the pods cannot fit on the same NUMA node, but a single pod does not fully occupy the NUMA node that hosts the SR-IOV NIC either.
Anything else we need to know?
Analysis:
The problem is that for the second pod (which should land on NUMA node 1) the CPU manager offers the topology hints [10 (preferred), 11 (not preferred)]. The affinity bit strings enumerate the NUMA nodes from right to left, i.e. the rightmost bit is NUMA node 0. The device manager's hint is [01 (preferred)]. The topology manager unconditionally merges these into a best hint of 01 (not preferred). It does so by iterating over the cross product of all provider hints and taking the bitwise AND of the affinity masks. For non-zero results, the preferred flag is set to true if and only if all combined provider hints were preferred. In our case the only non-zero affinity mask is 11 & 01 = 01, and it is not preferred (see the sketch below).
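To make the merge concrete, here is a minimal, self-contained Go sketch of the behavior described above. It is not the actual pkg/kubelet/cm/topologymanager code; the `Hint` type and function names are illustrative only.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Hint models a topology hint: a NUMA affinity bitmask
// (bit 0 = NUMA node 0, bit 1 = NUMA node 1) plus a Preferred flag.
type Hint struct {
	Affinity  uint64
	Preferred bool
}

// merge iterates over the cross product of the providers' hint lists,
// bitwise-ANDs the affinity masks, and keeps the best non-zero result:
// preferred beats not preferred, then fewer NUMA nodes wins. A merged
// candidate is preferred only if every hint in the combination was preferred.
func merge(providers [][]Hint) (Hint, bool) {
	var best Hint
	found := false
	var walk func(i int, acc uint64, pref bool)
	walk = func(i int, acc uint64, pref bool) {
		if i == len(providers) {
			if acc == 0 {
				return // no common NUMA node in this combination
			}
			cand := Hint{Affinity: acc, Preferred: pref}
			if !found || better(cand, best) {
				best, found = cand, true
			}
			return
		}
		for _, h := range providers[i] {
			walk(i+1, acc&h.Affinity, pref && h.Preferred)
		}
	}
	walk(0, ^uint64(0), true)
	return best, found
}

func better(a, b Hint) bool {
	if a.Preferred != b.Preferred {
		return a.Preferred
	}
	return bits.OnesCount64(a.Affinity) < bits.OnesCount64(b.Affinity)
}

func main() {
	cpuHints := []Hint{{Affinity: 0b10, Preferred: true}, {Affinity: 0b11, Preferred: false}}
	devHints := []Hint{{Affinity: 0b01, Preferred: true}}
	merged, _ := merge([][]Hint{cpuHints, devHints})
	// Prints "merged hint: 01 preferred=false": the only non-zero AND is
	// 11 & 01 = 01, and it is not preferred because the 11 hint was not.
	fmt.Printf("merged hint: %02b preferred=%v\n", merged.Affinity, merged.Preferred)
}
```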
With the topology manager policies "single-numa-node" or "restricted", the topology manager would reject pod admission outright. With the "best-effort" policy it admits the pod and returns the computed best hint 01 (not preferred) to the CPU manager and device manager for their resource allocations. Hence the CPU manager starts allocating CPUs from NUMA node 0 and, since there are not enough, fills up the rest from NUMA node 1. Note that the best hint 01 is not even among the hints supplied by the CPU manager in the first place.
Proposal:
If there is no preferred best hint, the topology manager with the "best-effort" policy should instead return to each provider a preferred hint from that provider's original hint list. For the device manager that would be 01; for the CPU manager it would be 10. That way, each resource owner could do its best to guarantee NUMA locality for its own resources, as sketched below.
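The following sketch of the proposed fallback reuses the `Hint` type from the snippet above; the function name is again illustrative, not an existing topology manager API.

```go
// bestEffortHintsForAllocation decides which hint each provider receives for
// resource allocation under the proposed best-effort behavior.
// If the merged hint is preferred, every provider gets the merged hint, as today.
// Otherwise each provider falls back to the first preferred hint from its own
// original list, so in the example above the CPU manager would get 10 and the
// device manager would get 01.
func bestEffortHintsForAllocation(providers [][]Hint, merged Hint) []Hint {
	out := make([]Hint, len(providers))
	for i, hints := range providers {
		out[i] = merged
		if merged.Preferred {
			continue
		}
		for _, h := range hints {
			if h.Preferred {
				out[i] = h
				break
			}
		}
	}
	return out
}
```

With this fallback, the CPU manager could keep the second pod's CPUs entirely on NUMA node 1, and only the SR-IOV VF would be allocated from the remote NUMA node.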
We will provide a corresponding PR to open the discussion on how to improve the best-effort behavior of the topology manager.
Kubernetes version
Cloud provider
none
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)