
Best-effort topology mgr policy doesn't give best-effort CPU NUMA alignment #106270


Open
JanScheurich opened this issue Nov 9, 2021 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@JanScheurich

What happened?

Use case:

A DPDK application uses VLAN trunking on SR-IOV NICs and requires dedicated SR-IOV NICs. For cost reasons there is only one SR-IOV NIC per server, but to exploit the CPU resources optimally, the application needs to run one single-NUMA pod per CPU socket. For these pods, CPUs and hugepages must be allocated from the same NUMA node, while the SR-IOV device may be allocated from the NIC on the remote NUMA node if necessary.

Problem description:

K8s bare-metal node with the following CPU topology:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
The single SR-IOV NIC is on NUMA 0.

Kubelet is configured with the following settings (a sketch of the config file follows the list):
• CPU manager policy "static"
• Topology manager policy "best-effort"
• reserved_cpus: 0,1,40,41
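
For reference, a minimal KubeletConfiguration sketch of these settings; the three field names are the standard kubelet config options, while the exact file layout on the node is an assumption:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static            # pin integer-CPU Guaranteed QoS containers
topologyManagerPolicy: best-effort
reservedSystemCPUs: "0,1,40,41"     # matches the reserved CPUs listed above
```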

The application creates two Guaranteed QoS DPDK pods requesting 32 CPUs each. The remaining 6 CPUs per NUMA node are meant to be used by best-effort and burstable QoS pods.

The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.

Unfortunately, this is not what happens: the CPU manager assigns CPUs 2-32,42-72 on NUMA 0 to the first pod, and the remaining CPUs 34-38,74-78 on NUMA 0 plus CPUs 3-25,43-65 on NUMA 1 to the second pod, thus breaking the DPDK application, which requires single-NUMA CPUs.

What did you expect to happen?

The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.

How can we reproduce it (as minimally and precisely as possible)?

See above. Create two Guaranteed QoS pods with integer CPU requests and an SR-IOV device request from an SR-IOV network device pool that exists on one NUMA node only, such that the two pods won't fit on the same NUMA node, but one pod alone doesn't fully occupy the NUMA node that hosts the SR-IOV NIC.
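
A hedged example of such a pod; the image and the intel.com/sriov_netdevice resource name are placeholders for whatever the cluster's SR-IOV device plugin actually exposes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-pod-1
spec:
  containers:
  - name: dpdk
    image: example/dpdk-app:latest      # placeholder image
    resources:
      requests:                         # requests == limits => Guaranteed QoS
        cpu: "32"
        memory: 4Gi
        hugepages-1Gi: 8Gi
        intel.com/sriov_netdevice: "1"  # placeholder SR-IOV resource name
      limits:
        cpu: "32"
        memory: 4Gi
        hugepages-1Gi: 8Gi
        intel.com/sriov_netdevice: "1"
```

Creating two such pods reproduces the split CPU allocation for the second one.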

Anything else we need to know?

Analysis:

The problem is that for the second pod (to be landed on NUMA 1) the CPU manager offers the topology hints [10 (preferred), 11 (not preferred)]; the affinity bit strings enumerate the NUMA nodes right to left. The device manager's hint is [01 (preferred)]. The topology manager unconditionally merges these into a best hint of 01 (not preferred). It does so by iterating over the cross-product of all provider hints and doing a bitwise AND of the affinity masks; for non-zero results, the preferred status is set to true if and only if all combined provider hints were preferred. In our case the only non-zero affinity mask is 11 & 01 = 01, and it is not preferred.

With the topology manager policies "single-numa-node" or "restricted", the topology manager would reject pod admission outright. With the "best-effort" policy it admits the pod and returns the computed "best hint" 01 (not preferred) to the CPU and device managers for their resource allocations. Hence the CPU manager starts allocating CPUs from NUMA 0 and, since there are not enough, fills up the rest from NUMA 1. Note that the "best hint" 01 is not even among the options supplied by the CPU manager in the first place.
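
To make the merge step concrete, here is a minimal, self-contained Go sketch of the behavior described above. The type and function names are simplified stand-ins for the real topology manager code, which additionally favors narrower affinity masks:

```go
package main

import "fmt"

// TopologyHint pairs a NUMA affinity bitmask (bit 0 = NUMA node 0, i.e. the
// bit strings are enumerated right to left) with a preferred flag.
type TopologyHint struct {
	Affinity  uint64
	Preferred bool
}

// mergeHints walks the cross-product of the per-provider hint lists and ANDs
// the affinity masks. A combination is preferred only if every provider hint
// in it was preferred. (The real code also prefers narrower masks; that
// detail is omitted here.)
func mergeHints(providers [][]TopologyHint) TopologyHint {
	best := TopologyHint{}
	var walk func(i int, acc TopologyHint)
	walk = func(i int, acc TopologyHint) {
		if i == len(providers) {
			if acc.Affinity == 0 {
				return // no common NUMA node in this combination
			}
			if best.Affinity == 0 || (acc.Preferred && !best.Preferred) {
				best = acc
			}
			return
		}
		for _, h := range providers[i] {
			walk(i+1, TopologyHint{
				Affinity:  acc.Affinity & h.Affinity,
				Preferred: acc.Preferred && h.Preferred,
			})
		}
	}
	walk(0, TopologyHint{Affinity: ^uint64(0), Preferred: true})
	return best
}

func main() {
	cpuHints := []TopologyHint{{Affinity: 0b10, Preferred: true}, {Affinity: 0b11, Preferred: false}}
	devHints := []TopologyHint{{Affinity: 0b01, Preferred: true}}
	fmt.Printf("%+v\n", mergeHints([][]TopologyHint{cpuHints, devHints}))
	// Prints {Affinity:1 Preferred:false}: the merged "best hint" is 01,
	// not preferred, exactly as in the analysis above.
}
```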

Proposal:

If there is no preferred best hint, the topology manager with the "best-effort" policy should return to each provider one preferred hint from the original list it had received from that provider: for the device manager that would be 01, for the CPU manager 10. That way, each resource owner could do its best to guarantee NUMA locality among its own resources.
We will provide a corresponding PR to open the discussion on how to improve the best-effort behavior of the topology manager.
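
A minimal sketch of the proposed fallback, reusing the types from the sketch above; the function name and structure are illustrative, not the actual PR:

```go
// bestEffortHints sketches the proposal: when the merged hint is not
// preferred, hand each provider back one of its own preferred hints (e.g.
// 10 for the CPU manager, 01 for the device manager) instead of the merged
// hint, so each resource owner can keep its own allocation NUMA-aligned.
func bestEffortHints(providers [][]TopologyHint) []TopologyHint {
	merged := mergeHints(providers)
	out := make([]TopologyHint, len(providers))
	for i, hints := range providers {
		out[i] = merged // default: everyone gets the merged best hint
		if merged.Preferred {
			continue
		}
		for _, h := range hints {
			if h.Preferred {
				out[i] = h // fall back to the provider's own preferred hint
				break
			}
		}
	}
	return out
}
```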

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"841a4f4f3d4528aa284171074e00503faea18496", GitTreeState:"clean", BuildDate:"2021-08-31T06:52:44Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

none

OS version

# On Linux:
$ cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"

$ uname -a
Linux control-plane-n108-mast-n027 5.3.18-24.75.3.22886.0.PTF.1187468-default #1 SMP Thu Sep 9 23:24:48 UTC 2021 (37ce29d) x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@JanScheurich JanScheurich added the kind/bug Categorizes issue or PR as related to a bug. label Nov 9, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2021
@JanScheurich
Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 9, 2021
@ffromani
Contributor

/cc @klueska

@SergeyKanzhelev
Member

/triage accepted

As per @fromanirh, this is a legitimate request, but more like a feature.

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Nov 17, 2021
@ffromani
Contributor

> /triage accepted
>
> As per @fromanirh, this is a legitimate request, but more like a feature.
>
> /remove-kind bug
> /kind feature

yup. As mentioned in the PR linked to this issue, I'll have a deep look ASAP.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2022
@zouyee
Member

zouyee commented Feb 17, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@JanScheurich
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 14, 2022
@JanScheurich
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 15, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@ffromani
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 22, 2024
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 21, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2025