Best-effort topology mgr policy doesn't give best-effort CPU NUMA alignment #106270
Comments
/sig node
/cc @klueska
/triage accepted
as per @fromanirh this is a legitimate request, but more like a feature
/remove-kind bug
yup. As mentioned in the PR linked to this issue, I'll have a deep look ASAP.
What happened?
Use case:
A DPDK application uses VLAN trunking on SR-IOV NICs and requires dedicated SR-IOV NICs. For cost reasons there is only one SR-IOV NIC per server, but to exploit the CPU resources optimally, the application needs to run one single-NUMA pod per CPU socket. For these pods, CPUs and hugepages must be allocated from the same NUMA node, while the SR-IOV device may be allocated from the NIC on the remote NUMA node if necessary.
Problem description:
K8s bare-metal node with CPU topology:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
The single SR-IOV NIC is on NUMA 0.
Kubelet is configured with
• CPU manager policy "static"
• Topology manager policy "best-effort"
• reserved_cpus: 0,1,40,41
The application creates two Guaranteed QoS DPDK pods requesting 32 CPUs each. The remaining 6 CPUs per NUMA node are meant to be used by best-effort and burstable QoS pods.
The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.
Unfortunately, this is not what happens: the CPU manager assigns CPUs 2-32,42-72 on NUMA node 0 to the first pod, and the remaining CPUs 34-38,74-78 on NUMA node 0 plus CPUs 3-25,43-65 on NUMA node 1 to the second pod, thus breaking the DPDK application, which requires single-NUMA CPU allocation.
What did you expect to happen?
The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.
How can we reproduce it (as minimally and precisely as possible)?
See above. Create two Guaranteed QoS pods with integer CPU requests and an SR-IOV device request from an SR-IOV network device pool that is available on only one NUMA node, such that the pods cannot fit on the same NUMA node, but a single pod does not fully occupy the NUMA node that hosts the SR-IOV NIC either.
Anything else we need to know?
Analysis:
The problem is that for the second pod (which should land on NUMA node 1) the CPU manager offers the topology hints [10 (preferred), 11 (not preferred)]. The affinity bit strings enumerate the NUMA nodes from right to left, i.e. the rightmost bit is NUMA node 0. The device manager's hint is [01 (preferred)]. The topology manager unconditionally merges these into a best hint of 01 (not preferred). It does so by iterating over the cross product of all provider hints and taking the bitwise AND of the affinity masks. For non-zero results, the preferred flag is set to true if and only if all combined provider hints were preferred. In our case the only non-zero affinity mask is 11 & 01 = 01, and it is not preferred (see the sketch below).
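To make the merge concrete, here is a minimal, self-contained Go sketch of the behavior described above. It is not the actual pkg/kubelet/cm/topologymanager code; the `Hint` type and function names are illustrative only.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Hint models a topology hint: a NUMA affinity bitmask
// (bit 0 = NUMA node 0, bit 1 = NUMA node 1) plus a Preferred flag.
type Hint struct {
	Affinity  uint64
	Preferred bool
}

// merge iterates over the cross product of the providers' hint lists,
// bitwise-ANDs the affinity masks, and keeps the best non-zero result:
// preferred beats not preferred, then fewer NUMA nodes wins. A merged
// candidate is preferred only if every hint in the combination was preferred.
func merge(providers [][]Hint) (Hint, bool) {
	var best Hint
	found := false
	var walk func(i int, acc uint64, pref bool)
	walk = func(i int, acc uint64, pref bool) {
		if i == len(providers) {
			if acc == 0 {
				return // no common NUMA node in this combination
			}
			cand := Hint{Affinity: acc, Preferred: pref}
			if !found || better(cand, best) {
				best, found = cand, true
			}
			return
		}
		for _, h := range providers[i] {
			walk(i+1, acc&h.Affinity, pref && h.Preferred)
		}
	}
	walk(0, ^uint64(0), true)
	return best, found
}

func better(a, b Hint) bool {
	if a.Preferred != b.Preferred {
		return a.Preferred
	}
	return bits.OnesCount64(a.Affinity) < bits.OnesCount64(b.Affinity)
}

func main() {
	cpuHints := []Hint{{Affinity: 0b10, Preferred: true}, {Affinity: 0b11, Preferred: false}}
	devHints := []Hint{{Affinity: 0b01, Preferred: true}}
	merged, _ := merge([][]Hint{cpuHints, devHints})
	// Prints "merged hint: 01 preferred=false": the only non-zero AND is
	// 11 & 01 = 01, and it is not preferred because the 11 hint was not.
	fmt.Printf("merged hint: %02b preferred=%v\n", merged.Affinity, merged.Preferred)
}
```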
With the topology manager policies "single-numa-node" or "restricted", the topology manager would reject pod admission outright. With the "best-effort" policy it admits the pod and returns the computed best hint 01 (not preferred) to the CPU manager and device manager for their resource allocations. Hence the CPU manager starts allocating CPUs from NUMA node 0 and, since there are not enough, fills up the rest from NUMA node 1. Note that the best hint 01 is not even among the hints supplied by the CPU manager in the first place.
Proposal:
If there is no preferred best hint, the topology manager with the "best-effort" policy should instead return to each provider a preferred hint from that provider's original hint list. For the device manager that would be 01; for the CPU manager it would be 10. That way, each resource owner could do its best to guarantee NUMA locality for its own resources, as sketched below.
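The following sketch of the proposed fallback reuses the `Hint` type from the snippet above; the function name is again illustrative, not an existing topology manager API.

```go
// bestEffortHintsForAllocation decides which hint each provider receives for
// resource allocation under the proposed best-effort behavior.
// If the merged hint is preferred, every provider gets the merged hint, as today.
// Otherwise each provider falls back to the first preferred hint from its own
// original list, so in the example above the CPU manager would get 10 and the
// device manager would get 01.
func bestEffortHintsForAllocation(providers [][]Hint, merged Hint) []Hint {
	out := make([]Hint, len(providers))
	for i, hints := range providers {
		out[i] = merged
		if merged.Preferred {
			continue
		}
		for _, h := range hints {
			if h.Preferred {
				out[i] = h
				break
			}
		}
	}
	return out
}
```

With this fallback, the CPU manager could keep the second pod's CPUs entirely on NUMA node 1, and only the SR-IOV VF would be allocated from the remote NUMA node.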
We will provide a corresponding PR to open the discussion on how to improve the best-effort behavior of the topology manager.
Kubernetes version
Cloud provider
none
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)