HPA wrongly assumes that terminated pods have an utilization of 100% #129866

Open
jm-franc opened this issue Jan 28, 2025 · 15 comments · May be fixed by #129868
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@jm-franc
Contributor

What happened?

A pod that terminated was considered by the HPA controller to be at its target utilization.

The controller logic (1, 2) treats pods for which the utilization metric could not be obtained from the metrics API conservatively: on a potential scale-down they are assumed to be using 100% of their request (or to be at the target utilization, whichever is higher), and on a potential scale-up they are assumed to be using 0.
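For reference, below is a minimal sketch of that fallback, assuming it is applied after the usage ratio (current/target) has been computed. The function and type names (applyMissingPodFallback, podMetric) are illustrative, not the actual identifiers in pkg/controller/podautoscaler/replica_calculator.go.

```go
// Minimal sketch of the fallback applied to pods with missing metrics.
// Names, types, and signatures are illustrative, not the real ones.
package sketch

// podMetric stands in for the per-pod metric value used by the calculator
// (milli-units of the resource, e.g. CPU millicores).
type podMetric struct {
	Value int64
}

// applyMissingPodFallback fills in assumed values for pods whose metric could
// not be fetched from the metrics API. usageRatio is current/target utilization.
func applyMissingPodFallback(metrics map[string]podMetric, requests map[string]int64,
	missingPods map[string]struct{}, usageRatio float64, targetUtilization int64) {
	if len(missingPods) == 0 {
		return
	}
	if usageRatio < 1.0 {
		// Potential scale-down: assume a missing pod uses 100% of its request,
		// or the target utilization if that is higher. This is the assumption
		// that makes a terminated pod look like it is "at target" and blocks
		// further scale-down.
		fallback := targetUtilization
		if fallback < 100 {
			fallback = 100
		}
		for name := range missingPods {
			metrics[name] = podMetric{Value: requests[name] * fallback / 100}
		}
	} else if usageRatio > 1.0 {
		// Potential scale-up: assume a missing pod uses 0% of its request.
		for name := range missingPods {
			metrics[name] = podMetric{Value: 0}
		}
	}
}
```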

What did you expect to happen?

I expected the controller to assume that a terminated pod has a utilization of 0.

This is already correctly handled for pods that terminated with a failure, but the case where a pod terminated successfully isn't handled.

How can we reproduce it (as minimally and precisely as possible)?

Create a Deployment with pods that terminate (without a failure) and observe that an HPA targeting this Deployment will assume that the terminated pods are at target utilization.

Anything else we need to know?

Handling the case where the pod is terminated normally here will fix this.
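To illustrate, here is a sketch of the kind of check this implies, assuming the pod-grouping step already excludes deleted and failed pods by inspecting their phase. The helper name shouldIgnorePod is hypothetical; the actual change is tracked in the linked PR.

```go
// Illustrative sketch only: treat PodSucceeded the same way PodFailed is
// already treated when deciding which pods to exclude from the calculation.
package sketch

import v1 "k8s.io/api/core/v1"

// shouldIgnorePod reports whether a pod should be excluded from the HPA's
// utilization calculation entirely, so no fallback value is assumed for it.
func shouldIgnorePod(pod *v1.Pod) bool {
	if pod.DeletionTimestamp != nil {
		return true
	}
	switch pod.Status.Phase {
	case v1.PodFailed:
		// Already excluded today.
		return true
	case v1.PodSucceeded:
		// The missing case this issue describes: a pod that terminated
		// normally has no live metrics and should not be assumed to be at
		// its target utilization.
		return true
	}
	return false
}
```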

Kubernetes version

$ kubectl version
v1.29.10-gke.1280000

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux rodete"
NAME="Debian GNU/Linux rodete"
VERSION_CODENAME=rodete

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

None
@jm-franc jm-franc added the kind/bug Categorizes issue or PR as related to a bug. label Jan 28, 2025
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 28, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jm-franc
Contributor Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 28, 2025
@Aaina26
Contributor

Aaina26 commented Jan 29, 2025

Hi... Can you try to reproduce this on the latest version? Also, maybe someone from SIG Cloud Provider can help.
/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Jan 29, 2025
@Aaina26
Contributor

Aaina26 commented Jan 29, 2025

This issue seems similar to #129228 and #120875.

It looks like it hasn't been resolved yet, but it would still be better to reproduce it on the latest versions to be sure.

@elmiko
Contributor

elmiko commented Jan 29, 2025

we are reviewing this issue in the SIG Cloud Provider office hours this week; we have a couple of questions:

  • is there an indication that this problem is related to a specific cloud provider or infrastructure?
    • we see that GKE is the listed cloud, is this occurring on other clouds?
  • in reviewing the issue it is not clear how sig cloud provider might be able to help, @Aaina26 could you add a little more context about the cloud provider reference?
  • is there any indication that this issue is happening in other kubernetes versions? (e.g. can it be reproduced in a more recent version)
  • do we know if the proposed fix PR solves the issue? (see Fix HPA controller assuming terminated pods have a 100% utilization #129868)

@Aaina26
Contributor

Aaina26 commented Jan 30, 2025

Hi..

  • According to HPA stuck at maxReplicas even though metric under target #120875, this issue is also being faced on AWS EKS.
  • Since I don't have access to any cloud provider, I mentioned cloud-provider so that someone with access can verify this.
  • Regarding the PR linked to this, maybe @jm-franc can explain.
  • Also, I am a relatively new contributor, so I can try to verify this on a local k8s (latest version) cluster; that might help clear things up.

@adrianmoisey
Member

This doesn't seem like a cloud-provider issue, as it appears to be limited to the HPA only.

@adrianmoisey
Member

Going to remove sig cloud-provider, as I don't believe this is related
We can add it back if needed

/remove-sig cloud-provider

@k8s-ci-robot k8s-ci-robot removed the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Jan 30, 2025
@cmotta2016

We are observing similar behavior with the nginx-controller Deployment. For some reason, after an update, some pods ended up in the Completed status, and the HPA 'counted' these pods, preventing further scale-down.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2025
@jm-franc
Contributor Author

jm-franc commented May 8, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2025
@omerap12
Member

omerap12 commented May 9, 2025

I can take a look.
/assign

@omerap12
Member

omerap12 commented May 9, 2025

Oh sorry, I see you already started working on that, @jm-franc. Are you planning to keep going with it? If not, I'm happy to take it over.

@jm-franc
Contributor Author

jm-franc commented May 9, 2025

Oh thanks Omer! I've been busy with other things but I think I'll have time to finish this soonish.

@omerap12
Member

omerap12 commented May 9, 2025

> Oh thanks Omer! I've been busy with other things but I think I'll have time to finish this soonish.

Great, thanks!
/unassign
/assign @jm-franc

@k8s-ci-robot k8s-ci-robot assigned jm-franc and unassigned omerap12 May 9, 2025