HPA wrongly assumes that terminated pods have an utilization of 100% #129866

Open
jm-franc opened this issue Jan 28, 2025 · 15 comments · May be fixed by #129868
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@jm-franc
Contributor

What happened?

A pod that terminated was considered by the HPA controller to be at its target utilization.

The controller logic (1, 2) treats pods for which the utilization metric could not be obtained from the metrics API conservatively: on a potential scale-down they are assumed to be using 100% of their request (or to be at the target utilization, whichever is higher), and on a potential scale-up they are assumed to be using 0.
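For reference, below is a minimal sketch of that fallback, assuming it is applied after the usage ratio (current/target) has been computed. The function and type names (applyMissingPodFallback, podMetric) are illustrative, not the actual identifiers in pkg/controller/podautoscaler/replica_calculator.go.

```go
// Minimal sketch of the fallback applied to pods with missing metrics.
// Names, types, and signatures are illustrative, not the real ones.
package sketch

// podMetric stands in for the per-pod metric value used by the calculator
// (milli-units of the resource, e.g. CPU millicores).
type podMetric struct {
	Value int64
}

// applyMissingPodFallback fills in assumed values for pods whose metric could
// not be fetched from the metrics API. usageRatio is current/target utilization.
func applyMissingPodFallback(metrics map[string]podMetric, requests map[string]int64,
	missingPods map[string]struct{}, usageRatio float64, targetUtilization int64) {
	if len(missingPods) == 0 {
		return
	}
	if usageRatio < 1.0 {
		// Potential scale-down: assume a missing pod uses 100% of its request,
		// or the target utilization if that is higher. This is the assumption
		// that makes a terminated pod look like it is "at target" and blocks
		// further scale-down.
		fallback := targetUtilization
		if fallback < 100 {
			fallback = 100
		}
		for name := range missingPods {
			metrics[name] = podMetric{Value: requests[name] * fallback / 100}
		}
	} else if usageRatio > 1.0 {
		// Potential scale-up: assume a missing pod uses 0% of its request.
		for name := range missingPods {
			metrics[name] = podMetric{Value: 0}
		}
	}
}
```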

What did you expect to happen?

I expected the controller to assume that a terminated pod has a utilization of 0.

This is already correctly handled for pods that terminated with a failure, but the case where a pod terminated successfully isn't handled.

How can we reproduce it (as minimally and precisely as possible)?

Create a Deployment with pods that terminate (without a failure) and observe that an HPA targeting this Deployment will assume that the terminated pods are at target utilization.

Anything else we need to know?

Handling the case where the pod is terminated normally here will fix this.
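To illustrate, here is a sketch of the kind of check this implies, assuming the pod-grouping step already excludes deleted and failed pods by inspecting their phase. The helper name shouldIgnorePod is hypothetical; the actual change is tracked in the linked PR.

```go
// Illustrative sketch only: treat PodSucceeded the same way PodFailed is
// already treated when deciding which pods to exclude from the calculation.
package sketch

import v1 "k8s.io/api/core/v1"

// shouldIgnorePod reports whether a pod should be excluded from the HPA's
// utilization calculation entirely, so no fallback value is assumed for it.
func shouldIgnorePod(pod *v1.Pod) bool {
	if pod.DeletionTimestamp != nil {
		return true
	}
	switch pod.Status.Phase {
	case v1.PodFailed:
		// Already excluded today.
		return true
	case v1.PodSucceeded:
		// The missing case this issue describes: a pod that terminated
		// normally has no live metrics and should not be assumed to be at
		// its target utilization.
		return true
	}
	return false
}
```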

Kubernetes version

$ kubectl version
v1.29.10-gke.1280000

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux rodete"
NAME="Debian GNU/Linux rodete"
VERSION_CODENAME=rodete

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

None
@jm-franc jm-franc added the kind/bug Categorizes issue or PR as related to a bug. label Jan 28, 2025
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 28, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jm-franc
Contributor Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 28, 2025
@Aaina26
Contributor

Aaina26 commented Jan 29, 2025

Hi... Can you try to reproduce this on the latest version? Also, maybe someone from SIG Cloud Provider can help.
/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Jan 29, 2025
@Aaina26
Contributor

Aaina26 commented Jan 29, 2025

This issue seems similar to #129228 and #120875.

It looks like it hasn't been resolved yet, but it would still be better to reproduce it on the latest versions to be sure.

@elmiko
Contributor

elmiko commented Jan 29, 2025

we are reviewing this issue in the SIG Cloud Provider office hours this week; we have a couple of questions:

  • is there an indication that this problem is related to a specific cloud provider or infrastructure?
    • we see that GKE is the listed cloud, is this occurring on other clouds?
  • in reviewing the issue it is not clear how sig cloud provider might be able to help, @Aaina26 could you add a little more context about the cloud provider reference?
  • is there any indication that this issue is happening in other kubernetes versions? (e.g. can it be reproduced in a more recent version)
  • do we know if the proposed fix PR solves the issue? (see Fix HPA controller assuming terminated pods have a 100% utilization #129868)

@Aaina26
Contributor

Aaina26 commented Jan 30, 2025

Hi..

  • According to HPA stuck at maxReplicas even though metric under target #120875, this issue is also being faced on AWS EKS.
  • Since I don't have access to any cloud provider, I mentioned cloud-provider so that someone with access can verify this.
  • Regarding the PR linked to this, maybe @jm-franc can explain.
  • Also, I am a relatively new contributor, so I can try to verify this on a local k8s (latest version) cluster; that might help clear things up.

@adrianmoisey
Member

This doesn't seem like a cloud-provider issue, as it appears to be limited to the HPA only.

@adrianmoisey
Member

Going to remove sig cloud-provider, as I don't believe this is related
We can add it back if needed

/remove-sig cloud-provider

@k8s-ci-robot k8s-ci-robot removed the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Jan 30, 2025
@cmotta2016

We are observing similar behavior with the nginx-controller Deployment. For some reason, after an update, some pods ended up in the Completed status, and the HPA 'counted' these pods, preventing further scale-down.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2025
@jm-franc
Contributor Author

jm-franc commented May 8, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2025
@omerap12
Member

omerap12 commented May 9, 2025

I can take a look.
/assign

@omerap12
Member

omerap12 commented May 9, 2025

Oh sorry, I see you already started working on that, @jm-franc. Are you planning to keep going with it? If not, I'm happy to take it over.

@jm-franc
Contributor Author

jm-franc commented May 9, 2025

Oh thanks Omer! I've been busy with other things but I think I'll have time to finish this soonish.

@omerap12
Member

omerap12 commented May 9, 2025

> Oh thanks Omer! I've been busy with other things but I think I'll have time to finish this soonish.

Great, thanks!
/unassign
/assign @jm-franc

@k8s-ci-robot k8s-ci-robot assigned jm-franc and unassigned omerap12 May 9, 2025