HPA stuck at maxReplicas even though metric under target #120875
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig autoscaling
So is this the old classic HPA issue of the deployment selector labels matching pods outside of that deployment? e.g.:
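As a hedged sketch of the scenario being described, with hypothetical names: two Deployments whose pods both carry app: web, while web-a's selector matches on that shared label alone. Because the HPA gathers pod metrics via the target's label selector, an HPA pointed at web-a would also average in web-b's pods.

```yaml
# Hypothetical example of the overlapping-selector scenario.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web                 # also matches web-b's pods
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: nginx:1.27
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
      component: worker
  template:
    metadata:
      labels:
        app: web               # overlaps with web-a's selector
        component: worker
    spec:
      containers:
        - name: app
          image: nginx:1.27
```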
If it is caused by
As I said here (3 years ago!):
I think this comment is still valid, because from a user perspective a deployment is specified by name but underneath the pods are selected by something (potentially) totally different. This is kind of an obfuscation. At a minimum, do you think a feature request to show this situation loud and clear in the status of the HPA is warranted?
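For reference, the autoscaling/v2 HorizontalPodAutoscaler already surfaces status conditions like the illustrative snippet below (field names are from the real API, values are made up); the feature request discussed here would presumably add a condition or event flagging that the selector matches pods outside the target workload.

```yaml
# Illustrative autoscaling/v2 HPA status; values are invented.
status:
  currentReplicas: 10
  desiredReplicas: 10
  conditions:
    - type: AbleToScale
      status: "True"
      reason: ReadyForNewScale
    - type: ScalingActive
      status: "True"
      reason: ValidMetricFound
    - type: ScalingLimited
      status: "True"
      reason: TooManyReplicas
      message: the desired replica count is more than the maximum replica count
```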
@max-rocket-internet I've been following #78761 since I experienced the same problem before, and I agree that this should be treated as a bug. While the selector conflict may be seen as a configuration bug on the part of the operator, and easily worked around by ensuring a minimal set of unique selectors between all
I did a quick look at the code and found that the autoscaler controller uses
The side-effect I can think of with the addition of
What worked for me to resolve the issue with the HPA not scaling down, despite the CPU/memory utilization being below target, was to remove
Commenting on my previous statement: the moment the deployment scales up, the HPA does not scale it down until the
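For context, the two comments above appear to refer to the autoscaling/v2 behavior stanza. A minimal sketch of that fragment, with illustrative numbers rather than a recommendation, showing the scale-down stabilization window and policies that can delay or block downscaling:

```yaml
# Fragment of an autoscaling/v2 HorizontalPodAutoscaler spec; numbers are illustrative.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait this long before acting on lower metrics
      policies:
        - type: Percent
          value: 50                     # remove at most 50% of current replicas...
          periodSeconds: 60             # ...per 60-second window
    scaleUp:
      stabilizationWindowSeconds: 0
```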
I solved my problem like this
So it's not resolved for you then.
I get the same problem with an HPA scaling based on memory.
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
All the running pods have memory under the target and it still doesn't scale down. I also tried wpferreira01's workaround, even though I have only one policy per scale type, but it didn't help.
I got the same issue.
To maybe help the conversation: I was investigating this matter as well for the last 4 hours. I'm using minikube and I'm fairly new to k8s. In my case I have an app that scales on CPU utilization only. It needed 5-10 minutes to scale back to minimum replicas, but eventually it did.
I am also having a similar problem, but I am using custom metrics from prometheus-adapter. After scaling to the max pods, it is not scaling down, even though the metric is currently zero, below the threshold defined in the HPA. I also checked the label issue @max-rocket-internet was talking about; I don't have any other pods with the same labels. Here is my HPA spec:
Deployment:
The curl request below indicates that the custom metric value is zero:
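For illustration, a generic autoscaling/v2 HPA driven by a pods metric exposed through prometheus-adapter looks roughly like the sketch below; the metric and object names are hypothetical, not the commenter's actual spec.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_messages_ready   # custom metric served by prometheus-adapter
        target:
          type: AverageValue
          averageValue: "30"
```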
We have the same problem on K8s 1.25 using HPA autoscaling/v2. The OP is on 1.25 also. Is anyone on 1.26+ having the issue as well? The only solution we could find was to disambiguate our
The saddest part about this workaround is that
I moved from k8s to EKS on AWS, with 3 clusters: one on 1.28 and two on 1.29.
Please reach out to AWS/EKS support, @IgalSc.
@dims
FWIW, our AWS Support rep pointed us to this thread. LOL
But, to be fair, AWS contributes a lot of code to the K8s upstream: https://chat.openai.com/share/52255931-cf9a-4a60-a450-730b2bb10220. We will escalate this within AWS and report back.
I don't know if this is the case with AWS support... I'm using Azure and the same thing happens. Throughout the thread we have reproductions in minikube, so I imagine it is something in Kubernetes itself.
Roger that. Maybe we can "prod" AWS to submit a fix. Meanwhile, I guess we have to perform a workaround similar to what I am proposing in #120875 (comment).
For ZERO downtime on Production I guess you can try
Imagine doing that for 50 microservices on Production on both your primary and failover clusters? Just typing this out makes me wanna cry.
## Description of change
[crm457-1508](https://dsdmoj.atlassian.net/browse/CRM457-1508)
## Notes for reviewer
[Trying to differentiate between deployments for hpa to work properly](kubernetes/kubernetes#120875 (comment))
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
This is still observed in K8s v1.29 and 1.30. There are no ambiguous label selectors in deployments affected by this, so some of the earlier comments in this thread about label matching do not apply.
Leaving a trace here... got the same issue with this in GKE 1.28. The fix is to update the deployments
Hopefully this helps and clarifies things for others jumping in here recently...
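For illustration, the fix generally described in this thread is to give each Deployment a unique selector and matching pod labels, so the HPA's metric query only matches that Deployment's own pods. A hedged sketch with hypothetical names:

```yaml
# Add a label that is unique per Deployment to both the selector and the pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-a
spec:
  selector:
    matchLabels:
      app: web
      app.kubernetes.io/instance: web-a   # unique per Deployment
  template:
    metadata:
      labels:
        app: web
        app.kubernetes.io/instance: web-a
    spec:
      containers:
        - name: app
          image: nginx:1.27
```

Note that spec.selector on an existing Deployment is immutable, which is why earlier comments discuss recreate or blue-green style workarounds to avoid downtime.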
Has anyone found a workaround or fix for this? I upgraded my Kubernetes to v1.30 and conducted load testing on the environment, and the autoscaler started acting weird, scaling down and up on a schedule even though the metrics don't seem to be over the threshold. At first I thought it was a metrics-server issue, but I upgraded it and also tried scaling policies, without luck fixing the issue.
I did it using KEDA autoscaling instead of the default HPA.
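KEDA creates and manages the underlying HPA for you. A minimal ScaledObject sketch, assuming KEDA v2 is installed; the names and threshold here are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-scaledobject
spec:
  scaleTargetRef:
    name: web            # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"      # target average CPU utilization in percent
```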
Wow. KEDA. Thanks, @gsGabriel. Here is a YouTube video from the DevOps Toolkit about KEDA: https://youtu.be/3lcaawKAv6s?si=qWZ2as6AixzH_6EN
Thank you, @gsGabriel and @ccmcbeck, for the quick reply. I want to hint at the root cause of my issue here for anyone who may be stuck like me. Thank you again, and I hope this will be helpful for others.
Hello! I am having this issue with resource type memory. With resource type cpu, it is scaling up and down normally.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Spoke with @adrianmoisey (who also raised this in #124307) and will open a KEP to explore how to solve this issue.
/remove-lifecycle stale
What happened?
HPA does not reduce Deployment replica count even though the resource metric is below target. It is stuck at maxReplicas.

What did you expect to happen?
Deployment replica count should be reduced.

How can we reproduce it (as minimally and precisely as possible)?
We can see multiple examples in our clusters but not sure how to reproduce it exactly. Here's some relevant kubectl output:
The Deployment:
The HorizontalPodAutoscaler:

Anything else we need to know?
No response

Kubernetes version

Cloud provider

OS version
No response

Install tools
No response

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)
No response