HPA doesn't scale down to minReplicas even though metric is under target #78761
Comments
@kubernetes/sig-autoscaling-bugs
@max-rocket-internet: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
If I scale down manually:
And these are the resources specified in the deployment:

```yaml
resources:
  limits:
    cpu: 2048m
    memory: 4Gi
  requests:
    cpu: 2048m
    memory: 4Gi
```
Can you attach a log to this issue? Thanks.
Hi @tedyu, I am using AWS EKS, so the only HPA-related log entries I can see are like this, and nothing more:

@tedyu Is there some other way I can get more debug information?
There are logs at higher verbosity, e.g. (not that this would be logged in your cluster). See if you can turn up the verbosity.
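For anyone reading along on a self-managed cluster: the HPA controller runs inside kube-controller-manager, and its log verbosity is the standard klog `--v` flag. A minimal sketch, assuming a kubeadm-style static pod manifest (only the verbosity-related lines are shown; the path and image tag are illustrative, and this is not an option on EKS, where the control plane is managed):

```yaml
# Relevant fragment of /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm layout, path illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
    - name: kube-controller-manager
      image: registry.k8s.io/kube-controller-manager:v1.28.0  # illustrative tag
      command:
        - kube-controller-manager
        - --v=4  # raise klog verbosity; HPA decision details are logged at higher levels
        # ...keep the cluster's existing flags here...
```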
@tedyu Thanks for the suggestion, but I don't think we have that option on EKS as it's a managed K8s service: https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html. I guess I have to chase it up with AWS support?
Our deployment strategy could also be relevant:

```yaml
spec:
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate
```
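An observation on that, as an assumption rather than anything confirmed here: with `maxSurge: 100%`, every rollout briefly doubles the pod count, and the brand-new pods have no metrics yet; as discussed further down in this thread, the HPA treats pods with missing metrics conservatively, so a rollout like this could nudge it toward scaling up or staying scaled up.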
Just look at these events and missing
I'm having the exact same issue as you @max-rocket-internet, also running on EKS with their latest version available to date. This is frustrating :(
@vdemonchy There may be sudden bursts of traffic that sometimes push CPU utilization to 100%; in that case, it won't scale down.
This is not the case.
Are the pods all ready? If there are any missing metrics, the average is recomputed more conservatively, assuming those pods were consuming 100% of the desired value in the case of a scale down, and 0% in the case of a scale up.
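A worked example of that rule with illustrative numbers: suppose 4 replicas, a 40% CPU target, 3 pods reporting 20% and 1 new pod with no metrics yet. For the scale-down check, the missing pod is assumed to be at 100%, so the average is (20 + 20 + 20 + 100) / 4 = 40%, and `desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization) = ceil(4 * 40 / 40) = 4`; the HPA therefore holds at 4 replicas even though the pods that do report metrics are well under target.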
@max-rocket-internet Try increasing your metrics resolution from the default. I was experiencing similar behavior; I added the flag. As @SocietyCao said, in my case it appears that the HPA was rapidly scaling up my service, creating a bunch of pods that didn't have any metrics yet, which in turn caused the HPA to assume the pods were under load. It seems like it can create a feedback loop of sorts.
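For reference (the exact flag used above isn't preserved in the quote): metrics-server's scrape interval is controlled by its `--metric-resolution` flag. A hypothetical sketch of the relevant fragment of the metrics-server Deployment, with an illustrative value rather than the one the commenter used:

```yaml
# Fragment of the metrics-server Deployment in kube-system; only the args section is shown.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: metrics-server
          args:
            - --metric-resolution=30s  # how often metrics are collected; 30s is illustrative
            # ...keep the deployment's existing args...
```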
We are seeing the same behavior. Has this issue been resolved?
@wxwang33 What is your metrics-server resolution set to? That fixed it for me (on 1.14.6).
I will check later as I don't have direct access to it. Will update, and thanks for the quick response!
I am having the same issue. At first I thought it was not scaling down, but as time went by (exactly 6 minutes), the HPA scaled down the pods.
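That delay matches the controller's default downscale stabilization window of 5 minutes (the `--horizontal-pod-autoscaler-downscale-stabilization` flag on kube-controller-manager) plus a sync period. On clusters where the autoscaling/v2 (or v2beta2) API is available, the window can also be tuned per HPA; a minimal sketch with illustrative names and values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-1  # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-1
  minReplicas: 2
  maxReplicas: 4
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60  # default is 300 (5 minutes)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 40
```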
Still an issue.
This is also an issue for us with the HPA when using 2 metrics. We haven't tried setting the threshold to double the expected size to see if this makes a difference.
OT - which dashboard is that?
Based on my experience, the key factor is that the metric is based on the request, not the limit. The condition that matters is usage vs. request, together with the scaling criteria.
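To put numbers on that using the resources quoted earlier in the thread: with `requests.cpu: 2048m`, a pod using about 820m of CPU is at roughly 40% utilization, which sits right at a 40% target regardless of the 2048m limit, because the HPA computes resource utilization against the request, not the limit.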
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Seeing this in kubernetes 1.21 when using custom metrics. The metric drops below target and the HPA responds by scaling up.
I am facing the same issue; scale down is not working for me:

```
NAME   READY   STATUS   RESTARTS   AGE
NAME   REFERENCE   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
```
Having the same issues with some HPA in 1.21 using API
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
This is still a valid issue.
/reopen
@h0jeZvgoxFepBQ2C: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Could someone reopen this issue? I'm not allowed to do it.
Yeah, this issue needs to be reopened, as this is a blocker for using memory HPA in Kubernetes.
Any news?
@liggitt @wojtek-t @pohly @smarterclayton Could any of you reopen this issue, maybe?
Is there any update on this?
@markandersontrocme Still got the same issue on Kubernetes 1.24 with
So much has changed in Kubernetes since I opened this issue, but I guess some things never change: I have this issue again 😅
It seems some others also have the same problem. Rather than reopen an old issue with tonnes of comments, I've created a new one to start fresh with
What happened:
HPA scales to `Spec.MaxReplicas` even though the metric is always under target.

Here's the HPA in YAML:
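Based on the `kubectl autoscale` command in the reproduction steps below, an equivalent autoscaling/v1 manifest would look roughly like this hypothetical sketch (the metadata name is an assumption taken from that command, not the original YAML):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-1  # assumed to mirror the target Deployment's name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-1
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 40
```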
And here's a description output:
What you expected to happen:
HPA only scales up when the metric is above target, and scales down when it is under target, until `Spec.MinReplicas` is reached.

How to reproduce it (as minimally and precisely as possible):
I'm not sure. We have 9 HPAs and only one has this problem. I can't see anything unique about this HPA when comparing it to the others. If I delete and recreate the HPA using Helm, same problem. Also, if I recreate the HPA using `kubectl autoscale Deployment/my-app-1 --min=2 --max=4 --cpu-percent=40`, same problem.

Environment:
- Kubernetes version (`kubectl version`): v1.12.6-eks-d69f1b
- OS (`cat /etc/os-release`): EKS AMI release v20190327
- Kernel (`uname -a`): 4.14.104-95.84.amzn2.x86_64
- 0.3.2