HPA stuck at maxReplicas even though metric under target #120875

Open

max-rocket-internet opened this issue Sep 25, 2023 · 34 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@max-rocket-internet

max-rocket-internet commented Sep 25, 2023

What happened?

HPA does not reduce Deployment replica count even though resource metric is below target. It is stuck at maxReplicas.

What did you expect to happen?

Deployment replica count should be reduced.

How can we reproduce it (as minimally and precisely as possible)?

We can see multiple examples in our clusters, but we're not sure how to reproduce it exactly.

Here's some relevant kubectl output:

$ kubectl top pod -l app=application-one,country=ar
NAME                                                  CPU(cores)   MEMORY(bytes)
application-one-ar-76fd9bc76d-25wnr                   35m          909Mi
application-one-ar-76fd9bc76d-4d72r                   42m          778Mi
application-one-ar-76fd9bc76d-6pt7r                   35m          1189Mi
application-one-ar-76fd9bc76d-6z2mr                   29m          793Mi
application-one-ar-76fd9bc76d-cv6r9                   29m          837Mi
application-one-ar-76fd9bc76d-hrpd9                   32m          824Mi
application-one-ar-76fd9bc76d-mrgt8                   45m          1180Mi
application-one-ar-76fd9bc76d-qwjbs                   43m          1186Mi
application-one-ar-76fd9bc76d-sqf5h                   41m          797Mi
application-one-ar-76fd9bc76d-tlr6k                   39m          920Mi
application-one-collect-metrics-ar-7df9868cbf-h4db8   7m           595Mi


$ kubectl get deployment application-one-ar
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
application-one-ar   10/10   10           10          565d

$ kubectl get hpa application-one-ar
NAME                 REFERENCE                       TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
application-one-ar   Deployment/application-one-ar   20%/70%   3         10        10         565d

$ kubectl get pod -l app=application-one,country=ar
NAME                                             READY   STATUS      RESTARTS   AGE
application-one-ar-76fd9bc76d-25wnr              1/1     Running     0          3h56m
application-one-ar-76fd9bc76d-4d72r              1/1     Running     0          147m
application-one-ar-76fd9bc76d-6pt7r              1/1     Running     0          4d
application-one-ar-76fd9bc76d-6z2mr              1/1     Running     0          147m
application-one-ar-76fd9bc76d-cv6r9              1/1     Running     0          176m
application-one-ar-76fd9bc76d-hrpd9              1/1     Running     0          147m
application-one-ar-76fd9bc76d-mrgt8              1/1     Running     0          3d4h
application-one-ar-76fd9bc76d-qwjbs              1/1     Running     0          4d
application-one-ar-76fd9bc76d-sqf5h              1/1     Running     0          146m
application-one-ar-76fd9bc76d-tlr6k              1/1     Running     0          3h56m
application-one-clean-up-ar-28257720-96544       0/1     Completed   0          2d5h
application-one-clean-up-ar-28259160-zj2f4       0/1     Completed   0          29h
application-one-clean-up-ar-28260600-687q4       0/1     Completed   0          5h19m
application-one-xxxx-ar-7df9868cbf-h4db8         1/1     Running     0          2d11h
application-one-yyyyy-ar-28240560-5vtwj          0/1     Completed   0          14d
application-one-zzzzz-ar-28250640-sr28d          0/1     Completed   0          7d3h

The Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "115"
    meta.helm.sh/release-name: application-one-ar
    meta.helm.sh/release-namespace: default
  labels:
    app: application-one
    app.kubernetes.io/managed-by: Helm
    country: ar
    custom_app: application-one
    custom_country: ar
    custom_env: production
    custom_region: cloud-region-one
    environment: production
  name: application-one-ar
  namespace: default
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 10
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: application-one
      country: ar
  strategy:
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2023-06-25T14:57:33+02:00"
        prometheus.io/path: /prometheus
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: application-one
        country: ar
        custom_app: application-one
        custom_country: ar
        custom_env: production
        custom_region: cloud-region-one
        environment: production
        module: web
    spec:
      containers:
      - env: # removed
        imagePullPolicy: Always
        name: application-one-ar-production
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "3"
            memory: 1536M
          requests:
            cpu: 170m
            memory: 1536M
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30

status:
  availableReplicas: 10
  conditions:
  - lastTransitionTime: "2022-03-09T12:49:18Z"
    lastUpdateTime: "2023-09-21T14:24:09Z"
    message: ReplicaSet "application-one-ar-76fd9bc76d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-09-25T12:54:20Z"
    lastUpdateTime: "2023-09-25T12:54:20Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 8975
  readyReplicas: 10
  replicas: 10
  updatedReplicas: 10

The HorizontalPodAutoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: application-one-ar
    meta.helm.sh/release-namespace: default
  labels:
    app: application-one
    app.kubernetes.io/managed-by: Helm
    country: ar
    custom_app: application-one
    custom_country: ar
    custom_env: production
    custom_region: cloud-region-one
    environment: production
  name: application-one-ar
  namespace: default
spec:
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 70
        type: Utilization
    type: Resource
  minReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-one-ar
status:
  conditions:
  - lastTransitionTime: "2022-03-09T12:49:33Z"
    message: recommended size matches current size
    reason: ReadyForNewScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2023-09-21T14:22:44Z"
    message: the HPA was able to successfully calculate a replica count from cpu resource
      utilization (percentage of request)
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2023-09-25T10:52:44Z"
    message: the desired count is within the acceptable range
    reason: DesiredWithinRange
    status: "False"
    type: ScalingLimited
  currentMetrics:
  - resource:
      current:
        averageUtilization: 21
        averageValue: 33m
      name: cpu
    type: Resource
  currentReplicas: 10
  desiredReplicas: 10
  lastScaleTime: "2023-09-19T10:49:13Z"

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:21:56Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"darwin/arm64"}

Kustomize Version: v4.5.4

Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.12-eks-2d98532", GitCommit:"0aa16cf4fac4da27b9e9e9ba570b990867f6a3d8", GitTreeState:"clean", BuildDate:"2023-07-28T16:52:04Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS EKS

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

containerd://1.6.19

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@max-rocket-internet max-rocket-internet added the kind/bug Categorizes issue or PR as related to a bug. label Sep 25, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 25, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@max-rocket-internet
Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 25, 2023
@max-rocket-internet
Author

So is this the old classic HPA issue of the deployment selector labels matching pods outside of that deployment? e.g. application-one-xxxx-ar-7df9868cbf-h4db8 in the above output?
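
One way to check that hypothesis (a hedged sketch using the labels from the output above; adjust to your own selector): list every pod the selector matches and compare owners. The HPA controller lists pods using the label selector it gets from the Deployment's scale subresource, so anything else carrying app=application-one,country=ar (the collect-metrics, clean-up and other job pods above) is counted as well.

# Pods the HPA's metric calculation will consider:
kubectl get pods -l app=application-one,country=ar

# Which ReplicaSet/Job each of those pods actually belongs to:
kubectl get pods -l app=application-one,country=ar \
  -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name'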

@max-rocket-internet
Author

If it is caused by spec.selector.matchLabels labels of Deployment/application-one-ar matching pods outside of that deployment, then this is quite disappointing:

  • It's very poor UX to see the metric presented as way under the target but still desiredReplicas=maxReplicas. This makes no sense to a user.
  • That issue seems to be over 4 years old, with a constant stream of comments as well
  • I think it's mentioned in the docs here, but many users are unlikely to fully understand what is written and its implications

As I said here (3 years ago!):

The HPA scaleTargetRef uses deployment by name and that should be enough without having to worry about selectors.

I think this comment is still valid because from a user perspective a deployment is specified by name but underneath the pods are selected by something (potentially) totally different. This is kind of an obfuscation.

At a minimum, do you think a feature request to show this situation loud and clear in the status of the HorizontalPodAutoscaler would be valid?
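
For reference, the replica recommendation documented for the HPA is:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

Taken at face value, the numbers shown above (10 current replicas, 20% current vs. 70% target) would give ceil(10 * 20/70) = 3, i.e. scale down to minReplicas. That the status nevertheless reports desiredReplicas: 10 suggests the calculation is being fed a different pod set (or pods with missing metrics) than the TARGETS column implies, which is exactly the UX gap described in the bullets above.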

@rochacon

@max-rocket-internet I've been following #78761 since I experienced the same problem before and I agree that this should be treated as a bug.

While the selector conflict may be seen as a configuration bug on the operator's part, and is easily worked around by ensuring a unique set of selectors across all Deployments in a given namespace (e.g. I always include app.kubernetes.io/component so I can differentiate between several processes of the same application), the selection behavior of the HPA controller does not match the behavior of the Deployment/ReplicaSet controllers.

I took a quick look at the code and found that the autoscaler controller uses the Deployment's /scale API to retrieve the selector, which right now pretty much returns the Deployment's spec.selector. I believe this API should include the pod-template-hash label from the most recent fully progressed ReplicaSet in the .status.selector field. This would ensure the autoscaling controller only selects pods from the most recent healthy version, constrained to the referenced Deployment. I'm unsure whether pod-template-hash can end up with the same value in some cases.

The side effect I can think of from adding pod-template-hash to the scale selector is the HPA controller fighting the Deployment controller's replica count reductions during rollouts, which might take a while to complete on Deployments with a large number of replicas.
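
For anyone who wants to see exactly what the HPA controller is handed, the scale subresource can be read directly (a sketch; jq is only used for readability):

kubectl get --raw /apis/apps/v1/namespaces/default/deployments/application-one-ar/scale | jq .
# The controller uses .status.selector from this object as its pod label selector,
# which today is just the serialized spec.selector of the Deployment.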

@IgalSc

IgalSc commented Sep 28, 2023

What worked for me to resolve the issue of the HPA not scaling down, despite the CPU/memory utilization being below target, was to remove spec.replicas from the Deployment.

@IgalSc

IgalSc commented Sep 29, 2023

Commenting on my previous statement: the moment the deployment scales up, the HPA does not scale it down until spec.replicas is removed again.
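
If the replica count keeps coming back, one common cause is a manifest that still contains spec.replicas being re-applied (by CI, Helm, or kubectl apply), which resets whatever the HPA had chosen. A sketch of the migration step the Kubernetes docs describe for this, assuming client-side apply is in use:

# Remove spec.replicas from the last-applied-configuration annotation so the
# next `kubectl apply` no longer resets the replica count managed by the HPA:
kubectl apply edit-last-applied deployment/<your-deployment>
# ...delete the `replicas:` line in the editor, save, and also remove the field
# from the manifest in source control.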

@wpferreira01

I solved my problem like this:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60
    selectPolicy: Min

@max-rocket-internet
Author

max-rocket-internet commented Oct 4, 2023

Commenting on my previous statement: the moment the deployment scales up, the HPA does not scale it down until spec.replicas is removed again.

So it's not resolved for you then.

@BogdanGeorge

BogdanGeorge commented Oct 10, 2023

I get the same problem with an HPA scaling based on memory:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sessions-bus-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sessions-bus
  maxReplicas: 3
  minReplicas: 1
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 90
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 0
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60

NAME                      REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
sessions-bus-autoscaler   Deployment/sessions-bus   50%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   54%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   52%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   51%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   50%/90%   1         3         3          44h

All the running pods have memory under the target and it still doesn't scale down. I also tried wpferreira01's workaround, even though I have only one policy per scale type, but it didn't help.
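
When the displayed target looks fine but the replica count will not drop, the controller's own view is worth capturing before anything else (a generic check, not a confirmed diagnosis for this case):

kubectl describe hpa sessions-bus-autoscaler
# The Conditions block (AbleToScale, ScalingActive, ScalingLimited) and the Events
# list show the controller's reasoning for its most recent scaling decisions.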

@gsGabriel

I got the same issue.

@George-Spanos

To maybe help the conversation: I was investigating this matter as well for the last 4 hours. I'm using minikube and I'm fairly new to k8s. In my case I have an app that scales on CPU utilization only. It needed 5-10 minutes to scale back to minimum replicas, but eventually it did.
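
That delay is consistent with the default scale-down stabilization window of 300 seconds (5 minutes): the HPA only scales down to the highest recommendation made during that window. If faster scale-down is acceptable for a workload, the window can be shortened per HPA; a fragment of an autoscaling/v2 spec as a sketch:

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # default is 300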

@sharadregoti

I am also having a similar problem, but I am using custom metrics from the Prometheus adapter.

After scaling to max pods, it is not scaling down, even though the metric is currently zero, below the threshold defined in the HPA.

I have also checked the label issue @max-rocket-internet was talking about: I don't have any other pods with the same labels.

Here is my HPA spec:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling/v2","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"go-http-server-hpa","namespace":"default"},"spec":{"maxReplicas":10,"metrics":[{"pods":{"metric":{"name":"http_requests_per_second"},"target":{"averageValue":10,"type":"AverageValue"}},"type":"Pods"}],"minReplicas":1,"scaleTargetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"go-http-server"}}}
  creationTimestamp: "2024-01-28T06:57:36Z"
  name: go-http-server-hpa
  namespace: default
  resourceVersion: "47420"
  uid: 6464aa24-80da-4720-b409-51be262fef65
spec:
  maxReplicas: 10
  metrics:
  - pods:
      metric:
        name: http_requests_per_second
      target:
        averageValue: "10"
        type: AverageValue
    type: Pods
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-http-server
status:
  conditions:
  - lastTransitionTime: "2024-01-28T06:58:30Z"
    message: recommended size matches current size
    reason: ReadyForNewScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2024-01-28T06:58:30Z"
    message: the HPA was able to successfully calculate a replica count from pods metric http_requests_per_second
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2024-01-28T06:58:30Z"
    message: the desired count is within the acceptable range
    reason: DesiredWithinRange
    status: "False"
    type: ScalingLimited
  currentMetrics:
  - pods:
      current:
        averageValue: "0"
      metric:
        name: http_requests_per_second
    type: Pods
  currentReplicas: 10
  desiredReplicas: 10

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"go-http-server","namespace":"default"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"go-http-server"}},"template":{"metadata":{"labels":{"app":"go-http-server","prometheus.io/scrape":"true"}},"spec":{"containers":[{"image":"sharadregoti/go-http-server:lab-03","imagePullPolicy":"IfNotPresent","name":"go-http-server","ports":[{"containerPort":8080,"name":"http-metrics"}]}]}}}}
  creationTimestamp: "2024-01-27T07:21:45Z"
  generation: 18
  name: go-http-server
  namespace: default
  resourceVersion: "44684"
  uid: 903f1c61-9bd9-4e54-be02-d0ae702557be
spec:
  progressDeadlineSeconds: 600
  replicas: 10
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: go-http-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: go-http-server
        prometheus.io/scrape: "true"
    spec:
      containers:
      - image: sharadregoti/go-http-server:lab-03
        imagePullPolicy: IfNotPresent
        name: go-http-server
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 10
  conditions:
  - lastTransitionTime: "2024-01-27T07:21:45Z"
    lastUpdateTime: "2024-01-27T08:06:09Z"
    message: ReplicaSet "go-http-server-67b955d8df" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-01-28T06:23:32Z"
    lastUpdateTime: "2024-01-28T06:23:32Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 18
  readyReplicas: 10
  replicas: 10
  updatedReplicas: 10

The curl request below indicates that the custom metric value is zero:

curl -X GET "http://localhost:8001/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second?labelSelector=app=go-http-server"   -H "Authorization: Bearer $TOKEN"   -H "Accept: application/json"
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "go-http-server-67b955d8df-qjt6t",
        "apiVersion": "/v1"
      },
      "metricName": "http_requests_per_second",
      "timestamp": "2024-01-28T06:57:12Z",
      "value": "0",
      "selector": null
    }
  ]
}
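
One detail worth checking in the output above (an observation, not a confirmed diagnosis): the MetricValueList contains a single pod, while the Deployment reports 10 running replicas. Per the documented algorithm details, pods whose metric is missing are assumed to be at 100% of the target when a scale-down is considered, which on its own can keep the HPA pinned at maxReplicas. A quick way to compare what the adapter reports against what is running (jq assumed available):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second?labelSelector=app=go-http-server" | jq '.items | length'
kubectl get pods -l app=go-http-server --no-headers | wc -l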

@ccmcbeck

ccmcbeck commented Apr 17, 2024

We have the same problem on K8s 1.25, using HPA autoscaling/v2. The OP is on 1.25 also.

Is anyone on 1.26+ having the issue as well?

The only solution we could find was to disambiguate our matchLabels by including another label, so that no other workload's pods can match the same label combination.

The saddest part about this workaround is that matchLabels are immutable, so you have to delete the Deployment and suffer the downtime until the new Deployment has rolled out. Not only that, you lose all your Helm history, so a rollback is not possible. Bleh
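
For anyone applying this workaround, the goal is simply that no two workloads in a namespace share a selector that matches the same pods. A hypothetical fragment (the component label and its value are illustrative, not taken from this issue):

spec:
  selector:
    matchLabels:
      app: application-one
      country: ar
      app.kubernetes.io/component: web   # unique per Deployment in the namespace
  template:
    metadata:
      labels:
        app: application-one
        country: ar
        app.kubernetes.io/component: web

As noted above, spec.selector is immutable, so this change requires recreating the Deployment.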

@IgalSc

IgalSc commented Apr 18, 2024

I moved from k8s to EKS on AWS, with 3 clusters, one of them on 1.28 and two on 1.29.
HPA autoscaling/v2; same behaviour on all three.

@dims
Member

dims commented Apr 18, 2024

I moved from k8s to EKS on AWS, with 3 clusters, one of them on 1.28 and two on 1.29.

Please reach out to AWS/EKS support @IgalSc

@IgalSc

IgalSc commented Apr 18, 2024

@dims
Sorry, why do I need to reach out to AWS/EKS support?
We are talking about HPA autoscaling/v2 not scaling down. That has nothing to do with AWS or EKS.

@ccmcbeck

FWIW, our AWS Support rep pointed us to this thread. LOL

@ccmcbeck

But, to be fair, AWS contributes a lot of code to the K8s upstream https://chat.openai.com/share/52255931-cf9a-4a60-a450-730b2bb10220. We will escalate this within AWS and report back.

@gsGabriel

I don't know if this is a case for AWS support... I'm using Azure and the same thing happens. Earlier in the thread we have a reproduction in minikube, so I imagine it is something in Kubernetes itself.

@ccmcbeck

I don't know if this is a case for AWS support... I'm using Azure and the same thing happens. Earlier in the thread we have a reproduction in minikube, so I imagine it is something in Kubernetes itself.

Roger that. Maybe we can "prod" AWS to submit a fix. Meanwhile, I guess we have to perform a workaround similar to what I am proposing in #120875 (comment)

@ccmcbeck

The saddest part about this workaround is that matchLabels are immutable. So you have to delete the Deployment and suffer the downtime until the new Deployment has rolled out. Not only that you lose all your Helm history so a rollback is not possible. Bleh

For ZERO downtime on Production I guess you can try

  1. A second deployment with a different "instance name"
  2. Migrate to that second deployment
  3. Destroy the original deployment
  4. Redeploy the original with new matchLabels
  5. Migrate from second back to original
  6. Destroy the second

Imagine doing that for 50 microservices on Production on both your primary and failover clusters?

Just typing this out makes me wanna cry.

ivanELEC pushed a commit to ministryofjustice/laa-submit-crime-forms that referenced this issue May 20, 2024
## Description of change

 [crm457-1508](https://dsdmoj.atlassian.net/browse/CRM457-1508)

## Notes for reviewer
[Trying to differentiate between deployments for hpa to work
properly](kubernetes/kubernetes#120875 (comment))

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2024
@awsitcloudpro

/remove-lifecycle stale

This is still observed in K8s v1.29 and 1.30. There are no ambiguous label selectors in deployments affected by this, so some of the earlier comments in this thread about label matching do not apply.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2024
@ariretiarno

Same issue here.
The pods are below 80%, but they are not scaling down. 80% of 6 GiB is approximately 4915.2 MiB, which is the level at which scaling up should trigger. Since the pods are below this value, the HPA (Horizontal Pod Autoscaler) should be scaling down.

(screenshot attached)

@kholisrag

kholisrag commented Nov 19, 2024

Leaving a trace here...

Got the same issue with this, on GKE 1.28.

The fix is to update the Deployment's .spec.selector.matchLabels, as mentioned in the comment quoted below:

If it is caused by spec.selector.matchLabels labels of Deployment/application-one-ar matching pods outside of that deployment, then this is quite disappointing:

  • It's very poor UX to see the metric presented as way under the target but still desiredReplicas=maxReplicas. This makes no sense to a user.
  • That issue seems to be over 4 years old, with a constant stream of comments as well
  • I think it's mentioned in the docs here, but many users are unlikely to fully understand what is written and its implications

As I said here (3 years ago!):

The HPA scaleTargetRef uses deployment by name and that should be enough without having to worry about selectors.

I think this comment is still valid because from a user perspective a deployment is specified by name but underneath the pods are selected by something (potentially) totally different. This is kind of an obfuscation.

At a minimum, do you think a feature request to show this situation loud and clear in the status of the HorizontalPodAutoscaler would be valid?

Hopefully this helps clarify things for others jumping in here recently.

@yahiya-ayoub

Has anyone found a workaround or fix for this? I upgraded my Kubernetes to v1.30 and conducted load testing on the environment, and then the autoscaler started acting weird, scaling down and up on a schedule even though the metrics don't seem to be over the threshold. At first I thought it was a metrics-server issue, but I upgraded it and tried scaling policies, without any luck fixing the issue.
Any update please?

@gsGabriel

Has anyone found a workaround or fix for this? I upgraded my Kubernetes to v1.30 and conducted load testing on the environment, and then the autoscaler started acting weird, scaling down and up on a schedule even though the metrics don't seem to be over the threshold. At first I thought it was a metrics-server issue, but I upgraded it and tried scaling policies, without any luck fixing the issue. Any update please?

I did it using KEDA autoscaling instead of the default HPA.
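
For completeness, a minimal sketch of what that can look like with KEDA's CPU scaler, assuming the KEDA operator is installed (names and values are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: application-one-ar
spec:
  scaleTargetRef:
    name: application-one-ar        # Deployment to scale
  minReplicaCount: 3
  maxReplicaCount: 10
  triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"                   # target average CPU utilization, percent

Worth noting that KEDA manages a regular HPA under the hood, so selector hygiene on the target Deployment still matters.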

@ccmcbeck

Wow, KEDA. Thanks, @gsGabriel. Here is a YouTube video from the DevOps Toolkit about KEDA: https://youtu.be/3lcaawKAv6s?si=qWZ2as6AixzH_6EN

@yahiya-ayoub

Thank you, @gsGabriel and @ccmcbeck, for the quick reply. I want to hint at the root cause of my issue here for anyone who may be stuck like me.
One of my team members faced a problem while implementing readiness and liveness probes for our pods. They mistakenly set the deployment's replicas field to the maximum value of the Horizontal Pod Autoscaler (HPA) but forgot to revert it to the original value. This oversight caused unusual behavior: the HPA kept attempting to scale down the replica count, but after about four minutes the replica count would revert to the value stored in the deployment's replicas field. The solution was to adjust the replica number back to match the HPA's minReplicas value.

Thank you again, and I hope this will be helpful for others.
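
For anyone in the same state, the reset described above amounts to something like this (a sketch; substitute your own Deployment name and the HPA's minReplicas value):

kubectl scale deployment/<your-deployment> --replicas=3   # set back to the HPA's minReplicas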

@imdmahajankanika

imdmahajankanika commented Jan 23, 2025

Hello! I am having this issue with resource type memory. With resource type cpu, it is scaling up an down normally.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2025
@omerap12
Member

Spoke with @adrianmoisey (who also raised this in #124307) and will open a KEP to explore how to solve this issue.
/assign

@omerap12
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2025