HPA stuck at maxReplicas even though metric under target #120875

Open

max-rocket-internet opened this issue Sep 25, 2023 · 34 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@max-rocket-internet

max-rocket-internet commented Sep 25, 2023

What happened?

HPA does not reduce Deployment replica count even though resource metric is below target. It is stuck at maxReplicas.

What did you expect to happen?

Deployment replica count should be reduced.

How can we reproduce it (as minimally and precisely as possible)?

We can see multiple examples in our clusters, but we're not sure how to reproduce it exactly.

Here's some relevant kubectl output:

$ kubectl top pod -l app=application-one,country=ar
NAME                                                  CPU(cores)   MEMORY(bytes)
application-one-ar-76fd9bc76d-25wnr                   35m          909Mi
application-one-ar-76fd9bc76d-4d72r                   42m          778Mi
application-one-ar-76fd9bc76d-6pt7r                   35m          1189Mi
application-one-ar-76fd9bc76d-6z2mr                   29m          793Mi
application-one-ar-76fd9bc76d-cv6r9                   29m          837Mi
application-one-ar-76fd9bc76d-hrpd9                   32m          824Mi
application-one-ar-76fd9bc76d-mrgt8                   45m          1180Mi
application-one-ar-76fd9bc76d-qwjbs                   43m          1186Mi
application-one-ar-76fd9bc76d-sqf5h                   41m          797Mi
application-one-ar-76fd9bc76d-tlr6k                   39m          920Mi
application-one-collect-metrics-ar-7df9868cbf-h4db8   7m           595Mi


$ kubectl get deployment application-one-ar
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
application-one-ar   10/10   10           10          565d

$ kubectl get hpa application-one-ar
NAME                 REFERENCE                       TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
application-one-ar   Deployment/application-one-ar   20%/70%   3         10        10         565d

$ kubectl get pod -l app=application-one,country=ar
NAME                                             READY   STATUS      RESTARTS   AGE
application-one-ar-76fd9bc76d-25wnr              1/1     Running     0          3h56m
application-one-ar-76fd9bc76d-4d72r              1/1     Running     0          147m
application-one-ar-76fd9bc76d-6pt7r              1/1     Running     0          4d
application-one-ar-76fd9bc76d-6z2mr              1/1     Running     0          147m
application-one-ar-76fd9bc76d-cv6r9              1/1     Running     0          176m
application-one-ar-76fd9bc76d-hrpd9              1/1     Running     0          147m
application-one-ar-76fd9bc76d-mrgt8              1/1     Running     0          3d4h
application-one-ar-76fd9bc76d-qwjbs              1/1     Running     0          4d
application-one-ar-76fd9bc76d-sqf5h              1/1     Running     0          146m
application-one-ar-76fd9bc76d-tlr6k              1/1     Running     0          3h56m
application-one-clean-up-ar-28257720-96544       0/1     Completed   0          2d5h
application-one-clean-up-ar-28259160-zj2f4       0/1     Completed   0          29h
application-one-clean-up-ar-28260600-687q4       0/1     Completed   0          5h19m
application-one-xxxx-ar-7df9868cbf-h4db8         1/1     Running     0          2d11h
application-one-yyyyy-ar-28240560-5vtwj          0/1     Completed   0          14d
application-one-zzzzz-ar-28250640-sr28d          0/1     Completed   0          7d3h

The Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "115"
    meta.helm.sh/release-name: application-one-ar
    meta.helm.sh/release-namespace: default
  labels:
    app: application-one
    app.kubernetes.io/managed-by: Helm
    country: ar
    custom_app: application-one
    custom_country: ar
    custom_env: production
    custom_region: cloud-region-one
    environment: production
  name: application-one-ar
  namespace: default
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 10
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: application-one
      country: ar
  strategy:
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2023-06-25T14:57:33+02:00"
        prometheus.io/path: /prometheus
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: application-one
        country: ar
        custom_app: application-one
        custom_country: ar
        custom_env: production
        custom_region: cloud-region-one
        environment: production
        module: web
    spec:
      containers:
      - env: # removed
        imagePullPolicy: Always
        name: application-one-ar-production
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "3"
            memory: 1536M
          requests:
            cpu: 170m
            memory: 1536M
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30

status:
  availableReplicas: 10
  conditions:
  - lastTransitionTime: "2022-03-09T12:49:18Z"
    lastUpdateTime: "2023-09-21T14:24:09Z"
    message: ReplicaSet "application-one-ar-76fd9bc76d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-09-25T12:54:20Z"
    lastUpdateTime: "2023-09-25T12:54:20Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 8975
  readyReplicas: 10
  replicas: 10
  updatedReplicas: 10

The HorizontalPodAutoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: application-one-ar
    meta.helm.sh/release-namespace: default
  labels:
    app: application-one
    app.kubernetes.io/managed-by: Helm
    country: ar
    custom_app: application-one
    custom_country: ar
    custom_env: production
    custom_region: cloud-region-one
    environment: production
  name: application-one-ar
  namespace: default
spec:
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 70
        type: Utilization
    type: Resource
  minReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-one-ar
status:
  conditions:
  - lastTransitionTime: "2022-03-09T12:49:33Z"
    message: recommended size matches current size
    reason: ReadyForNewScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2023-09-21T14:22:44Z"
    message: the HPA was able to successfully calculate a replica count from cpu resource
      utilization (percentage of request)
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2023-09-25T10:52:44Z"
    message: the desired count is within the acceptable range
    reason: DesiredWithinRange
    status: "False"
    type: ScalingLimited
  currentMetrics:
  - resource:
      current:
        averageUtilization: 21
        averageValue: 33m
      name: cpu
    type: Resource
  currentReplicas: 10
  desiredReplicas: 10
  lastScaleTime: "2023-09-19T10:49:13Z"

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:21:56Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"darwin/arm64"}

Kustomize Version: v4.5.4

Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.12-eks-2d98532", GitCommit:"0aa16cf4fac4da27b9e9e9ba570b990867f6a3d8", GitTreeState:"clean", BuildDate:"2023-07-28T16:52:04Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS EKS

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

containerd://1.6.19

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@max-rocket-internet max-rocket-internet added the kind/bug Categorizes issue or PR as related to a bug. label Sep 25, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 25, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@max-rocket-internet
Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 25, 2023
@max-rocket-internet
Author

So is this the old classic HPA issue of the deployment selector labels matching pods outside of that deployment? e.g. application-one-xxxx-ar-7df9868cbf-h4db8 in the above output?
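
One way to check that hypothesis (a hedged sketch using the labels from the output above; adjust to your own selector): list every pod the selector matches and compare owners. The HPA controller lists pods using the label selector it gets from the Deployment's scale subresource, so anything else carrying app=application-one,country=ar (the collect-metrics, clean-up and other job pods above) is counted as well.

# Pods the HPA's metric calculation will consider:
kubectl get pods -l app=application-one,country=ar

# Which ReplicaSet/Job each of those pods actually belongs to:
kubectl get pods -l app=application-one,country=ar \
  -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name'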

@max-rocket-internet
Author

If it is caused by spec.selector.matchLabels labels of Deployment/application-one-ar matching pods outside of that deployment, then this is quite disappointing:

  • It's very poor UX to see the metric presented as way under the target but still desiredReplicas=maxReplicas. This makes no sense to a user.
  • That issue seems to be over 4 years old, with a constant stream of comments as well
  • I think it's mentioned in the docs here, but many users are unlikely to fully understand what is written and its implications

As I said here (3 years ago!):

The HPA scaleTargetRef uses deployment by name and that should be enough without having to worry about selectors.

I think this comment is still valid because from a user perspective a deployment is specified by name but underneath the pods are selected by something (potentially) totally different. This is kind of an obfuscation.

At a minimum, do you think a feature request to show this situation loud and clear in the status of the HorizontalPodAutoscaler would be valid?
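
For reference, the replica recommendation documented for the HPA is:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

Taken at face value, the numbers shown above (10 current replicas, 20% current vs. 70% target) would give ceil(10 * 20/70) = 3, i.e. scale down to minReplicas. That the status nevertheless reports desiredReplicas: 10 suggests the calculation is being fed a different pod set (or pods with missing metrics) than the TARGETS column implies, which is exactly the UX gap described in the bullets above.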

@rochacon

@max-rocket-internet I've been following #78761 since I experienced the same problem before and I agree that this should be treated as a bug.

While the selector conflict may be seen as a configuration bug on the operator's part, and is easily worked around by ensuring a unique set of selectors across all Deployments in a given namespace (e.g. I always include app.kubernetes.io/component so I can differentiate between several processes of the same application), the selection behavior of the HPA controller does not match the behavior of the Deployment/ReplicaSet controllers.

I took a quick look at the code and found that the autoscaler controller uses the Deployment's /scale API to retrieve the selector, which right now pretty much returns the Deployment's spec.selector. I believe this API should include the pod-template-hash label from the most recent fully progressed ReplicaSet in the .status.selector field. This would ensure the autoscaling controller only selects pods from the most recent healthy version, constrained to the referenced Deployment. I'm unsure whether pod-template-hash can end up with the same value in some cases.

The side effect I can think of from adding pod-template-hash to the scale selector is the HPA controller fighting the Deployment controller's replica count reductions during rollouts, which might take a while to complete on Deployments with a large number of replicas.
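
For anyone who wants to see exactly what the HPA controller is handed, the scale subresource can be read directly (a sketch; jq is only used for readability):

kubectl get --raw /apis/apps/v1/namespaces/default/deployments/application-one-ar/scale | jq .
# The controller uses .status.selector from this object as its pod label selector,
# which today is just the serialized spec.selector of the Deployment.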

@IgalSc

IgalSc commented Sep 28, 2023

What worked for me to resolve the issue of the HPA not scaling down, despite the CPU/memory utilization being below target, was to remove spec.replicas from the Deployment.

@IgalSc

IgalSc commented Sep 29, 2023

Commenting on my previous statement: the moment the deployment scales up, the HPA does not scale it down until spec.replicas is removed again.
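
If the replica count keeps coming back, one common cause is a manifest that still contains spec.replicas being re-applied (by CI, Helm, or kubectl apply), which resets whatever the HPA had chosen. A sketch of the migration step the Kubernetes docs describe for this, assuming client-side apply is in use:

# Remove spec.replicas from the last-applied-configuration annotation so the
# next `kubectl apply` no longer resets the replica count managed by the HPA:
kubectl apply edit-last-applied deployment/<your-deployment>
# ...delete the `replicas:` line in the editor, save, and also remove the field
# from the manifest in source control.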

@wpferreira01

I solved my problem like this:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60
    selectPolicy: Min

@max-rocket-internet
Author

max-rocket-internet commented Oct 4, 2023

Commenting on my previous statement: the moment the deployment scales up, the HPA does not scale it down until spec.replicas is removed again.

So it's not resolved for you then.

@BogdanGeorge

BogdanGeorge commented Oct 10, 2023

I get the same problem with an HPA scaling based on memory:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sessions-bus-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sessions-bus
  maxReplicas: 3
  minReplicas: 1
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 90
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 0
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60

NAME                      REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
sessions-bus-autoscaler   Deployment/sessions-bus   50%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   54%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   52%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   51%/90%   1         3         3          44h
sessions-bus-autoscaler   Deployment/sessions-bus   50%/90%   1         3         3          44h

All the running pods have memory under the target and it still doesn't scale down. I also tried wpferreira01's workaround, even though I have only one policy per scale type, but it didn't help.
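
When the displayed target looks fine but the replica count will not drop, the controller's own view is worth capturing before anything else (a generic check, not a confirmed diagnosis for this case):

kubectl describe hpa sessions-bus-autoscaler
# The Conditions block (AbleToScale, ScalingActive, ScalingLimited) and the Events
# list show the controller's reasoning for its most recent scaling decisions.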

@gsGabriel

I got the same issue.

@George-Spanos

To maybe help the conversation: I was investigating this matter as well for the last 4 hours. I'm using minikube and I'm fairly new to k8s. In my case I have an app that scales on CPU utilization only. It needed 5-10 minutes to scale back to minimum replicas, but eventually it did.
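
That delay is consistent with the default scale-down stabilization window of 300 seconds (5 minutes): the HPA only scales down to the highest recommendation made during that window. If faster scale-down is acceptable for a workload, the window can be shortened per HPA; a fragment of an autoscaling/v2 spec as a sketch:

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # default is 300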

@sharadregoti

I am also having a similar problem, but I am using custom metrics from the Prometheus adapter.

After scaling to max pods, it is not scaling down, even though the metric is currently zero, below the threshold defined in the HPA.

I have also checked the label issue @max-rocket-internet was talking about: I don't have any other pods with the same labels.

Here is my HPA spec:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling/v2","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"go-http-server-hpa","namespace":"default"},"spec":{"maxReplicas":10,"metrics":[{"pods":{"metric":{"name":"http_requests_per_second"},"target":{"averageValue":10,"type":"AverageValue"}},"type":"Pods"}],"minReplicas":1,"scaleTargetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"go-http-server"}}}
  creationTimestamp: "2024-01-28T06:57:36Z"
  name: go-http-server-hpa
  namespace: default
  resourceVersion: "47420"
  uid: 6464aa24-80da-4720-b409-51be262fef65
spec:
  maxReplicas: 10
  metrics:
  - pods:
      metric:
        name: http_requests_per_second
      target:
        averageValue: "10"
        type: AverageValue
    type: Pods
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-http-server
status:
  conditions:
  - lastTransitionTime: "2024-01-28T06:58:30Z"
    message: recommended size matches current size
    reason: ReadyForNewScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2024-01-28T06:58:30Z"
    message: the HPA was able to successfully calculate a replica count from pods metric http_requests_per_second
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2024-01-28T06:58:30Z"
    message: the desired count is within the acceptable range
    reason: DesiredWithinRange
    status: "False"
    type: ScalingLimited
  currentMetrics:
  - pods:
      current:
        averageValue: "0"
      metric:
        name: http_requests_per_second
    type: Pods
  currentReplicas: 10
  desiredReplicas: 10

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"go-http-server","namespace":"default"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"go-http-server"}},"template":{"metadata":{"labels":{"app":"go-http-server","prometheus.io/scrape":"true"}},"spec":{"containers":[{"image":"sharadregoti/go-http-server:lab-03","imagePullPolicy":"IfNotPresent","name":"go-http-server","ports":[{"containerPort":8080,"name":"http-metrics"}]}]}}}}
  creationTimestamp: "2024-01-27T07:21:45Z"
  generation: 18
  name: go-http-server
  namespace: default
  resourceVersion: "44684"
  uid: 903f1c61-9bd9-4e54-be02-d0ae702557be
spec:
  progressDeadlineSeconds: 600
  replicas: 10
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: go-http-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: go-http-server
        prometheus.io/scrape: "true"
    spec:
      containers:
      - image: sharadregoti/go-http-server:lab-03
        imagePullPolicy: IfNotPresent
        name: go-http-server
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 10
  conditions:
  - lastTransitionTime: "2024-01-27T07:21:45Z"
    lastUpdateTime: "2024-01-27T08:06:09Z"
    message: ReplicaSet "go-http-server-67b955d8df" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-01-28T06:23:32Z"
    lastUpdateTime: "2024-01-28T06:23:32Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 18
  readyReplicas: 10
  replicas: 10
  updatedReplicas: 10

The curl request below indicates that the custom metric value is zero:

curl -X GET "http://localhost:8001/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second?labelSelector=app=go-http-server"   -H "Authorization: Bearer $TOKEN"   -H "Accept: application/json"
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "go-http-server-67b955d8df-qjt6t",
        "apiVersion": "/v1"
      },
      "metricName": "http_requests_per_second",
      "timestamp": "2024-01-28T06:57:12Z",
      "value": "0",
      "selector": null
    }
  ]
}
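
One detail worth checking in the output above (an observation, not a confirmed diagnosis): the MetricValueList contains a single pod, while the Deployment reports 10 running replicas. Per the documented algorithm details, pods whose metric is missing are assumed to be at 100% of the target when a scale-down is considered, which on its own can keep the HPA pinned at maxReplicas. A quick way to compare what the adapter reports against what is running (jq assumed available):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second?labelSelector=app=go-http-server" | jq '.items | length'
kubectl get pods -l app=go-http-server --no-headers | wc -l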

@ccmcbeck

ccmcbeck commented Apr 17, 2024

We have the same problem on K8s 1.25, using HPA autoscaling/v2. The OP is on 1.25 also.

Is anyone on 1.26+ having the issue as well?

The only solution we could find was to disambiguate our matchLabels by including another label, so that no other workload's pods can match the same label combination.

The saddest part about this workaround is that matchLabels are immutable, so you have to delete the Deployment and suffer the downtime until the new Deployment has rolled out. Not only that, you lose all your Helm history, so a rollback is not possible. Bleh
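
For anyone applying this workaround, the goal is simply that no two workloads in a namespace share a selector that matches the same pods. A hypothetical fragment (the component label and its value are illustrative, not taken from this issue):

spec:
  selector:
    matchLabels:
      app: application-one
      country: ar
      app.kubernetes.io/component: web   # unique per Deployment in the namespace
  template:
    metadata:
      labels:
        app: application-one
        country: ar
        app.kubernetes.io/component: web

As noted above, spec.selector is immutable, so this change requires recreating the Deployment.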

@IgalSc

IgalSc commented Apr 18, 2024

I moved from k8s to EKS on AWS, with 3 clusters, one of them on 1.28 and two on 1.29.
HPA autoscaling/v2; same behaviour on all three.

@dims
Member

dims commented Apr 18, 2024

I moved from k8s to EKS on AWS, with 3 clusters, one of them on 1.28 and two on 1.29.

Please reach out to AWS/EKS support @IgalSc

@IgalSc

IgalSc commented Apr 18, 2024

@dims
Sorry, why do I need to reach out to AWS/EKS support?
We are talking about HPA autoscaling/v2 not scaling down. That has nothing to do with AWS or EKS.

@ccmcbeck

FWIW, our AWS Support rep pointed us to this thread. LOL

@ccmcbeck

But, to be fair, AWS contributes a lot of code to the K8s upstream https://chat.openai.com/share/52255931-cf9a-4a60-a450-730b2bb10220. We will escalate this within AWS and report back.

@gsGabriel

I don't know if this is a case for AWS support... I'm using Azure and the same thing happens. Earlier in the thread we have a reproduction in minikube, so I imagine it is something in Kubernetes itself.

@ccmcbeck

I don't know if this is a case for AWS support... I'm using Azure and the same thing happens. Earlier in the thread we have a reproduction in minikube, so I imagine it is something in Kubernetes itself.

Roger that. Maybe we can "prod" AWS to submit a fix. Meanwhile, I guess we have to perform a workaround similar to what I am proposing in #120875 (comment)

@ccmcbeck

The saddest part about this workaround is that matchLabels are immutable. So you have to delete the Deployment and suffer the downtime until the new Deployment has rolled out. Not only that you lose all your Helm history so a rollback is not possible. Bleh

For ZERO downtime on Production I guess you can try

  1. A second deployment with a different "instance name"
  2. Migrate to that second deployment
  3. Destroy the original deployment
  4. Redeploy the original with new matchLabels
  5. Migrate from second back to original
  6. Destroy the second

Imagine doing that for 50 microservices on Production on both your primary and failover clusters?

Just typing this out makes me wanna cry.

ivanELEC pushed a commit to ministryofjustice/laa-submit-crime-forms that referenced this issue May 20, 2024
## Description of change

 [crm457-1508](https://dsdmoj.atlassian.net/browse/CRM457-1508)

## Notes for reviewer
[Trying to differentiate between deployments for hpa to work
properly](kubernetes/kubernetes#120875 (comment))

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2024
@awsitcloudpro

/remove-lifecycle stale

This is still observed in K8s v1.29 and 1.30. There are no ambiguous label selectors in deployments affected by this, so some of the earlier comments in this thread about label matching do not apply.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2024
@ariretiarno

Same issue here.
The pods are below 80%, but they are not scaling down. 80% of 6 GiB is approximately 4915.2 MiB, which is the level at which scaling up should trigger. Since the pods are below this value, the HPA (Horizontal Pod Autoscaler) should be scaling down.

(screenshot attached)

@kholisrag

kholisrag commented Nov 19, 2024

Leaving a trace here...

Got the same issue with this, on GKE 1.28.

The fix is to update the Deployment's .spec.selector.matchLabels, as mentioned in the comment quoted below:

If it is caused by spec.selector.matchLabels labels of Deployment/application-one-ar matching pods outside of that deployment, then this is quite disappointing:

  • It's very poor UX to see the metric presented as way under the target but still desiredReplicas=maxReplicas. This makes no sense to a user.
  • That issue seems to be over 4 years old, with a constant stream of comments as well
  • I think it's mentioned in the docs here, but many users are unlikely to fully understand what is written and its implications

As I said here (3 years ago!):

The HPA scaleTargetRef uses deployment by name and that should be enough without having to worry about selectors.

I think this comment is still valid because from a user perspective a deployment is specified by name but underneath the pods are selected by something (potentially) totally different. This is kind of an obfuscation.

At a minimum, do you think a feature request to show this situation loud and clear in the status of the HorizontalPodAutoscaler would be valid?

Hopefully this helps clarify things for others jumping in here recently.

@yahiya-ayoub

Has anyone found a workaround or fix for this? I upgraded my Kubernetes to v1.30 and conducted load testing on the environment, and then the autoscaler started acting weird, scaling down and up on a schedule even though the metrics don't seem to be over the threshold. At first I thought it was a metrics-server issue, but I upgraded it and tried scaling policies, without any luck fixing the issue.
Any update please?

@gsGabriel

Has anyone found a workaround or fix for this? I upgraded my Kubernetes to v1.30 and conducted load testing on the environment, and then the autoscaler started acting weird, scaling down and up on a schedule even though the metrics don't seem to be over the threshold. At first I thought it was a metrics-server issue, but I upgraded it and tried scaling policies, without any luck fixing the issue. Any update please?

I did it using KEDA autoscaling instead of the default HPA.
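
For completeness, a minimal sketch of what that can look like with KEDA's CPU scaler, assuming the KEDA operator is installed (names and values are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: application-one-ar
spec:
  scaleTargetRef:
    name: application-one-ar        # Deployment to scale
  minReplicaCount: 3
  maxReplicaCount: 10
  triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"                   # target average CPU utilization, percent

Worth noting that KEDA manages a regular HPA under the hood, so selector hygiene on the target Deployment still matters.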

@ccmcbeck

Wow, KEDA. Thanks, @gsGabriel. Here is a YouTube video from the DevOps Toolkit about KEDA: https://youtu.be/3lcaawKAv6s?si=qWZ2as6AixzH_6EN

@yahiya-ayoub

Thank you, @gsGabriel and @ccmcbeck, for the quick reply. I want to hint at the root cause of my issue here for anyone who may be stuck like me.
One of my team members faced a problem while implementing readiness and liveness probes for our pods. They mistakenly set the deployment's replicas field to the maximum value of the Horizontal Pod Autoscaler (HPA) but forgot to revert it to the original value. This oversight caused unusual behavior: the HPA kept attempting to scale down the replica count, but after about four minutes the replica count would revert to the value stored in the deployment's replicas field. The solution was to adjust the replica number back to match the HPA's minReplicas value.

Thank you again, and I hope this will be helpful for others.
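
For anyone in the same state, the reset described above amounts to something like this (a sketch; substitute your own Deployment name and the HPA's minReplicas value):

kubectl scale deployment/<your-deployment> --replicas=3   # set back to the HPA's minReplicas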

@imdmahajankanika

imdmahajankanika commented Jan 23, 2025

Hello! I am having this issue with resource type memory. With resource type cpu, it is scaling up an down normally.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2025
@omerap12
Member

Spoke with @adrianmoisey (who also raised this in #124307) and will open a KEP to explore how to solve this issue.
/assign

@omerap12
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2025