Allow HPA to scale out when no matched Pods are ready #130130


Open

zheyli opened this issue Feb 13, 2025 · 19 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@zheyli

zheyli commented Feb 13, 2025

What happened?

One of our production pools could not scale up when its metric exceeded the scaling threshold, because all of its Pods became unready at the moment traffic peaked. Digging into the source code, we found that the HPA calculates the desired replica count from the ready pod count, which causes the recommended replica count to stay at 0.

func (c *ReplicaCalculator) getUsageRatioReplicaCount(currentReplicas int32, usageRatio float64, namespace string, selector labels.Selector) (replicaCount int32, timestamp time.Time, err error) {
	if currentReplicas != 0 {
		if math.Abs(1.0-usageRatio) <= c.tolerance {
			// return the current replicas if the change would be too small
			return currentReplicas, timestamp, nil
		}
		readyPodCount := int64(0)
		readyPodCount, err = c.getReadyPodsCount(namespace, selector)
		if err != nil {
			return 0, time.Time{}, fmt.Errorf("unable to calculate ready pods: %s", err)
		}
		replicaCount = int32(math.Ceil(usageRatio * float64(readyPodCount)))
	} else {
		// Scale to zero or n pods depending on usageRatio
		replicaCount = int32(math.Ceil(usageRatio))
	}

	return replicaCount, timestamp, err
}

What did you expect to happen?

Could you please explain why the ready pod count is used, and is there any way to refine the implementation?

How can we reproduce it (as minimally and precisely as possible)?

Create a Deployment whose Pods are all unready, or run a load test against a Deployment and make all of its Pods crash.
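
For illustration, a minimal, hypothetical repro could use a Deployment whose readiness probe always fails, so every Pod stays unready; point an HPA that uses an external or object metric at it and the recommendation stays at 0 (names and image below are placeholders, not from this report):

# Hypothetical repro: the readiness probe never succeeds, so all Pods stay unready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: never-ready
spec:
  replicas: 2
  selector:
    matchLabels:
      app: never-ready
  template:
    metadata:
      labels:
        app: never-ready
    spec:
      containers:
      - name: app
        image: busybox:1.36
        command: ["sh", "-c", "sleep 3600"]
        readinessProbe:
          exec:
            command: ["false"]  # always fails, so the Pod never becomes Ready
          periodSeconds: 5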

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"28+", GitVersion:"v1.28.12-86+f93862c0382718-dirty", GitCommit:"f93862c038271868c434c93cbae3d08e06ca281f", GitTreeState:"dirty", BuildDate:"2024-12-12T23:30:33Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@zheyli zheyli added the kind/bug Categorizes issue or PR as related to a bug. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2025
@zheyli
Author

zheyli commented Feb 13, 2025

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 13, 2025
@pacoxu
Member

pacoxu commented Feb 13, 2025

cc sig autoscaling maintainers: @gjtempleton @MaciekPytel

is this a bug or a known issue?

@pacoxu
Member

pacoxu commented Feb 13, 2025

refer to #51650? (I will take a look tomorrow)

@pacoxu
Member

pacoxu commented Feb 14, 2025

The logic was added in #60886.

Due to technical constraints, the HorizontalPodAutoscaler controller cannot exactly determine the first time a pod becomes ready when determining whether to set aside certain CPU metrics. Instead, it considers a Pod "not yet ready" if it's unready and transitioned to ready within a short, configurable window of time since it started. This value is configured with the --horizontal-pod-autoscaler-initial-readiness-delay flag, and its default is 30 seconds. Once a pod has become ready, it considers any transition to ready to be the first if it occurred within a longer, configurable time since it started. This value is configured with the --horizontal-pod-autoscaler-cpu-initialization-period flag, and its default is 5 minutes.

Meanwhile, per https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details (quoted above) and #67252, there is a --horizontal-pod-autoscaler-cpu-initialization-period flag. This may help in your case as well.
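
For reference, both settings are kube-controller-manager flags; a sketch with their documented defaults:

--horizontal-pod-autoscaler-initial-readiness-delay=30s
--horizontal-pod-autoscaler-cpu-initialization-period=5m0s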

@zheyli
Author

zheyli commented Feb 15, 2025

Hi @pacoxu, thanks for your explanation. However, we are using external metrics to drive our scaling so I think --horizontal-pod-autoscaler-cpu-initialization-period would not be applicable to our use case.

@zheyli
Author

zheyli commented Feb 17, 2025

Hi @pacoxu, could you please help move this issue forward?

@gjtempleton
Member

I'd argue that this is a known issue/design choice rather than a bug. As you pointed out, @pacoxu, the behaviour for object and external metrics was set to be roughly the same as for resource metrics by #60886 (#33593 was the original source of a lot of these choices).

Definitely a valid use case that we don't currently support nicely.

@gjtempleton
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 17, 2025
@gjtempleton
Member

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 17, 2025
@zheyli
Author

zheyli commented Feb 18, 2025

@gjtempleton Thanks for the explanation. Does your team have any plans to solve this problem?

@sftim
Contributor

sftim commented Feb 18, 2025

@zheyli you can answer that question self-service: the backlog of issues is public, so you can search through them yourself.
Asking the maintainers to do that work for you is possible, but they may well decline.

@sftim
Contributor

sftim commented Feb 18, 2025

Given #130130 (comment)
/remove-kind bug
/kind feature
/retitle Allow HPA to scale out when no matched Pods are ready

@k8s-ci-robot k8s-ci-robot changed the title [HorizontalPodAutoscaler] Cannot scale up when all pods are not ready Allow HPA to scale out when no matched Pods are ready Feb 18, 2025
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 18, 2025
@omerap12
Member

@gjtempleton
Just an idea: what if we use current replicas as fallback when there are no ready pods?

So instead of always using ready_pods (which becomes 0 in our problem case), we could do:

  • Normal case: use ready_pods (keep current behavior)
  • When no ready pods: use current_replicas

This way we don't get stuck at 0 when pods are unhealthy, and it should be pretty simple to implement.
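
A rough, hypothetical sketch of this fallback, mirroring the getUsageRatioReplicaCount snippet quoted in the issue body (recommendReplicas and its parameters are illustrative names, not the actual controller code):

package main

import (
	"fmt"
	"math"
)

// recommendReplicas computes the desired replica count from the usage ratio,
// falling back to currentReplicas as the base when no pods are ready.
func recommendReplicas(currentReplicas int32, usageRatio float64, readyPodCount int64, tolerance float64) int32 {
	if currentReplicas == 0 {
		// Scaling from zero: base the recommendation purely on the usage ratio, as today.
		return int32(math.Ceil(usageRatio))
	}
	if math.Abs(1.0-usageRatio) <= tolerance {
		// The change would be within tolerance; keep the current size.
		return currentReplicas
	}
	base := float64(readyPodCount)
	if readyPodCount == 0 {
		// Proposed fallback: no ready pods, so use current replicas as the base
		// instead of recommending zero.
		base = float64(currentReplicas)
	}
	return int32(math.Ceil(usageRatio * base))
}

func main() {
	// With 3 unready replicas and usage at twice the target, the fallback
	// recommends 6 replicas instead of getting stuck at 0.
	fmt.Println(recommendReplicas(3, 2.0, 0, 0.1))
}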

@adrianmoisey
Member

@gjtempleton Just an idea: what if we use current replicas as fallback when there are no ready pods?

So instead of always using ready_pods (which becomes 0 in our problem case), we could do:

  • Normal case: use ready_pods (keep current behavior)
  • When no ready pods: use current_replicas

This way we don't get stuck at 0 when pods are unhealthy, and it should be pretty simple to implement.

Having a quick look at the code, this seems sane to me. I'm just not 100% sure if there are edge cases we need to be aware of, or if this will change the behaviour in a way that's not expected for the user.

@omerap12
Member

I'll try to write something and see where it goes.
/assign

@reganmcdonalds4

reganmcdonalds4 commented May 20, 2025

Hi @omerap12, thank you for taking this on. Using current replicas as a fallback during an extreme case of all pods becoming unready can prevent the emergency situation of being stuck at zero pods, but I think users utilizing external metrics would benefit from having the option to use replica count instead of ready pods from the start. In the scenario where pods start becoming unready due to an increase in external metric utilization, since HPA does not account for unready pods, we do not see the scaling behavior we would expect. HPA does slowly scale up, but not nearly as aggressively as it should according to the scale up policy. Instead utilization continues to cause pods to become unready until there is potentially service degradation or interruption.

@omerap12
Member

omerap12 commented May 21, 2025

Hi @omerap12, thank you for taking this on. Using current replicas as a fallback during an extreme case of all pods becoming unready can prevent the emergency situation of being stuck at zero pods, but I think users utilizing external metrics would benefit from having the option to use replica count instead of ready pods from the start. In the scenario where pods start becoming unready due to an increase in external metric utilization, since HPA does not account for unready pods, we do not see the scaling behavior we would expect. HPA does slowly scale up, but not nearly as aggressively as it should according to the scale up policy. Instead utilization continues to cause pods to become unready until there is potentially service degradation or interruption.

No problem at all :)
EDIT: After reading #60886 again, I think letting the HPA count unready pods when deciding how many replicas to run could be a problem. If all pods are unready, how can the HPA know if it's because of high traffic or something else? The HPA doesn’t know why the pods are unready. If we scale based on unready pods, we might just create more "broken" pods and expand the "blast radius." For example, if the pods can’t connect to the database, adding more won’t fix the issue - it might even make it worse by putting more load on the database.

In this case, I think it’s better to trigger an alert for the team, rather than expecting the HPA to handle it on its own.
So, I agree with the idea in #60886 to ignore unready pods when scaling.

@reganmcdonalds4

reganmcdonalds4 commented May 21, 2025

I agree with everything you're saying. This seems to be a major downside of using external metrics for HPA. The most common scenario we have seen for pods becoming unready is that our Puma workers are exhausted on several pods. The average of our external metric, Puma worker utilization, reaches the scaling target, but since there are unready pods the HPA does not scale as quickly as it should. We have also had some Puma worker latency due to slow DB queries, which increased load on the DB; in the end the interim solution was to scale up the DB until optimization could be performed, and the HPA saved us from an actual outage. In the event pods can't connect to the DB, I would rather set up alerting for that type of event than have external-metric-based HPA not work properly.

@omerap12
Member

Yeah, exactly — since the HPA doesn’t really know why pods are unready, scaling based on them could cause more problems, especially if the issue isn’t because of high load. I think it’s safer not to scale unready pods.
