Allow HPA to scale out when no matched Pods are ready #130130


Open

zheyli opened this issue Feb 13, 2025 · 19 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@zheyli

zheyli commented Feb 13, 2025

What happened?

One of our production pools could not scale up when its metric exceeded the scaling threshold, because all of its Pods became unready at the moment traffic peaked. Digging into the source code, we found that the HPA calculates the desired replica count from the ready pod count, which causes the recommended replica count to stay at 0.

func (c *ReplicaCalculator) getUsageRatioReplicaCount(currentReplicas int32, usageRatio float64, namespace string, selector labels.Selector) (replicaCount int32, timestamp time.Time, err error) {
	if currentReplicas != 0 {
		if math.Abs(1.0-usageRatio) <= c.tolerance {
			// return the current replicas if the change would be too small
			return currentReplicas, timestamp, nil
		}
		readyPodCount := int64(0)
		readyPodCount, err = c.getReadyPodsCount(namespace, selector)
		if err != nil {
			return 0, time.Time{}, fmt.Errorf("unable to calculate ready pods: %s", err)
		}
		replicaCount = int32(math.Ceil(usageRatio * float64(readyPodCount)))
	} else {
		// Scale to zero or n pods depending on usageRatio
		replicaCount = int32(math.Ceil(usageRatio))
	}

	return replicaCount, timestamp, err
}

What did you expect to happen?

Could you please explain why the ready pod count is used, and is there any way to refine the implementation?

How can we reproduce it (as minimally and precisely as possible)?

Create a Deployment whose Pods are all unready, or run a load test against a Deployment and make all of its Pods crash.
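
For illustration, a minimal, hypothetical repro could use a Deployment whose readiness probe always fails, so every Pod stays unready; point an HPA that uses an external or object metric at it and the recommendation stays at 0 (names and image below are placeholders, not from this report):

# Hypothetical repro: the readiness probe never succeeds, so all Pods stay unready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: never-ready
spec:
  replicas: 2
  selector:
    matchLabels:
      app: never-ready
  template:
    metadata:
      labels:
        app: never-ready
    spec:
      containers:
      - name: app
        image: busybox:1.36
        command: ["sh", "-c", "sleep 3600"]
        readinessProbe:
          exec:
            command: ["false"]  # always fails, so the Pod never becomes Ready
          periodSeconds: 5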

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"28+", GitVersion:"v1.28.12-86+f93862c0382718-dirty", GitCommit:"f93862c038271868c434c93cbae3d08e06ca281f", GitTreeState:"dirty", BuildDate:"2024-12-12T23:30:33Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@zheyli zheyli added the kind/bug Categorizes issue or PR as related to a bug. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2025
@zheyli
Author

zheyli commented Feb 13, 2025

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 13, 2025
@pacoxu
Member

pacoxu commented Feb 13, 2025

cc sig autoscaling maintainers: @gjtempleton @MaciekPytel

is this a bug or a known issue?

@pacoxu
Member

pacoxu commented Feb 13, 2025

refer to #51650? (I will take a look tomorrow)

@pacoxu
Member

pacoxu commented Feb 14, 2025

The logic was added in #60886.

Due to technical constraints, the HorizontalPodAutoscaler controller cannot exactly determine the first time a pod becomes ready when determining whether to set aside certain CPU metrics. Instead, it considers a Pod "not yet ready" if it's unready and transitioned to ready within a short, configurable window of time since it started. This value is configured with the --horizontal-pod-autoscaler-initial-readiness-delay flag, and its default is 30 seconds. Once a pod has become ready, it considers any transition to ready to be the first if it occurred within a longer, configurable time since it started. This value is configured with the --horizontal-pod-autoscaler-cpu-initialization-period flag, and its default is 5 minutes.

Meanwhile, per https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details (quoted above) and #67252, there is a --horizontal-pod-autoscaler-cpu-initialization-period flag. This may help in your case as well.
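
For reference, both settings are kube-controller-manager flags; a sketch with their documented defaults:

--horizontal-pod-autoscaler-initial-readiness-delay=30s
--horizontal-pod-autoscaler-cpu-initialization-period=5m0s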

@zheyli
Author

zheyli commented Feb 15, 2025

Hi @pacoxu, thanks for your explanation. However, we are using external metrics to drive our scaling so I think --horizontal-pod-autoscaler-cpu-initialization-period would not be applicable to our use case.

@zheyli
Author

zheyli commented Feb 17, 2025

Hi @pacoxu, could you please help move this issue forward?

@gjtempleton
Member

I'd argue that this is a known issue/design choice rather than a bug. As you pointed out, @pacoxu, the behaviour for object and external metrics was set to be roughly the same as for resource metrics by #60886 (#33593 was the original source of a lot of these choices).

Definitely a valid use case that we don't currently support nicely.

@gjtempleton
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 17, 2025
@gjtempleton
Member

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 17, 2025
@zheyli
Author

zheyli commented Feb 18, 2025

@gjtempleton Thanks for the explanation. Does your team have any plans to solve this problem?

@sftim
Contributor

sftim commented Feb 18, 2025

@zheyli you can answer that question self-service: the backlog of issues is public, so you can search through them yourself.
Asking the maintainers to do that work for you is possible, but they may well decline.

@sftim
Contributor

sftim commented Feb 18, 2025

Given #130130 (comment)
/remove-kind bug
/kind feature
/retitle Allow HPA to scale out when no matched Pods are ready

@k8s-ci-robot k8s-ci-robot changed the title [HorizontalPodAutoscaler] Cannot scale up when all pods are not ready Allow HPA to scale out when no matched Pods are ready Feb 18, 2025
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 18, 2025
@omerap12
Member

@gjtempleton
Just an idea: what if we use current replicas as fallback when there are no ready pods?

So instead of always using ready_pods (which becomes 0 in our problem case), we could do:

  • Normal case: use ready_pods (keep current behavior)
  • When no ready pods: use current_replicas

This way we don't get stuck at 0 when pods are unhealthy, and it should be pretty simple to implement.
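
A rough, hypothetical sketch of this fallback, mirroring the getUsageRatioReplicaCount snippet quoted in the issue body (recommendReplicas and its parameters are illustrative names, not the actual controller code):

package main

import (
	"fmt"
	"math"
)

// recommendReplicas computes the desired replica count from the usage ratio,
// falling back to currentReplicas as the base when no pods are ready.
func recommendReplicas(currentReplicas int32, usageRatio float64, readyPodCount int64, tolerance float64) int32 {
	if currentReplicas == 0 {
		// Scaling from zero: base the recommendation purely on the usage ratio, as today.
		return int32(math.Ceil(usageRatio))
	}
	if math.Abs(1.0-usageRatio) <= tolerance {
		// The change would be within tolerance; keep the current size.
		return currentReplicas
	}
	base := float64(readyPodCount)
	if readyPodCount == 0 {
		// Proposed fallback: no ready pods, so use current replicas as the base
		// instead of recommending zero.
		base = float64(currentReplicas)
	}
	return int32(math.Ceil(usageRatio * base))
}

func main() {
	// With 3 unready replicas and usage at twice the target, the fallback
	// recommends 6 replicas instead of getting stuck at 0.
	fmt.Println(recommendReplicas(3, 2.0, 0, 0.1))
}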

@adrianmoisey
Member

@gjtempleton Just an idea: what if we use current replicas as fallback when there are no ready pods?

So instead of always using ready_pods (which becomes 0 in our problem case), we could do:

  • Normal case: use ready_pods (keep current behavior)
  • When no ready pods: use current_replicas

This way we don't get stuck at 0 when pods are unhealthy, and it should be pretty simple to implement.

Having a quick look at the code, this seems sane to me. I'm just not 100% sure if there are edge cases we need to be aware of, or if this will change the behaviour in a way that's not expected for the user.

@omerap12
Member

I'll try to write something and see where it goes.
/assign

@reganmcdonalds4

reganmcdonalds4 commented May 20, 2025

Hi @omerap12, thank you for taking this on. Using current replicas as a fallback during an extreme case of all pods becoming unready can prevent the emergency situation of being stuck at zero pods, but I think users utilizing external metrics would benefit from having the option to use replica count instead of ready pods from the start. In the scenario where pods start becoming unready due to an increase in external metric utilization, since HPA does not account for unready pods, we do not see the scaling behavior we would expect. HPA does slowly scale up, but not nearly as aggressively as it should according to the scale up policy. Instead utilization continues to cause pods to become unready until there is potentially service degradation or interruption.

@omerap12
Member

omerap12 commented May 21, 2025

Hi @omerap12, thank you for taking this on. Using current replicas as a fallback during an extreme case of all pods becoming unready can prevent the emergency situation of being stuck at zero pods, but I think users utilizing external metrics would benefit from having the option to use replica count instead of ready pods from the start. In the scenario where pods start becoming unready due to an increase in external metric utilization, since HPA does not account for unready pods, we do not see the scaling behavior we would expect. HPA does slowly scale up, but not nearly as aggressively as it should according to the scale up policy. Instead utilization continues to cause pods to become unready until there is potentially service degradation or interruption.

No problem at all :)
EDIT: After reading #60886 again, I think letting the HPA count unready pods when deciding how many replicas to run could be a problem. If all pods are unready, how can the HPA know if it's because of high traffic or something else? The HPA doesn’t know why the pods are unready. If we scale based on unready pods, we might just create more "broken" pods and expand the "blast radius." For example, if the pods can’t connect to the database, adding more won’t fix the issue - it might even make it worse by putting more load on the database.

In this case, I think it’s better to trigger an alert for the team, rather than expecting the HPA to handle it on its own.
So, I agree with the idea in #60886 to ignore unready pods when scaling.

@reganmcdonalds4

reganmcdonalds4 commented May 21, 2025

I agree with everything you're saying. This seems to be a major downside of using external metrics for HPA. The most common scenario we have seen for pods becoming unready is that our Puma workers are exhausted on several pods. The average of our external metric, Puma worker utilization, reaches the scaling target, but since there are unready pods the HPA does not scale as quickly as it should. We have also had some Puma worker latency due to slow DB queries, which increased load on the DB; in the end the interim solution was to scale up the DB until optimization could be performed, and the HPA saved us from an actual outage. In the event pods can't connect to the DB, I would rather set up alerting for that type of event than have external-metric-based HPA not work properly.

@omerap12
Member

Yeah, exactly — since the HPA doesn’t really know why pods are unready, scaling based on them could cause more problems, especially if the issue isn’t because of high load. I think it’s safer not to scale unready pods.
