apiserver OOM due to terminate all watchers for a specified crd cacher #123074

Open
likakuli opened this issue Feb 1, 2024 · 7 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@likakuli
Member

likakuli commented Feb 1, 2024

What happened?

When a CRD's spec changes, all watchers connected to that CRD's cacher are terminated. Each informer then re-establishes its watch from the last RV it saw. Because that RV is almost always lower than the global RV after the cacher is recreated, the cacher returns a "too old resource version" error, and the informer falls back to a relist, which may drive kube-apiserver to OOM.
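The failure mode above can be sketched with a minimal, illustrative simulation (the names `watchFrom`, `resync`, and `errExpired` are stand-ins, not the real client-go reflector or cacher code): once the recreated cache's oldest RV is above the informer's last-seen RV, the resume attempt gets a 410-style error and degrades to a full relist.

```go
package main

import (
	"errors"
	"fmt"
)

// errExpired stands in for the apiserver's "too old resource version"
// (HTTP 410 Gone) response.
var errExpired = errors.New("too old resource version")

// watchFrom simulates the cacher: it only accepts watches starting at or
// above the oldest RV still held in the watch cache.
func watchFrom(rv, oldestCachedRV int64) error {
	if rv < oldestCachedRV {
		return errExpired
	}
	return nil
}

// resync mimics the reflector's reaction after its watch is terminated:
// retry the watch from the last-seen RV, and fall back to a full relist
// (the expensive path that can OOM the apiserver at scale) on expiry.
func resync(lastSeenRV, oldestCachedRV int64) string {
	if err := watchFrom(lastSeenRV, oldestCachedRV); errors.Is(err, errExpired) {
		return "relist" // LIST + re-watch from a fresh RV
	}
	return "watch resumed"
}

func main() {
	// After the CRD cacher is recreated, its cache starts at a newer RV
	// than what informers last saw, so every informer relists at once.
	fmt.Println(resync(100, 250)) // relist
	fmt.Println(resync(300, 250)) // watch resumed
}
```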

What did you expect to happen?

  • Send the latest RV to each watcher before terminating it.
  • Terminate watchers gracefully when the cache is recreated, just as kube-apiserver does on shutdown, to avoid thundering-herd relists.
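The first suggestion amounts to emitting a final bookmark event before closing the watch. A minimal sketch, assuming a simplified `event` type (the real code would use `watch.Event` with type `Bookmark`):

```go
package main

import "fmt"

// event is a simplified watch event; a "BOOKMARK" carries only a
// resource version. These types are illustrative, not the real
// watch.Event API.
type event struct {
	typ string
	rv  int64
}

// closeWithBookmark sketches the proposal: before closing a watcher's
// result channel, emit a bookmark carrying the cacher's current RV so
// the client can re-watch from there instead of relisting.
func closeWithBookmark(ch chan event, currentRV int64) {
	ch <- event{typ: "BOOKMARK", rv: currentRV}
	close(ch)
}

func main() {
	ch := make(chan event, 1)
	closeWithBookmark(ch, 250)
	for ev := range ch {
		fmt.Printf("%s rv=%d\n", ev.typ, ev.rv) // BOOKMARK rv=250
	}
}
```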

How can we reproduce it (as minimally and precisely as possible)?

  • create a crd and create some cr resources
  • update crd spec
  • then there will be some relist request

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
master

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@likakuli likakuli added the kind/bug Categorizes issue or PR as related to a bug. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 1, 2024
@likakuli
Member Author

likakuli commented Feb 1, 2024

@liggitt @tkashem I found some related issues and PRs raised before, so I want to request your help, thanks.

@Ritikaa96
Contributor

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2024
@liggitt
Member

liggitt commented Feb 1, 2024

This isn't a bug, but is something that could be improved.

/remove-kind bug
/kind feature

To know the RV to send as a final bookmark, we would have to wait for the new cacher to be instantiated and synced before terminating the old watchers. That complicates the handoff between the old handler and new handler.

We also have to account for writes to the custom resource itself that are happening while the CRD is updated / the new cacher is instantiated / the new handler is set up / the old handler is terminated so that the RV we issue doesn't ever result in the re-established watcher missing results.
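The ordering constraint described above can be sketched as follows. Everything here is illustrative (the real logic lives in the CRD handler and cacher): the key point is that the old watchers must not be closed until the new cacher has synced, so the RV in the final bookmark covers any writes that landed during the handoff.

```go
package main

import "fmt"

// cacher is a stand-in for the recreated watch cache; syncedRV is the RV
// it has caught up to after instantiation.
type cacher struct{ syncedRV int64 }

// waitUntilSynced pretends to block until the new cacher has synced and
// returns the RV it reached.
func (c *cacher) waitUntilSynced() int64 { return c.syncedRV }

// handoff sketches the sequencing: first wait for the new cacher to
// sync, and only then terminate old watchers, handing each the synced
// RV as a final bookmark. Writes that happened while both cachers
// existed are covered because finalRV reflects the new cacher's state,
// so a re-watch from finalRV cannot miss events.
func handoff(oldWatcherRVs []int64, newCacher *cacher) []string {
	finalRV := newCacher.waitUntilSynced()
	msgs := make([]string, 0, len(oldWatcherRVs))
	for _, rv := range oldWatcherRVs {
		msgs = append(msgs, fmt.Sprintf("watcher at rv=%d -> bookmark rv=%d, close", rv, finalRV))
	}
	return msgs
}

func main() {
	for _, m := range handoff([]int64{100, 120}, &cacher{syncedRV: 250}) {
		fmt.Println(m)
	}
}
```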

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 1, 2024
@likakuli
Member Author

likakuli commented Feb 2, 2024

Sorry, it's a feature

A complete solution to this issue would be quite complex. As a temporary measure, could we send the latest resource version (RV) from the cacher back to the client before closing? Additionally, could we rate-limit the terminations, similar to how kube-apiserver handles its watchers during shutdown, to reduce the probability of encountering this problem?

@alexzielenski
Member

/triage accepted
/cc @sttts @richabanker

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 20, 2024
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 19, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2025
Projects
None yet
Development

No branches or pull requests

6 participants