apiserver OOM due to terminate all watchers for a specified crd cacher #123074

Open
likakuli opened this issue Feb 1, 2024 · 7 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@likakuli
Member

likakuli commented Feb 1, 2024

What happened?

When a CRD's spec changes, all watchers connected to that CRD's cacher are terminated. Each informer then re-establishes its watch from the last RV it saw. Because that RV is almost always lower than the global RV after the cacher is recreated, the cacher returns a "too old resource version" error, and the informer falls back to a relist, which may drive kube-apiserver to OOM.
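The failure mode above can be sketched with a minimal, illustrative simulation (the names `watchFrom`, `resync`, and `errExpired` are stand-ins, not the real client-go reflector or cacher code): once the recreated cache's oldest RV is above the informer's last-seen RV, the resume attempt gets a 410-style error and degrades to a full relist.

```go
package main

import (
	"errors"
	"fmt"
)

// errExpired stands in for the apiserver's "too old resource version"
// (HTTP 410 Gone) response.
var errExpired = errors.New("too old resource version")

// watchFrom simulates the cacher: it only accepts watches starting at or
// above the oldest RV still held in the watch cache.
func watchFrom(rv, oldestCachedRV int64) error {
	if rv < oldestCachedRV {
		return errExpired
	}
	return nil
}

// resync mimics the reflector's reaction after its watch is terminated:
// retry the watch from the last-seen RV, and fall back to a full relist
// (the expensive path that can OOM the apiserver at scale) on expiry.
func resync(lastSeenRV, oldestCachedRV int64) string {
	if err := watchFrom(lastSeenRV, oldestCachedRV); errors.Is(err, errExpired) {
		return "relist" // LIST + re-watch from a fresh RV
	}
	return "watch resumed"
}

func main() {
	// After the CRD cacher is recreated, its cache starts at a newer RV
	// than what informers last saw, so every informer relists at once.
	fmt.Println(resync(100, 250)) // relist
	fmt.Println(resync(300, 250)) // watch resumed
}
```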

What did you expect to happen?

  • Send the latest RV to each watcher before terminating it.
  • Terminate watchers gracefully when the cache is recreated, just as kube-apiserver does on shutdown, to avoid thundering-herd relists.
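The first suggestion amounts to emitting a final bookmark event before closing the watch. A minimal sketch, assuming a simplified `event` type (the real code would use `watch.Event` with type `Bookmark`):

```go
package main

import "fmt"

// event is a simplified watch event; a "BOOKMARK" carries only a
// resource version. These types are illustrative, not the real
// watch.Event API.
type event struct {
	typ string
	rv  int64
}

// closeWithBookmark sketches the proposal: before closing a watcher's
// result channel, emit a bookmark carrying the cacher's current RV so
// the client can re-watch from there instead of relisting.
func closeWithBookmark(ch chan event, currentRV int64) {
	ch <- event{typ: "BOOKMARK", rv: currentRV}
	close(ch)
}

func main() {
	ch := make(chan event, 1)
	closeWithBookmark(ch, 250)
	for ev := range ch {
		fmt.Printf("%s rv=%d\n", ev.typ, ev.rv) // BOOKMARK rv=250
	}
}
```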

How can we reproduce it (as minimally and precisely as possible)?

  • create a crd and create some cr resources
  • update crd spec
  • then there will be some relist request

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
master

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@likakuli likakuli added the kind/bug Categorizes issue or PR as related to a bug. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 1, 2024
@likakuli
Member Author

likakuli commented Feb 1, 2024

@liggitt @tkashem I found some related issues and PRs raised before, so I want to request your help, thanks.

@Ritikaa96
Contributor

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2024
@liggitt
Member

liggitt commented Feb 1, 2024

This isn't a bug, but is something that could be improved.

/remove-kind bug
/kind feature

To know the RV to send as a final bookmark, we would have to wait for the new cacher to be instantiated and synced before terminating the old watchers. That complicates the handoff between the old handler and new handler.

We also have to account for writes to the custom resource itself that are happening while the CRD is updated / the new cacher is instantiated / the new handler is set up / the old handler is terminated so that the RV we issue doesn't ever result in the re-established watcher missing results.
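The ordering constraint described above can be sketched as follows. Everything here is illustrative (the real logic lives in the CRD handler and cacher): the key point is that the old watchers must not be closed until the new cacher has synced, so the RV in the final bookmark covers any writes that landed during the handoff.

```go
package main

import "fmt"

// cacher is a stand-in for the recreated watch cache; syncedRV is the RV
// it has caught up to after instantiation.
type cacher struct{ syncedRV int64 }

// waitUntilSynced pretends to block until the new cacher has synced and
// returns the RV it reached.
func (c *cacher) waitUntilSynced() int64 { return c.syncedRV }

// handoff sketches the sequencing: first wait for the new cacher to
// sync, and only then terminate old watchers, handing each the synced
// RV as a final bookmark. Writes that happened while both cachers
// existed are covered because finalRV reflects the new cacher's state,
// so a re-watch from finalRV cannot miss events.
func handoff(oldWatcherRVs []int64, newCacher *cacher) []string {
	finalRV := newCacher.waitUntilSynced()
	msgs := make([]string, 0, len(oldWatcherRVs))
	for _, rv := range oldWatcherRVs {
		msgs = append(msgs, fmt.Sprintf("watcher at rv=%d -> bookmark rv=%d, close", rv, finalRV))
	}
	return msgs
}

func main() {
	for _, m := range handoff([]int64{100, 120}, &cacher{syncedRV: 250}) {
		fmt.Println(m)
	}
}
```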

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 1, 2024
@likakuli
Member Author

likakuli commented Feb 2, 2024

Sorry, it's a feature

A complete solution to this issue would be quite complex. As a temporary measure, could we send the latest resource version (RV) from the cacher back to the client before closing? Additionally, could we rate-limit the terminations, similar to how kube-apiserver handles its watchers during shutdown, to reduce the probability of encountering this problem?

@alexzielenski
Member

/triage accepted
/cc @sttts @richabanker

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 20, 2024
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 19, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2025
Projects
None yet
Development

No branches or pull requests

6 participants