Still seeing the issue for endpoints staying out of sync #126578
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig network
@kedar700 From the logs, the endpoint controller was retrying test1/test-qa in the expected way. It had retried 5 times and the next retry should come after 320ms, which should be long enough for the informer cache to get the updated endpoints. If there are no more logs related to the failure, I think it may have synced successfully. The Python script used to detect the failure seems to capture the diff between Endpoints and EndpointSlices at a single moment, but a diff could appear whenever any Pod changes in the cluster, and technically the script is comparing the Endpoints of one moment with the EndpointSlices of another moment. So did you see the endpoints still not synced after the endpoint controller stopped retrying?
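For reference, a minimal Go sketch of the kind of per-item exponential backoff that produces a delay like the 320ms seen in the logs. The 10ms base and the cap are assumptions chosen to reproduce that figure, not necessarily the controller's exact settings:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: each failed sync doubles the requeue
	// delay for that key. With an assumed 10ms base, after five failures
	// the next requeue lands at 10ms * 2^5 = 320ms.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(10*time.Millisecond, 1000*time.Second)
	for retry := 1; retry <= 6; retry++ {
		fmt.Printf("retry %d would be requeued after %v\n", retry, limiter.When("test1/test-qa"))
	}
}
```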
Yeah, I still see it, and I see similar messages continuously. Let me post some additional log messages after the one I posted above.
Observed the same issue in 1.29 and 1.30. |
/assign @tnqn Are you able to triage this since you are already looking at it?
Yes, I can get that over by EOD today.
Sure, I'm waiting for more logs and the content of the endpoints to understand how it happened. Currently there are only retry logs, which are expected in some cases.
@tnqn They have a mutatingWebhook on the Endpoints object that automatically injects a label, let's say
I think we should fix this by checking the resulting resourceVersion from the update request, and only marking the old resourceVersion as stale when a new resourceVersion is generated. Not sure whether this is the same case others in this thread encountered :D
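A minimal Go sketch of that proposed check, assuming hypothetical names (markStale and updateAndTrack are stand-ins, not the controller's actual API):

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markStale is a hypothetical stand-in for the controller's stale-cache
// bookkeeping; the real tracker API may look different.
func markStale(eps *v1.Endpoints) {}

// updateAndTrack sketches the proposed fix: only mark the cached copy as
// stale when the API server actually generated a new resourceVersion. If a
// mutating webhook rewrote the update into a no-op, the server returns the
// same resourceVersion and the informer cache is still authoritative.
func updateAndTrack(ctx context.Context, cs kubernetes.Interface, current, desired *v1.Endpoints) error {
	updated, err := cs.CoreV1().Endpoints(desired.Namespace).Update(ctx, desired, metav1.UpdateOptions{})
	if err != nil {
		return err
	}
	if updated.ResourceVersion != current.ResourceVersion {
		markStale(current)
	}
	return nil
}
```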
@M00nF1sh Thanks for explaining the use case and the analysis. I analyzed the code after the issue was opened and suspected the only possibility was that the returned resourceVersion somehow equals the one in the request, but I didn't know how that could happen after the controller has checked that the desired endpoints and the current endpoints are different: kubernetes/pkg/controller/endpoint/endpoints_controller.go, lines 495 to 503 at 95b3fe9
Now I understand this could happen when the endpoints appear different on the client side but are mutated to be the same on the server side. Do you plan to create a PR, since you have proposed a fix? @kedar700 @mengqiy Could you confirm whether this is also your case, i.e. there is a mutatingWebhook for Endpoints objects that could make the endpoints different from what the endpoints controller expects?
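For context, a paraphrased Go sketch of the client-side "skip the update" condition being discussed (not the verbatim source; the real check covers more fields):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// endpointsUpToDate paraphrases the controller's client-side check: the
// desired subsets and service labels are compared against the cached
// Endpoints object. A webhook that injects a label server-side makes this
// keep returning false (the client thinks an update is needed), while the
// server keeps treating the resulting update as a no-op, so the
// resourceVersion never changes.
func endpointsUpToDate(current *v1.Endpoints, desiredSubsets []v1.EndpointSubset, svc *v1.Service) bool {
	return apiequality.Semantic.DeepEqual(current.Subsets, desiredSubsets) &&
		apiequality.Semantic.DeepEqual(current.Labels, svc.Labels)
}
```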
@tnqn |
/assign @M00nF1sh |
What happened?
Issue #125638 was supposed to have fixed the problem where endpoints stay out of sync.
I also wrote a small script that reports the Endpoints that are out of sync compared to the EndpointSlices.
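(The reporter's actual script was in Python and isn't shown here; below is a rough Go sketch of the same comparison idea, subject to the point-in-time caveat raised in the comments above.)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	eps, err := cs.CoreV1().Endpoints("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	slices, err := cs.DiscoveryV1().EndpointSlices("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Union of EndpointSlice addresses, keyed by namespace/service.
	sliceAddrs := map[string]map[string]bool{}
	for _, s := range slices.Items {
		key := s.Namespace + "/" + s.Labels["kubernetes.io/service-name"]
		if sliceAddrs[key] == nil {
			sliceAddrs[key] = map[string]bool{}
		}
		for _, ep := range s.Endpoints {
			for _, addr := range ep.Addresses {
				sliceAddrs[key][addr] = true
			}
		}
	}

	// Report Endpoints addresses missing from the matching slices. The two
	// lists were read at different moments, so a transient diff here does
	// not by itself prove a stuck sync.
	for _, e := range eps.Items {
		key := e.Namespace + "/" + e.Name
		for _, subset := range e.Subsets {
			for _, addr := range subset.Addresses {
				if !sliceAddrs[key][addr.IP] {
					fmt.Printf("%s: %s present in Endpoints but missing from EndpointSlices\n", key, addr.IP)
				}
			}
		}
	}
}
```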
What did you expect to happen?
I expect the endpoints to eventually sync and reflect the most up-to-date information.
How can we reproduce it (as minimally and precisely as possible)?
I just deployed the newer patch to our cluster, and it has resulted in endpoints never being updated once they go out of sync.
Anything else we need to know?
No response
Kubernetes version
Client Version: v1.29.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7
Cloud provider
OS version
almalinux-9
Install tools
Container runtime (CRI) and version (if applicable)
cri-o
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response