[Flaking test] [sig-node] Kubernetes e2e suite.[It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails #129800
Comments
/triage accepted |
Hi, thanks for looking at it. Do we know if this will block the v1.33.0-alpha.1 cut, which is scheduled for Tuesday, 4th February UTC? |
I am not sure about the reason it failed before, but it seems ok now. |
failed again just very recently: So the issue is that the
Chances are it is some sort of runtime issue. Unfortunately, the artifacts do not have any useful logs from the runtime. This failed job has some containerd files: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1882490406982127616 Error:
Tracking the container ID in the containerd logs:
So the exit code is received after the container was terminated. |
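For illustration (not from the original thread), here is a minimal sketch of how one could watch containerd task-exit events directly on a node to correlate exit codes with StopContainer requests, like the events quoted above. It assumes containerd 1.7's Go client, the default CRI socket path, and the "k8s.io" namespace; import paths (in particular typeurl) differ between containerd versions.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	apievents "github.com/containerd/containerd/api/events"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/typeurl/v2"
)

func main() {
	// Connect to the node-local containerd socket (default path assumed).
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// CRI-managed (Kubernetes) containers live in the "k8s.io" namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Subscribe only to task exit events, the same events seen in the logs above.
	ch, errs := client.EventService().Subscribe(ctx, `topic=="/tasks/exit"`)
	for {
		select {
		case env := <-ch:
			ev, err := typeurl.UnmarshalAny(env.Event)
			if err != nil {
				log.Printf("unmarshal event: %v", err)
				continue
			}
			if exit, ok := ev.(*apievents.TaskExit); ok {
				fmt.Printf("container %s exited with status %d\n",
					exit.ContainerID, exit.ExitStatus)
			}
		case err := <-errs:
			log.Fatal(err)
		}
	}
}
```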
This is an updated comment; the initial version had the wrong logs in it. I was trying to repro by creating and deleting this pod:
I am either getting exit code |
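The exact pod spec from the comment above is not shown here. For reference, a hypothetical client-go sketch of this kind of repro (pod with an init container that exits 1, created, inspected, then deleted) might look like the following; the pod name prefix, namespace, and image are assumptions, not taken from the original comment.

```go
package main

import (
	"context"
	"log"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (assumes running outside the cluster).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "init-fail-repro-"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			// Init container that should exit 1; the flake observes exit code 2 instead.
			InitContainers: []v1.Container{{
				Name:    "fail",
				Image:   "busybox",
				Command: []string{"/bin/false"},
			}},
			// App container that must never be reported as started.
			Containers: []v1.Container{{
				Name:    "blocked",
				Image:   "busybox",
				Command: []string{"sleep", "3600"},
			}},
		},
	}

	ctx := context.Background()
	created, err := cs.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Give the kubelet time to run (and fail) the init container, then inspect and delete.
	time.Sleep(10 * time.Second)
	got, err := cs.CoreV1().Pods("default").Get(ctx, created.Name, metav1.GetOptions{})
	if err == nil && len(got.Status.InitContainerStatuses) > 0 {
		if t := got.Status.InitContainerStatuses[0].State.Terminated; t != nil {
			log.Printf("init container exit code: %d", t.ExitCode)
		}
	}
	if err := cs.CoreV1().Pods("default").Delete(ctx, created.Name, metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
}
```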
@SergeyKanzhelev The logs for the repro in #129800 (comment) look to be from some other container ( |
oh, right, too many logs. Removing this comment; will try to repro again. However, the logs from the test execution seem to be correct |
It looks like this test case first failed with exit code 2 on November 27. |
I cannot find any relevant change in either k/k or test-infra. Just a minor image bump. |
I wonder if |
Hi folks, thanks for the help. Is there any input on whether this is a blocker for the |
No more failures showing up in testgrid; this appears to be resolved.
Do we have any idea what made it fail and what fixed it? |
AFAIK, no action has been taken from our side. The last observation was on February 14, and it does not seem to have occurred since then. I'll check if any changes that might have had an impact were made after February 14. |
I've looked into the changes made since 2/14, but I didn't find anything relevant. Also, the same error occurred again yesterday, so it seems that it hasn't actually been fixed. |
I have created #130383 to help narrow down the cause based on the assumption #129800 (comment). |
@toVersus would you like to be assigned this? |
Yes, if there isn’t anyone more suitable, I’ll continue handling it. |
/assign |
Even after #130383 was merged, the exit code 2 error still occurs. It doesn't seem to be caused by the base OS of the container image. As Sergey pointed out in #129800 (comment), the fact that the container runtime returns exit code 2 when terminating the container remains unchanged.
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.008343558Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:pod-terminate-status-1-13,Uid:f63e3be4-84b0-47db-a543-9a919035ab56,Namespace:pods-2803,Attempt:0,} returns sandbox id \"da02ff327c7d452d6fc6944385b17358a802b81a7fad48cc51526a8eeb4cfacf\""
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.073735382Z" level=info msg="CreateContainer within sandbox \"da02ff327c7d452d6fc6944385b17358a802b81a7fad48cc51526a8eeb4cfacf\" for &ContainerMetadata{Name:fail,Attempt:0,} returns container id \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\""
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.299036490Z" level=info msg="StartContainer for \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" returns successfully"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.320168675Z" level=info msg="StopContainer for \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" with timeout 2 (s)"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.320928752Z" level=info msg="Stop container \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" with signal terminated"
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.395960909Z" level=info msg="received exit event container_id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" pid:193518 exit_status:2 exited_at:{seconds:1743214912 nanos:395373051}"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.396888084Z" level=info msg="TaskExit event in podsandbox handler container_id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" pid:193518 exit_status:2 exited_at:{seconds:1743214912 nanos:395373051}"

We tested a container that exits with code 1 both as a regular container and as an init container, and only the init container sometimes exits with code 2. I don't understand why this only happens in the init container.

kubernetes/test/e2e/node/pods.go Lines 213 to 228 in beef784
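For context, a simplified sketch of the check described above (not the actual code at test/e2e/node/pods.go lines 213 to 228; the function and package names here are hypothetical): the init container is expected to terminate with exit code 1, and the app container must never report having started, while the flake observes exit code 2 instead.

```go
package check

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// verifyInitFailureNeverStartsApp returns an error if the init container's
// exit code is not the expected 1, or if any app container is reported as
// started after the init container terminated with a failure.
func verifyInitFailureNeverStartsApp(pod *v1.Pod) error {
	for _, init := range pod.Status.InitContainerStatuses {
		t := init.State.Terminated
		if t == nil {
			continue // init container has not terminated yet
		}
		// The test expects exit code 1; the flake observes exit code 2 here.
		if t.ExitCode != 1 {
			return fmt.Errorf("init container %q terminated with unexpected exit code %d",
				init.Name, t.ExitCode)
		}
		for _, app := range pod.Status.ContainerStatuses {
			if app.State.Running != nil || app.State.Terminated != nil ||
				(app.Started != nil && *app.Started) {
				return fmt.Errorf("container %q reported start after init container failed", app.Name)
			}
		}
	}
	return nil
}
```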
|
Hi folks, thanks for your support and attention on this issue!
/milestone 1.34
@wendy-ha18: The provided milestone is not valid for this repository. Milestones in this repository: [ Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/milestone v1.34 |
Most recently failed on 19th May with the same error message as in the description:
@toVersus ^^ |
Which jobs are flaking?
master-blocking
Which tests are flaking?
Kubernetes e2e suite.[It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails
Prow
Triage
Since when has it been flaking?
1/15/2025, 1:23:19 PM
1/20/2025, 7:25:30 PM
1/21/2025, 7:26:40 PM
1/22/2025, 1:24:08 AM
1/23/2025, 3:07:44 PM
Testgrid link
https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd
Reason for failure (if possible)
Anything else we need to know?
N/A
Relevant SIG(s)
/sig node
cc: @kubernetes/release-team-release-signal