[Flaking test] [sig-node] Kubernetes e2e suite.[It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails #129800


Open
elieser1101 opened this issue Jan 24, 2025 · 26 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@elieser1101
Contributor

Which jobs are flaking?

master-blocking

  • gce-ubuntu-master-containerd

Which tests are flaking?

Kubernetes e2e suite.[It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails
Prow
Triage

Since when has it been flaking?

1/15/2025, 1:23:19 PM
1/20/2025, 7:25:30 PM
1/21/2025, 7:26:40 PM
1/22/2025, 1:24:08 AM
1/23/2025, 3:07:44 PM

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd

Reason for failure (if possible)

{ failed [FAILED] 1 errors:
pod pod-terminate-status-2-10 on node bootstrap-e2e-minion-group-5d4d container unexpected exit code 2: start=2025-01-23 18:26:43 +0000 UTC end=2025-01-23 18:26:44 +0000 UTC reason=Error message=
In [It] at: k8s.io/kubernetes/test/e2e/node/pods.go:548 @ 01/23/25 18:27:35.688
}

Anything else we need to know?

N/A

Relevant SIG(s)

/sig node
cc: @kubernetes/release-team-release-signal

@elieser1101 elieser1101 added the kind/flake Categorizes issue or PR as related to a flaky test. label Jan 24, 2025
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 24, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 24, 2025
@SergeyKanzhelev
Member

/cc @gjkim42

@gjkim42 do you have any idea why it might fail?

@SergeyKanzhelev
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 29, 2025
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - To do in SIG Node CI/Test Board Jan 29, 2025
@elieser1101
Contributor Author

Hi, thanks for looking at it, do we know if this will block the v1.33.0-alpha.1 cut, which is scheduled for Tuesday, 4th February UTC?
@SergeyKanzhelev @haircommander

@gjkim42
Member

gjkim42 commented Jan 31, 2025

I am not sure why it failed before, but it seems OK now.

@SergeyKanzhelev
Member

Failed again just very recently:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-ec2-eks-al2023-arm64/1884549242454806528

So the issue is that the busybox container running the single command /bin/false returned exit code 2:

      "state": {
        "terminated": {
          "exitCode": 2,
          "reason": "Error",
          "startedAt": "2025-01-29T10:50:48Z",
          "finishedAt": "2025-01-29T10:50:48Z",
          "containerID": "containerd://b743082c19402aa1607f0964e223f5251819eb8073adbbf1301e5233f48b6fdf"
        }
      },

Chances are it is some sort of runtime issue. Unfortunately, the artifacts do not contain any useful logs from the runtime.

This failed job has some containerd files: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1882490406982127616

Error:

I0123 18:26:52.195399 10734 pods.go:771] pod pod-terminate-status-2-10 on node bootstrap-e2e-minion-group-5d4d had incorrect final status:

Tracking the containerid in containerd logs:

Jan 23 18:26:37.369510 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:37.369451698Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:pod-terminate-status-2-10,Uid:ae7ce95c-508a-4d74-96c5-0873eea8dea6,Namespace:pods-5749,Attempt:0,} returns sandbox id \"92f1adaeee87dd85f21b76fb015a9b40637c7a268d96ddc8f065a3cc49bb0486\""


Jan 23 18:26:43.137588 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.132060695Z" level=info msg="CreateContainer within sandbox \"92f1adaeee87dd85f21b76fb015a9b40637c7a268d96ddc8f065a3cc49bb0486\" for &ContainerMetadata{Name:fail,Attempt:0,} returns container id \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\""


Jan 23 18:26:43.143552 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.143380583Z" level=info msg="StartContainer for \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\""
Jan 23 18:26:43.217524 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.212512789Z" level=info msg="connecting to shim ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661" address="unix:///run/containerd/s/7cc1cc58cd14117a1051797b7b5fb87735284660b78c2554d76ab9387e8c52c2" protocol=ttrpc version=3
Jan 23 18:26:43.842105 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.841922353Z" level=info msg="StartContainer for \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" returns successfully"
Jan 23 18:26:44.585175 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:44.419958299Z" level=info msg="StopContainer for \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" with timeout 2 (s)"
Jan 23 18:26:44.585175 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:44.420704974Z" level=info msg="Stop container \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" with signal terminated"
Jan 23 18:26:44.851783 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:44.842406804Z" level=info msg="received exit event container_id:\"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" id:\"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" pid:96189 exit_status:2 exited_at:{seconds:1737656804 nanos:831734933}"

So the exit code is received after the container was terminated.
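
For reference, a minimal sketch of how the lifecycle events above can be pulled out of the archived containerd log, assuming the log file has been downloaded locally (the file name and container ID are just the ones from this particular run):

# Filter the archived containerd log for the lifecycle events of one container ID
grep 'ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661' containerd.log \
  | grep -E 'CreateContainer|StartContainer|StopContainer|Stop container|exit event'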

@SergeyKanzhelev
Member

SergeyKanzhelev commented Jan 31, 2025

This is an updated comment; the initial version contained the wrong logs.

I was trying to repro by creating and deleting this pod:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: troubleshooting
spec:
  volumes:
  - name: rootfs
    hostPath:
      path: /
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.1
  initContainers:
  - name: fail
    image: registry.k8s.io/e2e-test-images/busybox@sha256:a9155b13325b2abef48e71de77bb8ac015412a566829f621d06bfae5c699b1b9
    command: ["/bin/sh", "-c", "sleep 25 && /bin/false"]
EOF

I am getting either exit code 1 or 137, never 2 so far.
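
For context, a minimal sketch of how these exit codes map to signals under the usual 128+signal convention, plus one way to read the recorded exit code of the repro pod's init container (the jsonpath query is an assumption about how to inspect it, not part of the test itself):

# 1   -> /bin/false ran to completion
# 137 -> 128 + 9  (killed with SIGKILL after the termination grace period)
# 143 -> 128 + 15 (terminated by SIGTERM)
# 2   -> normally a shell/usage error, which is why seeing it here is suspicious
kubectl get pod troubleshooting \
  -o jsonpath='{.status.initContainerStatuses[0].state.terminated.exitCode}{"\n"}'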

@samuelkarp
Member

@SergeyKanzhelev The logs for the repro in #129800 (comment) look to be from some other container (&ContainerMetadata{Name:konnectivity-agent-metrics-collector,Attempt:0,}), not pause in the troubleshooting pod.

@SergeyKanzhelev
Member

SergeyKanzhelev commented Jan 31, 2025

Oh, right, too many logs. I'm removing this comment and will try to repro again. However, the logs from the test execution seem to be correct.

@toVersus
Contributor

toVersus commented Feb 1, 2025

@SergeyKanzhelev
Member

It looks like this test case first failed with exit code 2 on November 27. https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-29&test=ods%20Extended%20Pod%20Container%20Status%20should%20never%20report%20container%20start%20when%20an%20init%20container%20fails&xjob=e2e-kops#363c89d263aaa7250dc8

I cannot find any relevant change in either k/k or test-infra, just a minor image bump.

@SergeyKanzhelev
Member

I wonder if busybox may have some logic that returns exit code 2 on SIGTERM. That seems more likely than a containerd issue.
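
A minimal sketch of how this hypothesis could be checked locally, assuming Docker is available and using the same busybox digest as the test (the container name and timings are arbitrary):

# Run the same shell command shape the repro uses, send SIGTERM, and inspect the
# exit code the runtime records. 143 (128+15) would mean the shell simply died
# on SIGTERM; 2 would point at busybox's own signal handling.
docker run -d --name busybox-sigterm \
  registry.k8s.io/e2e-test-images/busybox@sha256:a9155b13325b2abef48e71de77bb8ac015412a566829f621d06bfae5c699b1b9 \
  /bin/sh -c 'sleep 30 && /bin/false'
sleep 1
docker kill --signal=TERM busybox-sigterm
sleep 1
docker inspect --format '{{.State.ExitCode}}' busybox-sigterm
docker rm -f busybox-sigterm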

@elieser1101
Contributor Author

Hi folks, thanks for the help. Is there any input on whether this is a blocker for the v1.33.0-alpha.1 cut happening tomorrow, Tuesday, Feb 4th?
cc: @SergeyKanzhelev @samuelkarp @haircommander

@stmcginnis stmcginnis moved this from FLAKY to PASSING in CI Signal (SIG Release / Release Team) Feb 20, 2025
@stmcginnis
Contributor

No more failures are showing up in Testgrid; this appears to be resolved.

@gjkim42
Member

gjkim42 commented Feb 21, 2025

Do we have any idea what made it fail and what fixed it?

@toVersus
Contributor

AFAIK, no action has been taken from our side. The last observed failure was on February 14, and it does not seem to have occurred since. I'll check whether any changes that might have had an impact were made after February 14.
https://storage.googleapis.com/k8s-triage/index.html?date=2025-02-20&test=ods%20Extended%20Pod%20Container%20Status%20should%20never%20report%20container%20start%20when%20an%20init%20container%20fails&xjob=e2e-kops#363c89d263aaa7250dc8

@toVersus
Contributor

I've looked into the changes made since 2/14, but I didn't find anything relevant. Also, the same error occurred again yesterday, so it seems that it hasn't actually been fixed.
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-containerd-e2e-ubuntu-gce/1893575816156024832

@toVersus
Contributor

I have created #130383 to help narrow down the cause, based on the assumption in #129800 (comment).

@wendy-ha18 wendy-ha18 moved this from PASSING to FLAKY in CI Signal (SIG Release / Release Team) Feb 25, 2025
@haircommander
Contributor

@toVersus would you like to be assigned this?

@toVersus
Contributor

Yes, if there isn’t anyone more suitable, I’ll continue handling it.

@toVersus
Contributor

/assign

@toVersus
Contributor

toVersus commented Apr 1, 2025

Even after #130383 was merged, the exit code 2 error still occurs. It doesn't seem to be caused by the base OS of the container image.
https://storage.googleapis.com/k8s-triage/index.html?date=2025-03-31&test=ods%20Extended%20Pod%20Container%20Status%20should%20never%20report%20container%20start%20when%20an%20init%20container%20fails&xjob=e2e-kops#363c89d263aaa7250dc8

As Sergey pointed out in #129800 (comment), the fact that the container runtime returns exit code 2 when the container is terminated remains unchanged.
https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kind-beta-features/1905801627755876352/artifacts/kind-worker/containerd.log

Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.008343558Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:pod-terminate-status-1-13,Uid:f63e3be4-84b0-47db-a543-9a919035ab56,Namespace:pods-2803,Attempt:0,} returns sandbox id \"da02ff327c7d452d6fc6944385b17358a802b81a7fad48cc51526a8eeb4cfacf\""
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.073735382Z" level=info msg="CreateContainer within sandbox \"da02ff327c7d452d6fc6944385b17358a802b81a7fad48cc51526a8eeb4cfacf\" for &ContainerMetadata{Name:fail,Attempt:0,} returns container id \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\""
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.299036490Z" level=info msg="StartContainer for \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" returns successfully"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.320168675Z" level=info msg="StopContainer for \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" with timeout 2 (s)"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.320928752Z" level=info msg="Stop container \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" with signal terminated"
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.395960909Z" level=info msg="received exit event container_id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  pid:193518  exit_status:2  exited_at:{seconds:1743214912  nanos:395373051}"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.396888084Z" level=info msg="TaskExit event in podsandbox handler container_id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  pid:193518  exit_status:2  exited_at:{seconds:1743214912  nanos:395373051}"

We tested a container that exits with code 1 both as a regular container and as an init container, and only the init container sometimes exits with code 2. I don't understand why this happens only for the init container. (A rough manual approximation of the init-container scenario is sketched after the test code below.)

ginkgo.It("should never report success for a pending container", func(ctx context.Context) {
ginkgo.By("creating pods that should always exit 1 and terminating the pod after a random delay")
createAndTestPodRepeatedly(ctx,
3, 15,
podFastDeleteScenario{client: podClient.PodInterface, delayMs: 2000},
podClient.PodInterface,
)
})
ginkgo.It("should never report container start when an init container fails", func(ctx context.Context) {
ginkgo.By("creating pods with an init container that always exit 1 and terminating the pod after a random delay")
createAndTestPodRepeatedly(ctx,
3, 15,
podFastDeleteScenario{client: podClient.PodInterface, delayMs: 2000, initContainer: true},
podClient.PodInterface,
)
})
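
A rough, manual approximation of what the init-container scenario races, not the exact spec createAndTestPodRepeatedly builds (the pod name, restartPolicy, and the fixed 2-second delay before deletion are assumptions made for illustration): create a pod whose init container exits 1, delete it about two seconds later while the init container may still be starting, and check which exit code ends up recorded.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: init-fail-race
spec:
  restartPolicy: Never
  initContainers:
  - name: fail
    image: registry.k8s.io/e2e-test-images/busybox@sha256:a9155b13325b2abef48e71de77bb8ac015412a566829f621d06bfae5c699b1b9
    command: ["/bin/false"]
  containers:
  - name: blocked
    image: registry.k8s.io/pause:3.1
EOF
sleep 2
kubectl delete pod init-fail-race --wait=false
kubectl get pod init-fail-race \
  -o jsonpath='{.status.initContainerStatuses[0].state.terminated.exitCode}{"\n"}'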

@wendy-ha18
Member

Hi folks, thanks for your support and attention on this issue!
The release cycle for v1.34 will start soon. Since this is still open, I will carry it over to the latest milestone.

/milestone 1.34

@k8s-ci-robot
Contributor

@wendy-ha18: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.26, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34]

Use /milestone clear to clear the milestone.

In response to this:

Hi folks, thanks for your support and attention on this issue!
The release cycle for v1.34 will start soon. Since this is still open, I will carry it over to the latest milestone.

/milestone 1.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@wendy-ha18
Member

/milestone v1.34

@Rajalakshmi-Girish
Contributor

Most recently failed on 19th May with the same error message as in the description:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1924496663112585216
Hence this issue is on the CI Signal Board for the v1.34 release too.

@Rajalakshmi-Girish
Contributor

@toVersus ^^
