[Flaking test] [sig-node] Kubernetes e2e suite.[It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails #129800


Open
elieser1101 opened this issue Jan 24, 2025 · 26 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@elieser1101
Contributor

Which jobs are flaking?

master-blocking

  • gce-ubuntu-master-containerd

Which tests are flaking?

Kubernetes e2e suite.[It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails
Prow
Triage

Since when has it been flaking?

1/15/2025, 1:23:19 PM
1/20/2025, 7:25:30 PM
1/21/2025, 7:26:40 PM
1/22/2025, 1:24:08 AM
1/23/2025, 3:07:44 PM

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd

Reason for failure (if possible)

{ failed [FAILED] 1 errors:
pod pod-terminate-status-2-10 on node bootstrap-e2e-minion-group-5d4d container unexpected exit code 2: start=2025-01-23 18:26:43 +0000 UTC end=2025-01-23 18:26:44 +0000 UTC reason=Error message=
In [It] at: k8s.io/kubernetes/test/e2e/node/pods.go:548 @ 01/23/25 18:27:35.688
}

Anything else we need to know?

N/A

Relevant SIG(s)

/sig node
cc: @kubernetes/release-team-release-signal

@elieser1101 elieser1101 added the kind/flake Categorizes issue or PR as related to a flaky test. label Jan 24, 2025
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 24, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 24, 2025
@SergeyKanzhelev
Member

/cc @gjkim42

@gjkim42 do you have any idea why it might fail?

@SergeyKanzhelev
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 29, 2025
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - To do in SIG Node CI/Test Board Jan 29, 2025
@elieser1101
Contributor Author

Hi, thanks for looking at it, do we know if this will block the v1.33.0-alpha.1 cut, which is scheduled for Tuesday, 4th February UTC?
@SergeyKanzhelev @haircommander

@gjkim42
Member

gjkim42 commented Jan 31, 2025

I am not sure why it failed before, but it seems OK now.

@SergeyKanzhelev
Member

Failed again just very recently:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-ec2-eks-al2023-arm64/1884549242454806528

So the issue is that the busybox container running the single command /bin/false returned exit code 2:

      "state": {
        "terminated": {
          "exitCode": 2,
          "reason": "Error",
          "startedAt": "2025-01-29T10:50:48Z",
          "finishedAt": "2025-01-29T10:50:48Z",
          "containerID": "containerd://b743082c19402aa1607f0964e223f5251819eb8073adbbf1301e5233f48b6fdf"
        }
      },

Chances are it is some sort of runtime issue. Unfortunately, the artifacts do not contain any useful logs from the runtime.

This failed job has some containerd files: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1882490406982127616

Error:

I0123 18:26:52.195399 10734 pods.go:771] pod pod-terminate-status-2-10 on node bootstrap-e2e-minion-group-5d4d had incorrect final status:

Tracking the containerid in containerd logs:

Jan 23 18:26:37.369510 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:37.369451698Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:pod-terminate-status-2-10,Uid:ae7ce95c-508a-4d74-96c5-0873eea8dea6,Namespace:pods-5749,Attempt:0,} returns sandbox id \"92f1adaeee87dd85f21b76fb015a9b40637c7a268d96ddc8f065a3cc49bb0486\""


Jan 23 18:26:43.137588 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.132060695Z" level=info msg="CreateContainer within sandbox \"92f1adaeee87dd85f21b76fb015a9b40637c7a268d96ddc8f065a3cc49bb0486\" for &ContainerMetadata{Name:fail,Attempt:0,} returns container id \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\""


Jan 23 18:26:43.143552 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.143380583Z" level=info msg="StartContainer for \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\""
Jan 23 18:26:43.217524 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.212512789Z" level=info msg="connecting to shim ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661" address="unix:///run/containerd/s/7cc1cc58cd14117a1051797b7b5fb87735284660b78c2554d76ab9387e8c52c2" protocol=ttrpc version=3
Jan 23 18:26:43.842105 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:43.841922353Z" level=info msg="StartContainer for \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" returns successfully"
Jan 23 18:26:44.585175 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:44.419958299Z" level=info msg="StopContainer for \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" with timeout 2 (s)"
Jan 23 18:26:44.585175 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:44.420704974Z" level=info msg="Stop container \"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" with signal terminated"
Jan 23 18:26:44.851783 bootstrap-e2e-minion-group-5d4d containerd[8911]: time="2025-01-23T18:26:44.842406804Z" level=info msg="received exit event container_id:\"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" id:\"ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661\" pid:96189 exit_status:2 exited_at:{seconds:1737656804 nanos:831734933}"

So the exit code is received after the container was terminated.
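
For reference, a minimal sketch of how the lifecycle events above can be pulled out of the archived containerd log, assuming the log file has been downloaded locally (the file name and container ID are just the ones from this particular run):

# Filter the archived containerd log for the lifecycle events of one container ID
grep 'ba44a9e5e18576df52e4df58af334cfdb7ca2929e0b726bae9c5f427ff60e661' containerd.log \
  | grep -E 'CreateContainer|StartContainer|StopContainer|Stop container|exit event'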

@SergeyKanzhelev
Member

SergeyKanzhelev commented Jan 31, 2025

This is an updated comment; the initial version contained the wrong logs.

I was trying to repro by creating and deleting this pod:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: troubleshooting
spec:
  volumes:
  - name: rootfs
    hostPath:
      path: /
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.1
  initContainers:
  - name: fail
    image: registry.k8s.io/e2e-test-images/busybox@sha256:a9155b13325b2abef48e71de77bb8ac015412a566829f621d06bfae5c699b1b9
    command: ["/bin/sh", "-c", "sleep 25 && /bin/false"]
EOF

I am getting either exit code 1 or 137, never 2 so far.
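
For context, a minimal sketch of how these exit codes map to signals under the usual 128+signal convention, plus one way to read the recorded exit code of the repro pod's init container (the jsonpath query is an assumption about how to inspect it, not part of the test itself):

# 1   -> /bin/false ran to completion
# 137 -> 128 + 9  (killed with SIGKILL after the termination grace period)
# 143 -> 128 + 15 (terminated by SIGTERM)
# 2   -> normally a shell/usage error, which is why seeing it here is suspicious
kubectl get pod troubleshooting \
  -o jsonpath='{.status.initContainerStatuses[0].state.terminated.exitCode}{"\n"}'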

@samuelkarp
Member

@SergeyKanzhelev The logs for the repro in #129800 (comment) look to be from some other container (&ContainerMetadata{Name:konnectivity-agent-metrics-collector,Attempt:0,}), not pause in the troubleshooting pod.

@SergeyKanzhelev
Member

SergeyKanzhelev commented Jan 31, 2025

Oh, right, too many logs. I'm removing this comment and will try to repro again. However, the logs from the test execution seem to be correct.

@toVersus
Contributor

toVersus commented Feb 1, 2025

@SergeyKanzhelev
Member

It looks like this test case first failed with exit code 2 on November 27. https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-29&test=ods%20Extended%20Pod%20Container%20Status%20should%20never%20report%20container%20start%20when%20an%20init%20container%20fails&xjob=e2e-kops#363c89d263aaa7250dc8

I cannot find any relevant change in either k/k or test-infra, just a minor image bump.

@SergeyKanzhelev
Member

I wonder if busybox may have some logic that returns exit code 2 on SIGTERM. That seems more likely than a containerd issue.
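
A minimal sketch of how this hypothesis could be checked locally, assuming Docker is available and using the same busybox digest as the test (the container name and timings are arbitrary):

# Run the same shell command shape the repro uses, send SIGTERM, and inspect the
# exit code the runtime records. 143 (128+15) would mean the shell simply died
# on SIGTERM; 2 would point at busybox's own signal handling.
docker run -d --name busybox-sigterm \
  registry.k8s.io/e2e-test-images/busybox@sha256:a9155b13325b2abef48e71de77bb8ac015412a566829f621d06bfae5c699b1b9 \
  /bin/sh -c 'sleep 30 && /bin/false'
sleep 1
docker kill --signal=TERM busybox-sigterm
sleep 1
docker inspect --format '{{.State.ExitCode}}' busybox-sigterm
docker rm -f busybox-sigterm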

@elieser1101
Contributor Author

Hi folks, thanks for the help. Is there any input on whether this is a blocker for the v1.33.0-alpha.1 cut happening tomorrow, Tuesday, Feb 4th?
cc: @SergeyKanzhelev @samuelkarp @haircommander

@stmcginnis stmcginnis moved this from FLAKY to PASSING in CI Signal (SIG Release / Release Team) Feb 20, 2025
@stmcginnis
Contributor

No more failures are showing up in Testgrid; this appears to be resolved.

@gjkim42
Member

gjkim42 commented Feb 21, 2025

Do we have any idea what made it fail and what fixed it?

@toVersus
Contributor

AFAIK, no action has been taken from our side. The last observed failure was on February 14, and it does not seem to have occurred since. I'll check whether any changes that might have had an impact were made after February 14.
https://storage.googleapis.com/k8s-triage/index.html?date=2025-02-20&test=ods%20Extended%20Pod%20Container%20Status%20should%20never%20report%20container%20start%20when%20an%20init%20container%20fails&xjob=e2e-kops#363c89d263aaa7250dc8

@toVersus
Contributor

I've looked into the changes made since 2/14, but I didn't find anything relevant. Also, the same error occurred again yesterday, so it seems that it hasn't actually been fixed.
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-containerd-e2e-ubuntu-gce/1893575816156024832

@toVersus
Contributor

I have created #130383 to help narrow down the cause, based on the assumption in #129800 (comment).

@wendy-ha18 wendy-ha18 moved this from PASSING to FLAKY in CI Signal (SIG Release / Release Team) Feb 25, 2025
@haircommander
Contributor

@toVersus would you like to be assigned this?

@toVersus
Contributor

Yes, if there isn’t anyone more suitable, I’ll continue handling it.

@toVersus
Contributor

/assign

@toVersus
Contributor

toVersus commented Apr 1, 2025

Even after #130383 was merged, the exit code 2 error still occurs. It doesn't seem to be caused by the base OS of the container image.
https://storage.googleapis.com/k8s-triage/index.html?date=2025-03-31&test=ods%20Extended%20Pod%20Container%20Status%20should%20never%20report%20container%20start%20when%20an%20init%20container%20fails&xjob=e2e-kops#363c89d263aaa7250dc8

As Sergey pointed out in #129800 (comment), the fact that the container runtime returns exit code 2 when the container is terminated remains unchanged.
https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kind-beta-features/1905801627755876352/artifacts/kind-worker/containerd.log

Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.008343558Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:pod-terminate-status-1-13,Uid:f63e3be4-84b0-47db-a543-9a919035ab56,Namespace:pods-2803,Attempt:0,} returns sandbox id \"da02ff327c7d452d6fc6944385b17358a802b81a7fad48cc51526a8eeb4cfacf\""
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.073735382Z" level=info msg="CreateContainer within sandbox \"da02ff327c7d452d6fc6944385b17358a802b81a7fad48cc51526a8eeb4cfacf\" for &ContainerMetadata{Name:fail,Attempt:0,} returns container id \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\""
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.299036490Z" level=info msg="StartContainer for \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" returns successfully"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.320168675Z" level=info msg="StopContainer for \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" with timeout 2 (s)"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.320928752Z" level=info msg="Stop container \"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\" with signal terminated"
(...)
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.395960909Z" level=info msg="received exit event container_id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  pid:193518  exit_status:2  exited_at:{seconds:1743214912  nanos:395373051}"
Mar 29 02:21:52 kind-worker containerd[186]: time="2025-03-29T02:21:52.396888084Z" level=info msg="TaskExit event in podsandbox handler container_id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  id:\"8ecb5f161fd700c4770938bfc28fdca0e99fa3b1834e4ac240621b1f82393f74\"  pid:193518  exit_status:2  exited_at:{seconds:1743214912  nanos:395373051}"

We tested a container that exits with code 1 both as a regular container and as an init container, and only the init container sometimes exits with code 2. I don't understand why this happens only for the init container. (A rough manual approximation of the init-container scenario is sketched after the test code below.)

ginkgo.It("should never report success for a pending container", func(ctx context.Context) {
ginkgo.By("creating pods that should always exit 1 and terminating the pod after a random delay")
createAndTestPodRepeatedly(ctx,
3, 15,
podFastDeleteScenario{client: podClient.PodInterface, delayMs: 2000},
podClient.PodInterface,
)
})
ginkgo.It("should never report container start when an init container fails", func(ctx context.Context) {
ginkgo.By("creating pods with an init container that always exit 1 and terminating the pod after a random delay")
createAndTestPodRepeatedly(ctx,
3, 15,
podFastDeleteScenario{client: podClient.PodInterface, delayMs: 2000, initContainer: true},
podClient.PodInterface,
)
})
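
A rough, manual approximation of what the init-container scenario races, not the exact spec createAndTestPodRepeatedly builds (the pod name, restartPolicy, and the fixed 2-second delay before deletion are assumptions made for illustration): create a pod whose init container exits 1, delete it about two seconds later while the init container may still be starting, and check which exit code ends up recorded.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: init-fail-race
spec:
  restartPolicy: Never
  initContainers:
  - name: fail
    image: registry.k8s.io/e2e-test-images/busybox@sha256:a9155b13325b2abef48e71de77bb8ac015412a566829f621d06bfae5c699b1b9
    command: ["/bin/false"]
  containers:
  - name: blocked
    image: registry.k8s.io/pause:3.1
EOF
sleep 2
kubectl delete pod init-fail-race --wait=false
kubectl get pod init-fail-race \
  -o jsonpath='{.status.initContainerStatuses[0].state.terminated.exitCode}{"\n"}'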

@wendy-ha18
Member

Hi folks, thanks for your support and attention on this issue!
The release cycle for v1.34 will start soon. Since this is still open, I will carry it over to the latest milestone.

/milestone 1.34

@k8s-ci-robot
Contributor

@wendy-ha18: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.26, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34]

Use /milestone clear to clear the milestone.

In response to this:

Hi folks, thanks for your support and attention on this issue!
The release cycle for v1.34 will start soon. Since this is still open, I will carry it over to the latest milestone.

/milestone 1.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@wendy-ha18
Member

/milestone v1.34

@Rajalakshmi-Girish
Contributor

Most recently failed on 19th May with the same error message as in the description:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1924496663112585216
Hence this issue is on the CI Signal Board for the v1.34 release too.

@Rajalakshmi-Girish
Contributor

@toVersus ^^
