[WIP]Adjust durations for PodLifecycleSleepAction e2e tests. #128642
Conversation
/hold Needs some rounds of tests.
8580cb7 to b8fec3e (force-push)
/test all
(2 similar comments)
},
}
podWithHook := getPodWithHook("pod-with-prestop-sleep-hook", imageutils.GetPauseImageName(), lifecycle)
podWithHook.Spec.TerminationGracePeriodSeconds = ptr.To[int64](60)
ginkgo.By("create the pod with lifecycle hook using sleep action")
podClient.CreateSync(ctx, podWithHook)
ginkgo.By("delete the pod with lifecycle hook using sleep action")
start := time.Now()
podClient.DeleteSync(ctx, podWithHook.Name, metav1.DeleteOptions{}, e2epod.DefaultPodDeletionTimeout)
This test relies on sending a Delete request and waiting for the pod to no longer be present in the namespace. That sounds very brittle to me, and the workaround of using very large windows to absorb any possible CI latency means we won't be completely sure that the feature is working as expected.
@thockin don't we have any other method to validate this feature than relying on the overall time it takes the pod to disappear?
The main logic of the sleep hook is in the container lifecycle management code within kubelet, where it waits for a certain period when a container terminates. This means we may not have a way to monitor this process other than by observing the pod status.
However, instead of waiting for the pod to disappear from the namespace after deletion, there might be an alternative approach. This approach doesn't require the full lifecycle of creating and deleting the pod, which could yield higher accuracy than the current method.
We create a pod with a restartPolicy of Always and add a livenessProbe that will fail quickly. When the livenessProbe fails, kubelet will send a TERM signal to the container and start executing the sleep hook.
After the sleep hook executes, kubelet will increment the container's restart count by one. We only need to check the time interval between when the restart count reaches 1 and when the container enters the Running state.
An additional benefit of this method is that we don't need to verify whether the time we capture falls between the sleep hook execution and the TerminationGracePeriodSeconds (this greatly reduces the probability of test failure). We don't need to worry about the value of TerminationGracePeriodSeconds; it can be an arbitrarily large value.
A possible pod is:
apiVersion: v1
kind: Pod
metadata:
  name: auto-restart-pod
spec:
  terminationGracePeriodSeconds: 90
  containers:
  - name: auto-restart-container
    image: k8s.gcr.io/pause:3.2
    livenessProbe:
      tcpSocket:
        port: 80
      periodSeconds: 1
    lifecycle:
      preStop:
        sleep:
          seconds: 10
  restartPolicy: Always
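For what it's worth, here is a rough client-go sketch of that measurement (illustration only, not code from this PR). It assumes the interval is taken from the first time the container is observed Running until restartCount reaches 1; the function name, package name, and poll settings are made up.

// Illustrative sketch only: measure the time from the container's first observed
// Running state to the moment restartCount reaches 1 (liveness failure + preStop
// sleep + restart).
package e2esketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func measureSleepViaRestart(ctx context.Context, c kubernetes.Interface, ns, name string) (time.Duration, error) {
	var firstRunning time.Time

	// Wait until the container is observed Running for the first time.
	if err := wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			if len(pod.Status.ContainerStatuses) > 0 && pod.Status.ContainerStatuses[0].State.Running != nil {
				firstRunning = time.Now()
				return true, nil
			}
			return false, nil
		}); err != nil {
		return 0, err
	}

	// Wait until the restart count reaches 1, i.e. the liveness probe failed,
	// the preStop sleep completed and the container was restarted.
	if err := wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			return len(pod.Status.ContainerStatuses) > 0 && pod.Status.ContainerStatuses[0].RestartCount >= 1, nil
		}); err != nil {
		return 0, err
	}

	// The measured interval should be at least the preStop sleep (10s above),
	// regardless of how large terminationGracePeriodSeconds is.
	return time.Since(firstRunning), nil
}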
/cc @SergeyKanzhelev @kannon92 WDYT?
Can we reuse the logic in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/container_probe.go?
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
Yes, we can redesign this test based on this code.
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
10s is the time I set for local testing. We can make it longer in CI.
I don't have a better answer, sadly. This feature is fundamentally about sleeping. That said, the implementation of this is pretty trivial, so I think it's low risk of false-pass.
We can set the sleep pretty short (5-10s), the grace period very long (2-5m), and assert that the observed time was between sleep (O(seconds)) and grace (O(minutes)). It doesn't need to be super tight; we just don't want to waste time.
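For illustration, that loose bound check could look roughly like this (a sketch, not the PR's code; the constant names are made up, and start is the timestamp recorded just before the Delete call in the snippet above):

// Sketch of the loose bound check described above (constant names are illustrative).
package e2esketch

import (
	"time"

	"k8s.io/kubernetes/test/e2e/framework"
)

const (
	prestopSleep = 5 * time.Second // short preStop sleep (O(seconds))
	gracePeriod  = 3 * time.Minute // long terminationGracePeriodSeconds (O(minutes))
)

// assertDeletionWindow fails the test unless the observed deletion time lands
// between the preStop sleep and the grace period.
func assertDeletionWindow(start time.Time) {
	observed := time.Since(start)
	if observed < prestopSleep || observed > gracePeriod {
		framework.Failf("pod deletion took %v, expected between %v and %v", observed, prestopSleep, gracePeriod)
	}
}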
Currently my implementation is the same as @thockin suggests.
@HirazawaUi has given a great idea here, but I'm not sure it's worth implementing it that way here. We would still need to check a duration with that approach, so it could also be flaky.
Notice that although we now use a very long grace period, these tests generally don't last that long.
In the first test, we verify a normal sleepAction (15s), so deletion starts immediately after 15 seconds.
In the second test, we change the grace period to 15s, so deletion also starts immediately after 15 seconds.
In the third test, we test an exceptional case where sleepAction is ignored, and deletion proceeds immediately.
Could adding a finalizer and checking pod conditions help us here?
The finalizer will block the deletion of the pod object IIRC; is there any condition in the pod status that we could use to calculate the sleep time?
is there any condition in the pod status that can serve us to calculate the sleep time?
I think only the .metadata.deletionTimestamp can help here?
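To make that concrete, the calculation being discussed might look roughly like this (a sketch under the assumption that the pod carries a test finalizer so the object is still readable after deletion, and that it has a single container; not code from this PR):

// Sketch: with a finalizer keeping the deleted pod visible, compute the observed
// sleep from deletionTimestamp and the container's termination time.
package e2esketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

func observedSleep(ctx context.Context, c kubernetes.Interface, ns, name string) {
	pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	framework.ExpectNoError(err)

	deletionAt := pod.DeletionTimestamp.Time
	terminated := pod.Status.ContainerStatuses[0].State.Terminated
	if terminated == nil {
		framework.Failf("container has not terminated yet: %+v", pod.Status)
		return
	}

	// The container should only stop after the preStop sleep has run, so this
	// difference should be at least the configured sleep duration.
	framework.Logf("observed sleep: %v", terminated.FinishedAt.Time.Sub(deletionAt))
}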
commented on kubernetes/enhancements#3960 (comment)
/priority important-soon
Please aim to address this in 1.33. We would like to promote this feature once these tests are stable.
/retest
Sure, but some discussions are still ongoing, and we need to determine whether we need a different way to calculate the sleep time.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: AxeZhan. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@kannon92 @thockin @aojea @HirazawaUi
// longer than 5 seconds (pod should sleep for 5 seconds)
// shorter than gracePeriodSeconds (30 seconds here)
if !validDuration(finishAt.Time.Sub(deletionAt), 5, 30) {
	framework.Failf("unexpected delay duration before killing the pod, finishAt = %v, deletionAt = %v", finishAt, deletionAt)
}
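For readers following along, the validDuration helper is presumably shaped something like this (a guess for context; the PR's actual implementation may differ):

// Hypothetical shape of validDuration: true when d lies between the preStop
// sleep and the grace period, both given in seconds.
func validDuration(d time.Duration, minSeconds, maxSeconds int64) bool {
	return d >= time.Duration(minSeconds)*time.Second && d <= time.Duration(maxSeconds)*time.Second
}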
Dump the entire Pod so we can inspect all the fields; maybe we can get some ideas from the conditions or something.
I dumped the entire pod in JSON format and didn't get much useful information. The relevant timestamps were:
creationTimestamp: "2025-05-20T07:02:11Z"
deletionTimestamp: "2025-05-20T07:02:43Z"
startedAt: "2025-05-20T07:02:12Z"
finishedAt: "2025-05-20T07:02:20Z"
I tried these steps locally on my kind cluster and got the same result (deletionTimestamp is later than finishedAt). I also tried deleting without the finalizer, and that shows the pre-stop hook is indeed working.
Is there something I'm missing that causes the deletionTimestamp to be set incorrectly?
43375f6 to 4aaa19a (force-push)
@AxeZhan: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Is kubelet setting deletionTimestamp to
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adjusts the durations of the PodLifecycleSleepAction e2e tests to avoid flakes.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: