[WIP]Adjust durations for PodLifecycleSleepAction e2e tests. #128642
Conversation
/hold Needs some rounds of tests.
8580cb7 to b8fec3e (force-push)
/test all
(2 similar comments)
},
}
podWithHook := getPodWithHook("pod-with-prestop-sleep-hook", imageutils.GetPauseImageName(), lifecycle)
podWithHook.Spec.TerminationGracePeriodSeconds = ptr.To[int64](60)
ginkgo.By("create the pod with lifecycle hook using sleep action")
podClient.CreateSync(ctx, podWithHook)
ginkgo.By("delete the pod with lifecycle hook using sleep action")
start := time.Now()
podClient.DeleteSync(ctx, podWithHook.Name, metav1.DeleteOptions{}, e2epod.DefaultPodDeletionTimeout)
This test relies on sending a Delete request and waiting for the pod to no longer be present in the namespace. That sounds very brittle to me, and the workaround of using very large windows to absorb any possible CI latency means we won't be completely sure that the feature is working as expected.
@thockin don't we have any other method to validate this feature than relying on the overall time it takes the pod to disappear?
The main logic of the sleep hook is in the container lifecycle management code within kubelet, where it waits for a certain period when a container terminates. This means we may not have a way to monitor this process other than by observing the pod status.
However, instead of waiting for the pod to disappear from the namespace after deletion, there might be an alternative approach. This approach doesn't require the full lifecycle of creating and deleting the pod, which could yield higher accuracy than the current method.
We create a pod with a restartPolicy of Always and add a livenessProbe that will fail quickly. When the livenessProbe fails, kubelet will send a TERM signal to the container and start executing the sleep hook.
After the sleep hook executes, kubelet will increment the container's restart count by one. We only need to check the time interval between when the restart count reaches 1 and when the container enters the Running state.
An additional benefit of this method is that we don't need to verify whether the time we capture falls between the sleep hook execution and the TerminationGracePeriodSeconds (this greatly reduces the probability of test failure). We don't need to worry about the value of TerminationGracePeriodSeconds; it can be an arbitrarily large value.
A possible pod is:
apiVersion: v1
kind: Pod
metadata:
  name: auto-restart-pod
spec:
  terminationGracePeriodSeconds: 90
  containers:
  - name: auto-restart-container
    image: k8s.gcr.io/pause:3.2
    livenessProbe:
      tcpSocket:
        port: 80
      periodSeconds: 1
    lifecycle:
      preStop:
        sleep:
          seconds: 10
  restartPolicy: Always
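For what it's worth, here is a rough client-go sketch of that measurement (illustration only, not code from this PR). It assumes the interval is taken from the first time the container is observed Running until restartCount reaches 1; the function name, package name, and poll settings are made up.

// Illustrative sketch only: measure the time from the container's first observed
// Running state to the moment restartCount reaches 1 (liveness failure + preStop
// sleep + restart).
package e2esketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func measureSleepViaRestart(ctx context.Context, c kubernetes.Interface, ns, name string) (time.Duration, error) {
	var firstRunning time.Time

	// Wait until the container is observed Running for the first time.
	if err := wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			if len(pod.Status.ContainerStatuses) > 0 && pod.Status.ContainerStatuses[0].State.Running != nil {
				firstRunning = time.Now()
				return true, nil
			}
			return false, nil
		}); err != nil {
		return 0, err
	}

	// Wait until the restart count reaches 1, i.e. the liveness probe failed,
	// the preStop sleep completed and the container was restarted.
	if err := wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			return len(pod.Status.ContainerStatuses) > 0 && pod.Status.ContainerStatuses[0].RestartCount >= 1, nil
		}); err != nil {
		return 0, err
	}

	// The measured interval should be at least the preStop sleep (10s above),
	// regardless of how large terminationGracePeriodSeconds is.
	return time.Since(firstRunning), nil
}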
/cc @SergeyKanzhelev @kannon92 WDYT?
Can we reuse the logic in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/container_probe.go?
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
Yes, we can redesign this test based on this code.
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
10s is the time I set for local testing. We can make it longer in CI.
I don't have a better answer, sadly. This feature is fundamentally about sleeping. That said, the implementation of this is pretty trivial, so I think it's low risk of false-pass.
We can set the sleep pretty short (5-10s), the grace period very long (2-5m), and assert that the observed time was between sleep (O(seconds)) and grace (O(minutes)). It doesn't need to be super tight; we just don't want to waste time.
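For illustration, that loose bound check could look roughly like this (a sketch, not the PR's code; the constant names are made up, and start is the timestamp recorded just before the Delete call in the snippet above):

// Sketch of the loose bound check described above (constant names are illustrative).
package e2esketch

import (
	"time"

	"k8s.io/kubernetes/test/e2e/framework"
)

const (
	prestopSleep = 5 * time.Second // short preStop sleep (O(seconds))
	gracePeriod  = 3 * time.Minute // long terminationGracePeriodSeconds (O(minutes))
)

// assertDeletionWindow fails the test unless the observed deletion time lands
// between the preStop sleep and the grace period.
func assertDeletionWindow(start time.Time) {
	observed := time.Since(start)
	if observed < prestopSleep || observed > gracePeriod {
		framework.Failf("pod deletion took %v, expected between %v and %v", observed, prestopSleep, gracePeriod)
	}
}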
Currently my implementation is the same as @thockin suggests.
@HirazawaUi has given a great idea here, but I'm not sure it's worth implementing it that way here. We would still need to check a duration with that approach, so it could also be flaky.
Notice that although we now use a very long grace period, these tests generally don't last that long.
In the first test, we verify a normal sleepAction (15s), so deletion starts immediately after 15 seconds.
In the second test, we change the grace period to 15s, so deletion also starts immediately after 15 seconds.
In the third test, we test an exceptional case where sleepAction is ignored, and deletion proceeds immediately.
Could adding a finalizer and checking pod conditions help us here?
The finalizer will block the deletion of the pod object IIRC; is there any condition in the pod status that we could use to calculate the sleep time?
is there any condition in the pod status that can serve us to calculate the sleep time?
I think only the .metadata.deletionTimestamp can help here?
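To make that concrete, the calculation being discussed might look roughly like this (a sketch under the assumption that the pod carries a test finalizer so the object is still readable after deletion, and that it has a single container; not code from this PR):

// Sketch: with a finalizer keeping the deleted pod visible, compute the observed
// sleep from deletionTimestamp and the container's termination time.
package e2esketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

func observedSleep(ctx context.Context, c kubernetes.Interface, ns, name string) {
	pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	framework.ExpectNoError(err)

	deletionAt := pod.DeletionTimestamp.Time
	terminated := pod.Status.ContainerStatuses[0].State.Terminated
	if terminated == nil {
		framework.Failf("container has not terminated yet: %+v", pod.Status)
		return
	}

	// The container should only stop after the preStop sleep has run, so this
	// difference should be at least the configured sleep duration.
	framework.Logf("observed sleep: %v", terminated.FinishedAt.Time.Sub(deletionAt))
}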
commented on kubernetes/enhancements#3960 (comment)
/priority important-soon
Please aim to address this in 1.33. We would like to promote this feature once these tests are stable.
/retest
Sure, but some discussions are still ongoing, and we need to determine whether we need a different way to calculate the sleep time.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: AxeZhan. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@kannon92 @thockin @aojea @HirazawaUi
// longer than 5 seconds (pod should sleep for 5 seconds)
// shorter than gracePeriodSeconds (30 seconds here)
if !validDuration(finishAt.Time.Sub(deletionAt), 5, 30) {
	framework.Failf("unexpected delay duration before killing the pod, finishAt = %v, deletionAt = %v", finishAt, deletionAt)
}
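For readers following along, the validDuration helper is presumably shaped something like this (a guess for context; the PR's actual implementation may differ):

// Hypothetical shape of validDuration: true when d lies between the preStop
// sleep and the grace period, both given in seconds.
func validDuration(d time.Duration, minSeconds, maxSeconds int64) bool {
	return d >= time.Duration(minSeconds)*time.Second && d <= time.Duration(maxSeconds)*time.Second
}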
Dump the entire Pod so we can inspect all the fields; maybe we can get some ideas from the conditions or something.
I dumped the entire pod in JSON format and didn't get much useful information. The relevant timestamps were:
creationTimestamp: "2025-05-20T07:02:11Z"
deletionTimestamp: "2025-05-20T07:02:43Z"
startedAt: "2025-05-20T07:02:12Z"
finishedAt: "2025-05-20T07:02:20Z"
I tried these steps locally on my kind cluster and got the same result (deletionTimestamp is later than finishedAt). I also tried deleting without the finalizer, and that shows the pre-stop hook is indeed working.
Is there something I'm missing that causes the deletionTimestamp to be set incorrectly?
43375f6 to 4aaa19a (force-push)
@AxeZhan: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Is kubelet setting deletionTimestamp to
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adjusts the durations of the PodLifecycleSleepAction e2e tests to avoid flakes.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: