raycluster_controller: generate events for failed pod creation #2286

Merged

Conversation

MadhavJivrajani
Contributor

Why are these changes needed?

Generate events for when the raycluster_controller fails to create:

  • Head pods
  • Worker pods

The event generated has EventTypeWarning as the event type. This commit also introduces the following event reasons as constants for ease of use and testing:

  • FailedToCreateResource ("Failed")
  • CreatedResource ("Created")
  • DeletedResource ("Deleted")
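A minimal sketch of the behaviour described above, not the code from this PR. To stay self-contained it uses a hypothetical `eventRecorder` interface and an in-memory `fakeRecorder` as stand-ins for the recorder the controller actually uses; the `createHeadPod` helper, its `create` parameter, and `errQuota` are likewise illustrative assumptions. Only the three reason constants and the Warning event type come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Event types, mirroring the string values of corev1.EventTypeNormal and
// corev1.EventTypeWarning.
const (
	eventTypeNormal  = "Normal"
	eventTypeWarning = "Warning"
)

// Event reasons named in the PR description.
const (
	FailedToCreateResource = "Failed"
	CreatedResource        = "Created"
	DeletedResource        = "Deleted"
)

// errQuota is an illustrative error used to simulate a failed API request.
var errQuota = errors.New("quota exceeded")

// eventRecorder is a hypothetical stand-in for the controller's recorder.
type eventRecorder interface {
	Eventf(eventType, reason, messageFmt string, args ...any)
}

// fakeRecorder keeps formatted events in memory so a test can inspect them.
type fakeRecorder struct {
	events []string
}

func (f *fakeRecorder) Eventf(eventType, reason, messageFmt string, args ...any) {
	msg := fmt.Sprintf(messageFmt, args...)
	f.events = append(f.events, fmt.Sprintf("%s %s %s", eventType, reason, msg))
}

// createHeadPod runs create and records a Warning event on failure or a
// Normal event on success.
func createHeadPod(rec eventRecorder, name string, create func() error) error {
	if err := create(); err != nil {
		rec.Eventf(eventTypeWarning, FailedToCreateResource, "Failed to create head pod %s: %v", name, err)
		return err
	}
	rec.Eventf(eventTypeNormal, CreatedResource, "Created head pod %s", name)
	return nil
}

func main() {
	rec := &fakeRecorder{}
	_ = createHeadPod(rec, "demo-head", func() error { return errQuota })
	_ = createHeadPod(rec, "demo-head", func() error { return nil })
	for _, e := range rec.events {
		fmt.Println(e)
	}
}
```

Because the reasons are constants, a test can compare recorded events against them directly instead of repeating string literals.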

This commit additionally adds in tests to verify this behaviour.

Related issue number

Towards #2250

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

cc @rueian @kevin85421

@MadhavJivrajani MadhavJivrajani force-pushed the events-failed-pod-creation branch from 93697a9 to 36859cf Compare August 5, 2024 12:07
@MadhavJivrajani MadhavJivrajani changed the title raycluster_controller: generate events for failed pod creation [WIP] raycluster_controller: generate events for failed pod creation Aug 5, 2024
@MadhavJivrajani
Contributor Author

Once we can reach agreement on the right way to do this, we can also do something similar for #2189 and #2210

@MadhavJivrajani MadhavJivrajani force-pushed the events-failed-pod-creation branch 3 times, most recently from ac2f2a7 to 3f02a29 Compare August 5, 2024 13:53
@kevin85421 kevin85421 self-assigned this Aug 5, 2024
@MadhavJivrajani MadhavJivrajani force-pushed the events-failed-pod-creation branch from 3f02a29 to 3384881 Compare August 5, 2024 17:23
@MadhavJivrajani MadhavJivrajani changed the title [WIP] raycluster_controller: generate events for failed pod creation raycluster_controller: generate events for failed pod creation Aug 7, 2024
@MadhavJivrajani MadhavJivrajani force-pushed the events-failed-pod-creation branch from b975209 to dff7bf6 Compare August 7, 2024 15:24
@MadhavJivrajani MadhavJivrajani requested a review from rueian August 7, 2024 15:25
@rueian
Contributor

rueian commented Aug 7, 2024

image

There is also one linter issue to fix.

Generate events for when the raycluster_controller fails to create:
- Head pods
- Worker pods

The event generated has EventTypeWarning as the event type. This commit
also introduces the following event reasons as constants for ease of use
and testing:
- FailedToCreateResource ("Failed")
- CreatedResource ("Created")
- DeletedResource ("Deleted")

This commit additionally adds in tests to verify this behaviour.

Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
@MadhavJivrajani MadhavJivrajani force-pushed the events-failed-pod-creation branch from dff7bf6 to e9a25d9 Compare August 7, 2024 17:42
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
@MadhavJivrajani
Contributor Author

MadhavJivrajani commented Aug 13, 2024

@rueian I've added failure events for all failed requests made to the Kubernetes API as part of this PR itself. Can you please take an initial look?

I'll need a day or two to add in the tests as well. We can probably refactor them and add a single high-level test for all events.

@MadhavJivrajani
Contributor Author

Also, currently, I've used the same event const for all purposes.
But for example, in the Kubelet event definition, there's distinct types defined for each category of events: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/events/event.go

Do we want to go a similar route?

@rueian
Contributor

rueian commented Aug 13, 2024

> Also, currently, I've used the same event const for all purposes. But for example, in the Kubelet event definition, there's distinct types defined for each category of events: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/events/event.go
>
> Do we want to go a similar route?

Yes, we indeed need different event reasons for different categories. The current reason list (Failed, Created, and Deleted) is not enough.

That is because the EventCorrelator aggregates events by every field except Event.Message. Therefore, if we don't give them different Event.Reason values, they will all be aggregated together.

image

https://github.com/kubernetes/client-go/blob/master/tools/record/events_cache.go#L424-L438
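The aggregation behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the client-go implementation: the `event` struct holds only a subset of the real Event fields, and `aggregationKey`, `withReason`, the component name, and the two per-resource reason strings are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a simplified, hypothetical subset of a Kubernetes Event.
type event struct {
	Component string // reporting component
	Kind      string // kind of the involved object
	Namespace string
	Name      string
	Type      string // "Normal" or "Warning"
	Reason    string
	Message   string
}

// aggregationKey builds a grouping key from every field except Message,
// so two events that differ only in Message land in the same group.
func aggregationKey(e event) string {
	return strings.Join([]string{e.Component, e.Kind, e.Namespace, e.Name, e.Type, e.Reason}, "|")
}

// withReason returns a copy of e with the given reason and message.
func withReason(e event, reason, message string) event {
	e.Reason = reason
	e.Message = message
	return e
}

func main() {
	base := event{Component: "example-operator", Kind: "RayCluster", Namespace: "default", Name: "demo", Type: "Warning"}

	// One shared reason: unrelated failures collapse into the same group.
	podShared := withReason(base, "Failed", "Failed to create head pod")
	svcShared := withReason(base, "Failed", "Failed to create service")
	fmt.Println(aggregationKey(podShared) == aggregationKey(svcShared)) // true

	// Distinct (illustrative) reasons: the failures stay in separate groups.
	podOwn := withReason(base, "FailedToCreateHeadPod", "Failed to create head pod")
	svcOwn := withReason(base, "FailedToCreateService", "Failed to create service")
	fmt.Println(aggregationKey(podOwn) == aggregationKey(svcOwn)) // false
}
```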

@MadhavJivrajani MadhavJivrajani force-pushed the events-failed-pod-creation branch from 000497f to 97aa676 Compare August 20, 2024 03:54
@MadhavJivrajani
Contributor Author

@rueian thanks, that's super helpful!

Can you please take a look at the latest commit? It adds in event reason types for the following buckets:

  1. Worker pod events
  2. Head pod events
  3. Generic pod events (this is a separate bucket for events such as delete events when a cluster is suspended)
  4. Service events
  5. Route events
  6. Role and RoleBinding events

Please lmk if this makes sense. Thank you!
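As a rough illustration of what per-bucket reasons might look like (a sketch under assumptions, not the contents of the commit): `CreatedHeadPod` and `CreatedWorkerPod` appear in the diff under review, while the `EventReason` type, the `FailedToCreate...` and service names, and the `allReasons` and `distinct` helpers are hypothetical.

```go
package main

import "fmt"

// EventReason is a hypothetical named type for event reasons.
type EventReason string

// Head pod events.
const (
	CreatedHeadPod        EventReason = "CreatedHeadPod"
	FailedToCreateHeadPod EventReason = "FailedToCreateHeadPod" // illustrative name
)

// Worker pod events.
const (
	CreatedWorkerPod        EventReason = "CreatedWorkerPod"
	FailedToCreateWorkerPod EventReason = "FailedToCreateWorkerPod" // illustrative name
)

// Service events (illustrative names).
const (
	CreatedService        EventReason = "CreatedService"
	FailedToCreateService EventReason = "FailedToCreateService"
)

// allReasons lists every reason defined in this sketch.
func allReasons() []EventReason {
	return []EventReason{
		CreatedHeadPod, FailedToCreateHeadPod,
		CreatedWorkerPod, FailedToCreateWorkerPod,
		CreatedService, FailedToCreateService,
	}
}

// distinct reports whether no reason value occurs twice. Reasons only keep
// events apart during aggregation if their string values differ.
func distinct(reasons []EventReason) bool {
	seen := make(map[EventReason]struct{}, len(reasons))
	for _, r := range reasons {
		if _, dup := seen[r]; dup {
			return false
		}
		seen[r] = struct{}{}
	}
	return true
}

func main() {
	fmt.Println(distinct(allReasons()))
}
```

A check like `distinct` could run in a unit test to catch two buckets accidentally sharing a reason string.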

… add failure events

This commit introduces event reasons for different resources so that the EventCorrelator
can collapse similar events.

This commit also adds in failure events for all failed API requests to the Kubernetes API.

Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
@MadhavJivrajani MadhavJivrajani force-pushed the events-failed-pod-creation branch from 97aa676 to 37395f9 Compare August 20, 2024 23:07
Member

@kevin85421 kevin85421 left a comment

I will open a follow up PR to fix my comments.


// Worker pod event list
const (
CreatedWorkerPod = "CreatedWorkerPod"
Member

This is not used.

return err
}
logger.Info("Created pod", "Pod ", pod.GenerateName)
r.Recorder.Eventf(&instance, corev1.EventTypeNormal, "Created", "Created worker pod %s", pod.Name)
r.Recorder.Eventf(&instance, corev1.EventTypeNormal, CreatedHeadPod, "Created worker pod %s/%s", pod.Namespace, pod.Name)
Member

CreatedWorkerPod

@@ -944,10 +1015,11 @@ func (r *RayClusterReconciler) createService(ctx context.Context, raySvc *corev1
logger.Info("Pod service already exist, no need to create")
Member

TODO: rename raySvc.

@kevin85421 kevin85421 merged commit be9c5e4 into ray-project:master Aug 31, 2024
26 of 27 checks passed