Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RayCluster Headless Worker Service Should PublishNotReadyAddresses #2375

Merged
merged 7 commits into from
Sep 12, 2024

Conversation

ryanaoleary
Copy link
Contributor

Why are these changes needed?

This PR adds PublishNotReadyAddresses: true to the headless service created by KubeRay when a RayCluster requests multi-host TPU nodes. When creating a RayService for multi-host TPU inference, TPU initialization currently times out because 1 or more of the workers might be unreachable until a proxy actor is running on that node. This PR unblocks multi-host inference on TPUs with a RayService, since TPU initialization requires worker-to-worker communication even if all proxy actors haven't started yet.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests - edited a running headless-worker-svc with PublishNotReadyAddresses: true and verified RayService deployment succeeded
    • This PR is not tested :(

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary
Copy link
Contributor Author

cc: @kevin85421 @andrewsykim

ray-operator/controllers/ray/common/service.go Outdated Show resolved Hide resolved
@@ -307,9 +307,10 @@ func BuildHeadlessServiceForRayCluster(rayCluster rayv1.RayCluster) (*corev1.Ser
Labels: labels,
},
Spec: corev1.ServiceSpec{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For another PR, add unit tests for BuildHeadlessServiceForRayCluster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and added a test case in this PR (c92042b) since it's a fairly small change.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Copy link
Collaborator

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


actualSelector := svc.Spec.Selector[utils.RayClusterLabelKey]
expectedSelector := instanceForSvc.Name
if !reflect.DeepEqual(expectedSelector, actualSelector) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be easier to just construct the expected Service object and do a reflect.DeepEqual check against that and the returned object from BuildHeadlessServiceForRayCluster.

Maybe for a separate PR because this is just following the pattern of the other tests.

@andrewsykim
Copy link
Collaborator

@kevin85421 PTAL

@kevin85421 kevin85421 merged commit b8f6d06 into ray-project:master Sep 12, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy