RayCluster Headless Worker Service Should PublishNotReadyAddresses #2375
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@@ -307,9 +307,10 @@ func BuildHeadlessServiceForRayCluster(rayCluster rayv1.RayCluster) (*corev1.Ser
Labels: labels,
},
Spec: corev1.ServiceSpec{
For another PR, add unit tests for BuildHeadlessServiceForRayCluster
I went ahead and added a test case in this PR (c92042b) since it's a fairly small change.
LGTM
actualSelector := svc.Spec.Selector[utils.RayClusterLabelKey]
expectedSelector := instanceForSvc.Name
if !reflect.DeepEqual(expectedSelector, actualSelector) {
Might be easier to just construct the expected Service object and do a reflect.DeepEqual
check against that and the returned object from BuildHeadlessServiceForRayCluster.
Maybe for a separate PR because this is just following the pattern of the other tests.
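A rough sketch of what that whole-object comparison could look like. The `Service` and `ServiceSpec` types below are simplified stand-ins for the `corev1` types, and `buildHeadlessService`, `matchesExpected`, the label key constant, and the `-headless` name suffix are all hypothetical, used only for illustration; this is not the actual KubeRay test.

```go
package main

import "reflect"

// clusterLabelKey is an illustrative label key (assumption, not read from KubeRay).
const clusterLabelKey = "ray.io/cluster"

// ServiceSpec is a simplified stand-in for corev1.ServiceSpec.
type ServiceSpec struct {
	ClusterIP                string
	Selector                 map[string]string
	PublishNotReadyAddresses bool
}

// Service is a simplified stand-in for corev1.Service.
type Service struct {
	Name string
	Spec ServiceSpec
}

// buildHeadlessService is a hypothetical stand-in for the builder under test.
func buildHeadlessService(clusterName string) Service {
	return Service{
		Name: clusterName + "-headless", // illustrative naming only
		Spec: ServiceSpec{
			ClusterIP:                "None",
			Selector:                 map[string]string{clusterLabelKey: clusterName},
			PublishNotReadyAddresses: true,
		},
	}
}

// matchesExpected constructs the full expected object once and compares it
// with the actual one in a single reflect.DeepEqual call, instead of
// checking each field separately.
func matchesExpected(actual Service, clusterName string) bool {
	expected := Service{
		Name: clusterName + "-headless",
		Spec: ServiceSpec{
			ClusterIP:                "None",
			Selector:                 map[string]string{clusterLabelKey: clusterName},
			PublishNotReadyAddresses: true,
		},
	}
	return reflect.DeepEqual(expected, actual)
}
```

One trade-off of this style: any new field set by the builder makes the comparison fail until the expected object is updated, which catches unintended changes but also makes the test more sensitive to intended ones.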
@kevin85421 PTAL
Why are these changes needed?
This PR adds PublishNotReadyAddresses: true to the headless service created by KubeRay when a RayCluster requests multi-host TPU nodes. When creating a RayService for multi-host TPU inference, TPU initialization currently times out because one or more of the workers might be unreachable until a proxy actor is running on that node. This PR unblocks multi-host inference on TPUs with a RayService, since TPU initialization requires worker-to-worker communication even if not all proxy actors have started yet.
Related issue number
Checks
PublishNotReadyAddresses: true and verified RayService deployment succeeded
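For readers unfamiliar with the field, here is a minimal sketch of the shape of the change described under "Why are these changes needed?". It uses simplified stand-in types rather than the real `corev1.Service`, and `buildHeadlessService`, the label key constant, and the `-headless` name suffix are hypothetical; the actual change lives in `BuildHeadlessServiceForRayCluster`.

```go
package main

// clusterLabelKey is an illustrative label key (assumption, not read from KubeRay).
const clusterLabelKey = "ray.io/cluster"

// ServiceSpec is a simplified stand-in for corev1.ServiceSpec; only the
// fields relevant to this PR are modeled.
type ServiceSpec struct {
	ClusterIP                string
	Selector                 map[string]string
	PublishNotReadyAddresses bool
}

// Service is a simplified stand-in for corev1.Service.
type Service struct {
	Name string
	Spec ServiceSpec
}

// buildHeadlessService is a hypothetical helper showing the idea only.
func buildHeadlessService(clusterName string) Service {
	return Service{
		Name: clusterName + "-headless", // illustrative naming only
		Spec: ServiceSpec{
			// ClusterIP "None" is what makes a Kubernetes Service headless.
			ClusterIP: "None",
			Selector:  map[string]string{clusterLabelKey: clusterName},
			// Publish addresses of Pods that are not yet Ready, so workers
			// can reach each other before every proxy actor has started.
			PublishNotReadyAddresses: true,
		},
	}
}
```

With the field set, the addresses of worker Pods are published through the headless service regardless of readiness, which is the behavior TPU initialization depends on here.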