blog: Add post on introducing Kubeflow Trainer V2 #169
base: master
Conversation
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Great work! Thank you so much @kramaranya
Thanks @kramaranya, I left a few comments!
/cc @astefanutti @deepanker13 @saileshd1402 @kubeflow/wg-training-leads
- Abstract Kubernetes complexity from data scientists
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made the **Trainer v2** possible with their hard work and valuable feedback. We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors. See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
Shall you also set numNodes: 2 ?
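For context, the snippet under review is TrainJob metadata; setting the node count there would look roughly like this (a sketch only: the `runtimeRef` name is illustrative, and field names follow the Trainer v2 `TrainJob` API as described elsewhere in this post):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  runtimeRef:
    name: torch-distributed   # illustrative runtime name
  trainer:
    numNodes: 2               # the value suggested in this review
```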
# LLM Fine-Tuning Support

Another improvement of **Trainer v2** is its **built-in support for fine-tuning large language models**, where we provide two types of trainers:
- `BuiltinTrainer` - already includes the fine-tuning logic and allows data scientists to quickly start fine-tuning requiring only parameter adjustments,
We should say that in the first release we will support TorchTune runtimes for Llama models.
cc @Electronic-Waste
Yes, I agree. We need to say that in the first release:
- We support `TorchTune LLM Trainer` as one option in `BuiltinTrainer`.
- For `TorchTune LLM Trainer`, we provide users with some runtimes (`ClusterTrainingRuntime`). And currently, we only support `Llama-3.2-1B-Instruct` and `Llama-3.2-3B-Instruct` in manifests respectively.
thank you! updated in 336b058
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, kubeflow/wg-training-leads. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@kramaranya Huge thanks for this. And thank you for mentioning @andreyvelich. I left some suggestions with regard to the LLM Fine-Tuning Support section.
# LLM Fine-Tuning Support

Another improvement of **Trainer v2** is its **built-in support for fine-tuning large language models**, where we provide two types of trainers:
- `BuiltinTrainer` - already includes the fine-tuning logic and allows data scientists to quickly start fine-tuning requiring only parameter adjustments,
Yes, I agree. We need to say that in the first release:
- We support `TorchTune LLM Trainer` as one option in `BuiltinTrainer`.
- For `TorchTune LLM Trainer`, we provide users with some runtimes (`ClusterTrainingRuntime`). And currently, we only support `Llama-3.2-1B-Instruct` and `Llama-3.2-3B-Instruct` in manifests respectively.
job_name = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            batch_size=1,
            epochs=1,
            num_nodes=5,
        ),
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="tatsu-lab/alpaca",
        )
    ),
    runtime=Runtime(
        name="torchtune-llama3.1-8b",
    ),
)
Suggested change:

job_name = client.train(
    runtime=Runtime(
        name="torchtune-llama3.2-1b"
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="<YOUR_HF_TOKEN>",  # Replace with your Hugging Face token
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            }
        )
    )
)
Maybe we need to switch to a runnable example
And you can also say that, "For more details, please refer to this example".
sgtm, thanks @Electronic-Waste!
updated in 336b058
Signed-off-by: kramaranya <kramaranya15@gmail.com>
@kramaranya Please can you update this diagram as well? https://www.kubeflow.org/docs/components/trainer/overview/#who-is-this-for
The diagram below shows how different personas interact with these custom resources:

![](/images/2025-07-09-introducing-trainer-v2/trainjob-runtime.drawio.svg)
Can you use user-personas diagram here and delete the other one ?
job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=my_train_func,
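For readers unfamiliar with `CustomTrainer`, the `func` argument is a plain, self-contained Python function that the SDK packages and launches on every training node. A minimal, hypothetical sketch of such a function (the name `my_train_func` and the env-var handling are illustrative; real training code would set up `torch.distributed` and a training loop here):

```python
def my_train_func():
    # The function must be self-contained (imports inside), since the SDK
    # serializes it and runs it in the runtime's container on each node.
    import os

    # Distributed runtimes typically inject torchrun-style environment
    # variables on each node; defaults let this run standalone too.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"Starting training on rank {rank} of {world_size}")
    # ... build the model and dataloaders, then run the training loop ...

if __name__ == "__main__":
    my_train_func()
```

The self-containment requirement is the key design point: because only the function body ships to the cluster, anything it needs must be imported or defined inside it.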
@kramaranya Did you get a chance to check it ?
- **[Native Kueue integration](https://github.com/kubernetes-sigs/kueue/issues/3884)** - improve resource management and scheduling capabilities for TrainJob resources
- **[Model Registry integrations](https://github.com/kubeflow/trainer/issues/2245)** - export trained models directly to Model Registry

For users migrating from **Trainer v1**, check out the [**Migration Guide**](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/).
Maybe we should highlight it in a separate section ?
And we should also say migrating from Kubeflow Training Operator v1.
title: "Introducing Kubeflow Trainer V2"
hide: false
permalink: /trainer/intro/
author: "AutoML & Training WG"
Maybe "Kubeflow Trainer Team"?
sounds good to me, thanks!
wdyt @andreyvelich?
Yes, Kubeflow Trainer Team sounds good!
Signed-off-by: kramaranya <kramaranya15@gmail.com>
author: "AutoML & Training WG"
---

Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.
nit: "hide this complexity"
**The main goals of KF Trainer v2 include:**
- Make AI/ML workloads easier to manage at scale
- Improve the Python interface
Suggested change: `- Improve the Python interface` → `- Provide a Pythonic interface to train models`
Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**
Suggested change: `**The main goals of KF Trainer v2 include:**` → `**The main goals of Kubeflow Trainer v2 include:**`
- Abstract Kubernetes complexity from AI Practitioners
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made the **Trainer v2** possible with their hard work and valuable feedback.
We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors.
We would also like to highlight [ahg-g](https://github.com/ahg-g), [kannon92](https://github.com/kannon92), and [vsoch](https://github.com/vsoch) whose feedback was essential while we designed the Kubeflow Trainer architecture together with the Batch WG.
See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.
nit: break lines to keep one sentence per line.
**Trainer v2** leverages these Kubernetes-native improvements to re-use existing functionality rather than reinventing the wheel. This collaboration between the Kubernetes and Kubeflow communities delivers a more standardized approach to ML training on Kubernetes.

# Division of Labor
Labor sounds a bit too laborious 😃. Maybe just "User Personas" or "For AI practitioners and MLOps engineers"?
@kramaranya Did you get a chance to check it ?
Sounds good to me :) I'm leaning towards "User Personas".
Another option I was considering was "Personas: Platform Engineers and AI Practitioners", but "User Personas" seems a better option in case we change personas later again.
cc @andreyvelich @franciscojavierarceo @tenzen-y @Electronic-Waste any preferences?
Sure, User Personas make sense to me.
Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**
It may be obvious for everyone involved in the project but it doesn't seem to me like very explicit / prominent in this article: PyTorch :)
I'd try to message that Kubeflow trainer v2 is the easiest and most scalable way to run PyTorch distributed training on Kubernetes!
Agree, emphasis that PyTorch is the primary framework for us makes sense.
Let's include this as one of the main goals.
WDYT @kramaranya @Electronic-Waste @tenzen-y @franciscojavierarceo ?
@kramaranya Did you get a chance to check it ?
Yeah, I do agree we should emphasize on this point.
I'm leaning toward modifying the current goal "Make AI/ML workloads easier to manage at scale" to be:
"Make AI/ML workloads easier to manage at scale, with PyTorch as the primary framework"
And then modify an intro:
"Running machine learning workloads on Kubernetes can be challenging. Distributed training and LLMs fine-tuning, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The Kubeflow Trainer v2 (KF Trainer) was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs."
Alternatively, we could just add a new goal with no intro changes:
"Deliver the easiest and most scalable PyTorch distributed training on Kubernetes"
What do you think @astefanutti @andreyvelich ?
@andreyvelich could you also take a look at ^^, so I can update it?
The description looks good.
For the goals, we can leave the goal to make AI/ML workloads easier to scale as it is, and just add another goal for PyTorch, as you said.
Awesome, updated in aa4f6e7. @astefanutti please let me know what you think :)
This looks great, thanks!
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Looks great, thanks a lot @kramaranya!
```

Currently, **KF Trainer v2** supports the **Co-Scheduling plugin** from the [Kubernetes scheduler-plugins](https://github.com/kubernetes-sigs/scheduler-plugins) project.
**[Volcano scheduler support](https://github.com/kubeflow/trainer/pull/2672)** is coming in future releases to provide more advanced scheduling capabilities.
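For context, gang scheduling is configured on the runtime side rather than per TrainJob. A hedged sketch of what that can look like (field names follow the Trainer v2 runtime API as described in the design docs; the runtime name, node count, and timeout are illustrative):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-gang   # illustrative name
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: auto
  podGroupPolicy:
    coscheduling:
      # Pods are scheduled all-or-nothing; give up after this timeout.
      scheduleTimeoutSeconds: 120
  template: {}  # JobSet template omitted for brevity
```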
Let's say Volcano and KAI Scheduler: kubeflow/trainer#2663
thanks, updated:)
The diagram above shows how this works in practice - the **KF Trainer** automatically **handles the SSH key generation** and **MPI communication** between training pods, which allows frameworks like DeepSpeed to coordinate training across multiple GPU nodes without requiring manual configuration of inter-node communication.
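To make the MPI flow above concrete, a runtime for an MPI-based framework might be declared roughly as follows (a hedged illustration only: field names such as `mpiImplementation` and `sshAuthMountPath` reflect the Trainer v2 MPI runtime policy from the design proposal and should be checked against the current API; the name and values are illustrative):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: deepspeed-distributed   # illustrative name
spec:
  mlPolicy:
    numNodes: 2
    mpi:
      numProcPerNode: 8
      mpiImplementation: OpenMPI
      sshAuthMountPath: /root/.ssh   # assumption: where generated SSH keys are mounted
  template: {}  # JobSet template omitted for brevity
```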
# Fault Tolerance Improvements
Looks great, thanks! Just added comment about KAI
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Thanks for this huge effort @kramaranya! /assign @tenzen-y @johnugeorge @terrytangyuan @astefanutti @franciscojavierarceo @Electronic-Waste @tarekabouzeid
Signed-off-by: kramaranya <kramaranya15@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@kramaranya Thanks for this great work! Just one nit.
.gitignore (outdated)
@@ -11,3 +11,4 @@ _notebooks/.ipynb_checkpoints
.netlify
.tweet-cache
__pycache__
.idea
Suggested change: keep `.idea` and add a blank line after it.

Need a new blank line here
Signed-off-by: kramaranya <kramaranya15@gmail.com>
/lgtm Thanks!
This is great, thanks @kramaranya!
I added a few minor suggestions and clarifying questions.
Running machine learning workloads on Kubernetes can be challenging.
Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge.
The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.
Suggestion to make this more generic:

Suggested change: The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed machine learning jobs.
The PyTorch focus is intentional here because we want to emphasize that the main goal of Trainer v2 is specifically to make distributed PyTorch jobs easier to run, see #169 (comment)
# Python SDK

**The KF Trainer v2** introduces a **redesigned Python SDK**, which is intended to be the **primary interface for AI Practitioners**.
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
What is meant by providing a unified interface across cloud environments?

Suggested change: The SDK provides the same interface for multiple ML frameworks, and abstracts the underlying complexities of Kubernetes and cloud environments.
This means you can use the same SDK commands and configurations for any cloud provider, without needing to learn different APIs for each platform. I think 'a unified interface' works better here compared to 'the same interface'. wdyt?
https://www.kubeflow.org/docs/components/trainer/overview/#what-is-kubeflow-trainer
Unified interface makes sense for me.
# Simplified API

Previously, in the **Kubeflow Training Operator**, users worked with different custom resources for each ML framework, each with their own framework-specific configurations.
The **KF Trainer v2** replaces these multiple CRDs with a **unified TrainJob API** that works with **multiple ML frameworks**.
Suggested change: **Kubeflow Trainer v2** replaces these multiple CRDs with a **unified TrainJob CRD** that works with **multiple ML frameworks**.
To stay consistent, we should keep `KF Trainer v2`, and I would keep `API` to avoid duplication :)
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framewor**k that allows customers to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.
I was under the impression that the pipeline framework was introduced to make it easier for Kubeflow Trainer developers to support adding new frameworks to Trainer, and was not a user-facing change.
@andreyvelich @tenzen-y do we intend to document how users can implement custom plugins?
Suggest replacing "customers" with "users":
Suggested change:
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framework** that allows users to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.
Yes, the doc is a work in progress by @IRONICBo, here: kubeflow/website#4039
@kramaranya @eoinfennessy Maybe we could be more explicit here, and say that it allows platform administrators to extend the Plugins ... ?
makes sense to me
updated in 4280688
@eoinfennessy: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed-off-by: kramaranya <kramaranya15@gmail.com>
New changes are detected. LGTM label has been removed.
![Kubeflow Trainer User Personas]

- **Platform Engineers** define and manage **the infrastructure configurations** required for training jobs using `TrainingRuntimes` or `ClusterTrainingRuntimes`.
Hm, @andreyvelich should it actually be Platform Administrators?
Yes, let's keep the persona name consistent please.
thanks, updated in 4280688
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Closes #168
cc @kubeflow/wg-automl-leads @andreyvelich @johnugeorge @terrytangyuan @tenzen-y @franciscojavierarceo @astefanutti @Electronic-Waste @varodrig @tarekabouzeid @briangallagher @szaher @eoinfennessy