blog: Add post on introducing Kubeflow Trainer V2 #169

Open
wants to merge 17 commits into base: master

Conversation

kramaranya

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Member

@tarekabouzeid tarekabouzeid left a comment


Great work! Thank you so much @kramaranya

Member

@andreyvelich andreyvelich left a comment


Thanks @kramaranya, I left a few comments!
/cc @astefanutti @deepanker13 @saileshd1402 @kubeflow/wg-training-leads

- Abstract Kubernetes complexity from data scientists
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made **Trainer v2** possible with their hard work and valuable feedback. We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors. See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.
Member

I would like to also highlight @ahg-g, @kannon92, and @vsoch's contributions here, since their feedback was essential while we designed the Kubeflow Trainer architecture last year together with the Batch WG.

WDYT @tenzen-y?

metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
Member

Should we also set `numNodes: 2`?
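For context, a minimal sketch of where that field would sit, assuming the v2 `TrainJob` API group (`trainer.kubeflow.org/v1alpha1`); the `runtimeRef` name is illustrative, not from the excerpt:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  runtimeRef:
    name: torch-distributed   # illustrative runtime name
  trainer:
    numNodes: 2               # the field discussed in this comment
```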

# LLM Fine-Tuning Support

Another improvement of **Trainer v2** is its **built-in support for fine-tuning large language models**, where we provide two types of trainers:
- `BuiltinTrainer` - already includes the fine-tuning logic and allows data scientists to quickly start fine-tuning requiring only parameter adjustments,
Member

We should say that in the first release we will support torchtune Runtimes for LLama models.
cc @Electronic-Waste

Member

Yes, I agree. We need to say that in the first release:

  1. We support the TorchTune LLM Trainer as one option in `BuiltinTrainer`.
  2. For the TorchTune LLM Trainer, we provide users with some runtimes (`ClusterTrainingRuntime`). Currently, we only support Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct in the manifests.

Author

thank you! updated in 336b058

Contributor

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, kubeflow/wg-training-leads.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Thanks @kramaranya, I left a few comments!
/cc @astefanutti @deepanker13 @saileshd1402 @kubeflow/wg-training-leads

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@Electronic-Waste Electronic-Waste left a comment


@kramaranya Huge thanks for this. And thank you for mentioning me, @andreyvelich. I left some suggestions regarding the LLM Fine-Tuning Support section.

# LLM Fine-Tuning Support

Another improvement of **Trainer v2** is its **built-in support for fine-tuning large language models**, where we provide two types of trainers:
- `BuiltinTrainer` - already includes the fine-tuning logic and allows data scientists to quickly start fine-tuning requiring only parameter adjustments,
Member

Yes, I agree. We need to say that in the first release:

  1. We support the TorchTune LLM Trainer as one option in `BuiltinTrainer`.
  2. For the TorchTune LLM Trainer, we provide users with some runtimes (`ClusterTrainingRuntime`). Currently, we only support Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct in the manifests.

Comment on lines 165 to 182
job_name = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            batch_size=1,
            epochs=1,
            num_nodes=5,
        ),
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="tatsu-lab/alpaca",
        )
    ),
    runtime=Runtime(
        name="torchtune-llama3.1-8b",
    ),
)
Member

Suggested change

job_name = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            batch_size=1,
            epochs=1,
            num_nodes=5,
        ),
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="tatsu-lab/alpaca",
        )
    ),
    runtime=Runtime(
        name="torchtune-llama3.1-8b",
    ),
)

job_name = client.train(
    runtime=Runtime(
        name="torchtune-llama3.2-1b"
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="<YOUR_HF_TOKEN>",  # Replace with your Hugging Face token
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            },
        )
    )
)

Maybe we need to switch to a runnable example

Member

And you can also say that, "For more details, please refer to this example".

Author

sgtm, thanks @Electronic-Waste!
updated in 336b058

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Member


The diagram below shows how different personas interact with these custom resources:

![division_of_labor](/images/2025-07-09-introducing-trainer-v2/user-personas.drawio.svg)
Member

Can you use the user-personas diagram here and delete the other one?

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=my_train_func,
Member

@kramaranya Did you get a chance to check it?

- **[Native Kueue integration](https://github.com/kubernetes-sigs/kueue/issues/3884)** - improve resource management and scheduling capabilities for TrainJob resources
- **[Model Registry integrations](https://github.com/kubeflow/trainer/issues/2245)** - export trained models directly to Model Registry

For users migrating from **Trainer v1**, check out a [**Migration Guide**](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/).
Member

Maybe we should highlight it in a separate section?
And we should also say migrating from Kubeflow Training Operator v1.

title: "Introducing Kubeflow Trainer V2"
hide: false
permalink: /trainer/intro/
author: "AutoML & Training WG"
Member

Maybe "Kubeflow Trainer Team"?

Author

sounds good to me, thanks!
wdyt @andreyvelich?

Member

Yes, Kubeflow Trainer Team sounds good!

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
author: "AutoML & Training WG"
---

Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.


nit: "hide this complexity"


**The main goals of KF Trainer v2 include:**
- Make AI/ML workloads easier to manage at scale
- Improve the Python interface


Suggested change
- Improve the Python interface
- Provide a Pythonic interface to train models


Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**


Suggested change
**The main goals of KF Trainer v2 include:**
**The main goals of Kubeflow Trainer v2 include:**

- Abstract Kubernetes complexity from AI Practitioners
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made the **Trainer v2** possible with their hard work and valuable feedback. We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors. We would also like to highlight [ahg-g](https://github.com/ahg-g), [kannon92](https://github.com/kannon92), and [vsoch](https://github.com/vsoch) whose feedback was essential while we designed the Kubeflow Trainer architecture together with the Batch WG. See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.


nit: break lines to keep one sentence per line.


**Trainer v2** leverages these Kubernetes-native improvements to reuse existing functionality rather than reinventing the wheel. This collaboration between the Kubernetes and Kubeflow communities delivers a more standardized approach to ML training on Kubernetes.

# Division of Labor


Labor sounds a bit too laborious 😃. Maybe just "User Personas" or "For AI practitioners and MLOps engineers"?

Member

@kramaranya Did you get a chance to check it?

Author

Sounds good to me :) I'm leaning towards "User Personas".
Another option I was considering was "Personas: Platform Engineers and AI Practitioners", but "User Personas" seems a better option in case we change personas later again.
cc @andreyvelich @franciscojavierarceo @tenzen-y @Electronic-Waste any preferences?

Member

Sure, User Personas make sense to me.


Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**

@astefanutti astefanutti Jul 18, 2025


It may be obvious for everyone involved in the project but it doesn't seem to me like very explicit / prominent in this article: PyTorch :)

I'd try to message that Kubeflow trainer v2 is the easiest and most scalable way to run PyTorch distributed training on Kubernetes!

Member

Agree, emphasis that PyTorch is the primary framework for us makes sense.
Let's include this as one of the main goals.
WDYT @kramaranya @Electronic-Waste @tenzen-y @franciscojavierarceo ?

Member

@kramaranya Did you get a chance to check it?

Author

Yeah, I do agree we should emphasize on this point.

I'm leaning toward modifying the current goal "Make AI/ML workloads easier to manage at scale" to be:
"Make AI/ML workloads easier to manage at scale, with PyTorch as the primary framework"

And then modify an intro:
"Running machine learning workloads on Kubernetes can be challenging. Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The Kubeflow Trainer v2 (KF Trainer) was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs."

Alternatively, we could just add a new goal with no intro changes:
"Deliver the easiest and most scalable PyTorch distributed training on Kubernetes"

What do you think @astefanutti @andreyvelich ?

Author

@andreyvelich could you also take a look at ^^, so I can update it?

Member

The description looks good.
For the goals, we can leave the goal to make AI/ML workloads easier to scale as it is, and just add another goal for PyTorch, as you said.

Author

Awesome, updated in aa4f6e7. @astefanutti please let me know what you think :)


This looks great, thanks!

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Member

@andreyvelich andreyvelich left a comment


Looks great, thanks a lot @kramaranya!


Currently, **KF Trainer v2** supports the **Co-Scheduling plugin** from [Kubernetes scheduler-plugins](https://github.com/kubernetes-sigs/scheduler-plugins) project.
**[Volcano scheduler support](https://github.com/kubeflow/trainer/pull/2672)** is coming in future releases to provide more advanced scheduling capabilities.
Member

Let's say Volcano and KAI Scheduler: kubeflow/trainer#2663

Author

thanks, updated :)


The diagram above shows how this works in practice - the **KF Trainer** automatically **handles the SSH key generation** and **MPI communication** between training pods, which allows frameworks like DeepSpeed to coordinate training across multiple GPU nodes without requiring manual configuration of inter-node communication.

# Fault Tolerance Improvements
Member

Looks great, thanks! Just added comment about KAI

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@andreyvelich
Member

Thanks for this huge effort @kramaranya!
/lgtm
/hold Let's merge it tomorrow!

/assign @tenzen-y @johnugeorge @terrytangyuan @astefanutti @franciscojavierarceo @Electronic-Waste @tarekabouzeid

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Jul 20, 2025
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@Electronic-Waste Electronic-Waste left a comment


@kramaranya Thanks for this great work! Just one nit.

.gitignore Outdated
@@ -11,3 +11,4 @@ _notebooks/.ipynb_checkpoints
.netlify
.tweet-cache
__pycache__
.idea
Member

Suggested change
.idea
.idea

Need a new blank line here

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@astefanutti

/lgtm

Thanks!

@google-oss-prow google-oss-prow bot added the lgtm label Jul 21, 2025

@eoinfennessy eoinfennessy left a comment


This is great, thanks @kramaranya!

I added a few minor suggestions and clarifying questions.


Running machine learning workloads on Kubernetes can be challenging.
Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge.
The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.


Suggestion to make this more generic.

Suggested change
The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.
The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed machine learning jobs.

Author

The PyTorch focus is intentional here because we want to emphasize that the main goal of Trainer v2 is specifically to make distributed PyTorch jobs easier to run, see #169 (comment)

# Python SDK

**The KF Trainer v2** introduces a **redesigned Python SDK**, which is intended to be the **primary interface for AI Practitioners**.
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
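To make the excerpt concrete, here is a hypothetical sketch, not from the post itself: the training code is an ordinary Python function, and the SDK ships it to the cluster. The `TrainerClient`/`CustomTrainer` shapes follow the snippets quoted elsewhere in this thread; the commented-out submission call, the `kubeflow.trainer` import path, and the `torch-distributed` runtime name are assumptions.

```python
# Hypothetical sketch: a plain Python training function the SDK would submit.

def my_train_func():
    # Toy stand-in for a real training loop; a real function would build a
    # model and use torch.distributed, which the runtime configures for you.
    loss = 1.0
    for _ in range(10):
        loss *= 0.9  # pretend each step reduces the loss by 10%
    return loss

# On a cluster with the Kubeflow SDK installed, submission would follow the
# shape quoted in this thread (commented out here because it needs a cluster,
# and the import path is an assumption):
#
# from kubeflow.trainer import TrainerClient, CustomTrainer
# client = TrainerClient()
# job_name = client.train(
#     runtime=client.get_runtime("torch-distributed"),
#     trainer=CustomTrainer(func=my_train_func),
# )

print(round(my_train_func(), 4))
```

The point is the abstraction: the same function and the same `train()` call work regardless of which Kubernetes cluster or cloud sits underneath.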


What is meant by providing a unified interface across cloud environments?

Suggested change
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
The SDK provides the same interface for multiple ML frameworks, and abstracts the underlying complexities of Kubernetes and cloud environments.

Author

@kramaranya kramaranya Jul 21, 2025


This means you can use the same SDK commands and configurations for any cloud provider, without needing to learn different APIs for each platform. I think 'a unified interface' works better here compared to 'the same interface'. wdyt
https://www.kubeflow.org/docs/components/trainer/overview/#what-is-kubeflow-trainer

Member

Unified interface makes sense to me.

# Simplified API

Previously, in the **Kubeflow Training Operator** users worked with different custom resources for each ML framework, each with their own framework-specific configurations.
The **KF Trainer v2** replaces these multiple CRDs with a **unified TrainJob API** that works with **multiple ML frameworks**.


Suggested change
The **KF Trainer v2** replaces these multiple CRDs with a **unified TrainJob API** that works with **multiple ML frameworks**.
**Kubeflow Trainer v2** replaces these multiple CRDs with a **unified TrainJob CRD** that works with **multiple ML frameworks**.

Author

To stay consistent, we should keep KF Trainer v2, and I would keep API to avoid duplication :)

Comment on lines 204 to 205
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framewor**k that allows customers to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.


I was under the impression that the pipeline framework was introduced to make it easier for Kubeflow Trainer developers to support adding new frameworks to Trainer, and was not a user-facing change.

@andreyvelich @tenzen-y do we intend to document how users can implement custom plugins?

Suggest replacing "customers" with "users":

Suggested change
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framewor**k that allows customers to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framework** that allows users to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.

Member

@andreyvelich andreyvelich Jul 21, 2025


Yes, the doc is work in progress by @IRONICBo here: kubeflow/website#4039

@kramaranya @eoinfennessy Maybe we could be more explicit here, and say that allows platform administrators to extend the Plugins ... ?

Author

makes sense to me

Author

updated in 4280688

Contributor

@eoinfennessy: changing LGTM is restricted to collaborators

In response to this:

This is great, thanks @kramaranya!

I added a few minor suggestions and clarifying questions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Contributor

New changes are detected. LGTM label has been removed.

@google-oss-prow google-oss-prow bot removed the lgtm label Jul 21, 2025

![user_personas](/images/2025-07-09-introducing-trainer-v2/user-personas.drawio.svg)

- **Platform Engineers** define and manage **the infrastructure configurations** required for training jobs using `TrainingRuntimes` or `ClusterTrainingRuntimes`.
Author

Hm, @andreyvelich should it actually be Platform Administrators?

Member

Yes, let's keep the persona name consistent please.

Author

thanks, updated in 4280688

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Successfully merging this pull request may close these issues.

Create Blog Post Introducing Kubeflow Trainer V2