blog: Add post on introducing Kubeflow Trainer V2 #169

Open
wants to merge 17 commits into base: master

Conversation

kramaranya

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Member

@tarekabouzeid tarekabouzeid left a comment


Great work! Thank you so much @kramaranya

Member

@andreyvelich andreyvelich left a comment


Thanks @kramaranya, I left a few comments!
/cc @astefanutti @deepanker13 @saileshd1402 @kubeflow/wg-training-leads

- Abstract Kubernetes complexity from data scientists
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made **Trainer v2** possible with their hard work and valuable feedback. We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors. See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.
Member

I would like to also highlight @ahg-g, @kannon92, and @vsoch's contributions here, since their feedback was essential while we designed the Kubeflow Trainer architecture last year together with the Batch WG.

WDYT @tenzen-y?

metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
Member

Should we also set `numNodes: 2`?
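For context, a minimal sketch of where that field would sit, assuming the v2 `TrainJob` API group (`trainer.kubeflow.org/v1alpha1`); the `runtimeRef` name is illustrative, not from the excerpt:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  runtimeRef:
    name: torch-distributed   # illustrative runtime name
  trainer:
    numNodes: 2               # the field discussed in this comment
```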

# LLM Fine-Tuning Support

Another improvement of **Trainer v2** is its **built-in support for fine-tuning large language models**, where we provide two types of trainers:
- `BuiltinTrainer` - already includes the fine-tuning logic and allows data scientists to quickly start fine-tuning requiring only parameter adjustments,
Member

We should say that in the first release we will support torchtune Runtimes for LLama models.
cc @Electronic-Waste

Member

Yes, I agree. We need to say that in the first release:

  1. We support the TorchTune LLM Trainer as one option in `BuiltinTrainer`.
  2. For the TorchTune LLM Trainer, we provide users with some runtimes (`ClusterTrainingRuntime`). Currently, we only support Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct in the manifests.

Author

thank you! updated in 336b058

Contributor

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, kubeflow/wg-training-leads.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Thanks @kramaranya, I left a few comments!
/cc @astefanutti @deepanker13 @saileshd1402 @kubeflow/wg-training-leads

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@Electronic-Waste Electronic-Waste left a comment


@kramaranya Huge thanks for this. And thank you for mentioning me, @andreyvelich. I left some suggestions regarding the LLM Fine-Tuning Support section.

# LLM Fine-Tuning Support

Another improvement of **Trainer v2** is its **built-in support for fine-tuning large language models**, where we provide two types of trainers:
- `BuiltinTrainer` - already includes the fine-tuning logic and allows data scientists to quickly start fine-tuning requiring only parameter adjustments,
Member

Yes, I agree. We need to say that in the first release:

  1. We support the TorchTune LLM Trainer as one option in `BuiltinTrainer`.
  2. For the TorchTune LLM Trainer, we provide users with some runtimes (`ClusterTrainingRuntime`). Currently, we only support Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct in the manifests.

Comment on lines 165 to 182
job_name = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            batch_size=1,
            epochs=1,
            num_nodes=5,
        ),
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="tatsu-lab/alpaca",
        )
    ),
    runtime=Runtime(
        name="torchtune-llama3.1-8b",
    ),
)
Member

Suggested change

job_name = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            batch_size=1,
            epochs=1,
            num_nodes=5,
        ),
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="tatsu-lab/alpaca",
        )
    ),
    runtime=Runtime(
        name="torchtune-llama3.1-8b",
    ),
)

job_name = client.train(
    runtime=Runtime(
        name="torchtune-llama3.2-1b"
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="<YOUR_HF_TOKEN>",  # Replace with your Hugging Face token
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            },
        )
    )
)

Maybe we need to switch to a runnable example

Member

And you can also say that, "For more details, please refer to this example".

Author

sgtm, thanks @Electronic-Waste!
updated in 336b058

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Member


The diagram below shows how different personas interact with these custom resources:

![division_of_labor](/images/2025-07-09-introducing-trainer-v2/user-personas.drawio.svg)
Member

Can you use the user-personas diagram here and delete the other one?

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=my_train_func,
Member

@kramaranya Did you get a chance to check it?

- **[Native Kueue integration](https://github.com/kubernetes-sigs/kueue/issues/3884)** - improve resource management and scheduling capabilities for TrainJob resources
- **[Model Registry integrations](https://github.com/kubeflow/trainer/issues/2245)** - export trained models directly to Model Registry

For users migrating from **Trainer v1**, check out a [**Migration Guide**](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/).
Member

Maybe we should highlight it in a separate section?
And we should also say migrating from Kubeflow Training Operator v1.

title: "Introducing Kubeflow Trainer V2"
hide: false
permalink: /trainer/intro/
author: "AutoML & Training WG"
Member

Maybe "Kubeflow Trainer Team"?

Author

sounds good to me, thanks!
wdyt @andreyvelich?

Member

Yes, Kubeflow Trainer Team sounds good!

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Signed-off-by: kramaranya <kramaranya15@gmail.com>
author: "AutoML & Training WG"
---

Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.


nit: "hide this complexity"


**The main goals of KF Trainer v2 include:**
- Make AI/ML workloads easier to manage at scale
- Improve the Python interface


Suggested change
- Improve the Python interface
- Provide a Pythonic interface to train models


Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**


Suggested change
**The main goals of KF Trainer v2 include:**
**The main goals of Kubeflow Trainer v2 include:**

- Abstract Kubernetes complexity from AI Practitioners
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made the **Trainer v2** possible with their hard work and valuable feedback. We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors. We would also like to highlight [ahg-g](https://github.com/ahg-g), [kannon92](https://github.com/kannon92), and [vsoch](https://github.com/vsoch) whose feedback was essential while we designed the Kubeflow Trainer architecture together with the Batch WG. See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.


nit: break lines to keep one sentence per line.


**Trainer v2** leverages these Kubernetes-native improvements to reuse existing functionality rather than reinventing the wheel. This collaboration between the Kubernetes and Kubeflow communities delivers a more standardized approach to ML training on Kubernetes.

# Division of Labor


Labor sounds a bit too laborious 😃. Maybe just "User Personas" or "For AI practitioners and MLOps engineers"?

Member

@kramaranya Did you get a chance to check it?

Author

Sounds good to me :) I'm leaning towards "User Personas".
Another option I was considering was "Personas: Platform Engineers and AI Practitioners", but "User Personas" seems a better option in case we change personas later again.
cc @andreyvelich @franciscojavierarceo @tenzen-y @Electronic-Waste any preferences?

Member

Sure, User Personas make sense to me.


Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**

@astefanutti astefanutti Jul 18, 2025


It may be obvious for everyone involved in the project but it doesn't seem to me like very explicit / prominent in this article: PyTorch :)

I'd try to message that Kubeflow trainer v2 is the easiest and most scalable way to run PyTorch distributed training on Kubernetes!

Member

Agree, emphasis that PyTorch is the primary framework for us makes sense.
Let's include this as one of the main goals.
WDYT @kramaranya @Electronic-Waste @tenzen-y @franciscojavierarceo ?

Member

@kramaranya Did you get a chance to check it?

Author

Yeah, I do agree we should emphasize on this point.

I'm leaning toward modifying the current goal "Make AI/ML workloads easier to manage at scale" to be:
"Make AI/ML workloads easier to manage at scale, with PyTorch as the primary framework"

And then modify an intro:
"Running machine learning workloads on Kubernetes can be challenging. Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The Kubeflow Trainer v2 (KF Trainer) was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs."

Alternatively, we could just add a new goal with no intro changes:
"Deliver the easiest and most scalable PyTorch distributed training on Kubernetes"

What do you think @astefanutti @andreyvelich ?

Author

@andreyvelich could you also take a look at ^^, so I can update it?

Member

The description looks good.
For the goals, we can leave the goal to make AI/ML workloads easier to scale as it is, and just add another goal for PyTorch, as you said.

Author

Awesome, updated in aa4f6e7. @astefanutti please let me know what you think :)


This looks great, thanks!

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Member

@andreyvelich andreyvelich left a comment


Looks great, thanks a lot @kramaranya!


Currently, **KF Trainer v2** supports the **Co-Scheduling plugin** from [Kubernetes scheduler-plugins](https://github.com/kubernetes-sigs/scheduler-plugins) project.
**[Volcano scheduler support](https://github.com/kubeflow/trainer/pull/2672)** is coming in future releases to provide more advanced scheduling capabilities.
Member

Let's say Volcano and KAI Scheduler: kubeflow/trainer#2663

Author

thanks, updated :)


The diagram above shows how this works in practice - the **KF Trainer** automatically **handles the SSH key generation** and **MPI communication** between training pods, which allows frameworks like DeepSpeed to coordinate training across multiple GPU nodes without requiring manual configuration of inter-node communication.

# Fault Tolerance Improvements
Member

Looks great, thanks! Just added comment about KAI

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@andreyvelich
Member

Thanks for this huge effort @kramaranya!
/lgtm
/hold Let's merge it tomorrow!

/assign @tenzen-y @johnugeorge @terrytangyuan @astefanutti @franciscojavierarceo @Electronic-Waste @tarekabouzeid

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Jul 20, 2025
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@Electronic-Waste Electronic-Waste left a comment


@kramaranya Thanks for this great work! Just one nit.

.gitignore Outdated
@@ -11,3 +11,4 @@ _notebooks/.ipynb_checkpoints
.netlify
.tweet-cache
__pycache__
.idea
Member

Suggested change
.idea
.idea

Need a new blank line here

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@astefanutti

/lgtm

Thanks!

@google-oss-prow google-oss-prow bot added the lgtm label Jul 21, 2025

@eoinfennessy eoinfennessy left a comment


This is great, thanks @kramaranya!

I added a few minor suggestions and clarifying questions.


Running machine learning workloads on Kubernetes can be challenging.
Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge.
The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.


Suggestion to make this more generic.

Suggested change
The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.
The **Kubeflow Trainer v2 (KF Trainer)** was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed machine learning jobs.

Author

The PyTorch focus is intentional here because we want to emphasize that the main goal of Trainer v2 is specifically to make distributed PyTorch jobs easier to run, see #169 (comment)

# Python SDK

**The KF Trainer v2** introduces a **redesigned Python SDK**, which is intended to be the **primary interface for AI Practitioners**.
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
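To make the excerpt concrete, here is a hypothetical sketch, not from the post itself: the training code is an ordinary Python function, and the SDK ships it to the cluster. The `TrainerClient`/`CustomTrainer` shapes follow the snippets quoted elsewhere in this thread; the commented-out submission call, the `kubeflow.trainer` import path, and the `torch-distributed` runtime name are assumptions.

```python
# Hypothetical sketch: a plain Python training function the SDK would submit.

def my_train_func():
    # Toy stand-in for a real training loop; a real function would build a
    # model and use torch.distributed, which the runtime configures for you.
    loss = 1.0
    for _ in range(10):
        loss *= 0.9  # pretend each step reduces the loss by 10%
    return loss

# On a cluster with the Kubeflow SDK installed, submission would follow the
# shape quoted in this thread (commented out here because it needs a cluster,
# and the import path is an assumption):
#
# from kubeflow.trainer import TrainerClient, CustomTrainer
# client = TrainerClient()
# job_name = client.train(
#     runtime=client.get_runtime("torch-distributed"),
#     trainer=CustomTrainer(func=my_train_func),
# )

print(round(my_train_func(), 4))
```

The point is the abstraction: the same function and the same `train()` call work regardless of which Kubernetes cluster or cloud sits underneath.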


What is meant by providing a unified interface across cloud environments?

Suggested change
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
The SDK provides the same interface for multiple ML frameworks, and abstracts the underlying complexities of Kubernetes and cloud environments.

Author

@kramaranya kramaranya Jul 21, 2025


This means you can use the same SDK commands and configurations for any cloud provider, without needing to learn different APIs for each platform. I think 'a unified interface' works better here compared to 'the same interface'. wdyt
https://www.kubeflow.org/docs/components/trainer/overview/#what-is-kubeflow-trainer

Member

Unified interface makes sense to me.

# Simplified API

Previously, in the **Kubeflow Training Operator** users worked with different custom resources for each ML framework, each with their own framework-specific configurations.
The **KF Trainer v2** replaces these multiple CRDs with a **unified TrainJob API** that works with **multiple ML frameworks**.


Suggested change
The **KF Trainer v2** replaces these multiple CRDs with a **unified TrainJob API** that works with **multiple ML frameworks**.
**Kubeflow Trainer v2** replaces these multiple CRDs with a **unified TrainJob CRD** that works with **multiple ML frameworks**.

Author

To stay consistent, we should keep KF Trainer v2, and I would keep API to avoid duplication :)

Comment on lines 204 to 205
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framewor**k that allows customers to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.


I was under the impression that the pipeline framework was introduced to make it easier for Kubeflow Trainer developers to support adding new frameworks to Trainer, and was not a user-facing change.

@andreyvelich @tenzen-y do we intend to document how users can implement custom plugins?

Suggest replacing "customers" with "users":

Suggested change
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framewor**k that allows customers to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framework** that allows users to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.

Member

@andreyvelich andreyvelich Jul 21, 2025


Yes, the doc is work in progress by @IRONICBo here: kubeflow/website#4039

@kramaranya @eoinfennessy Maybe we could be more explicit here, and say that allows platform administrators to extend the Plugins ... ?

Author

makes sense to me

Author

updated in 4280688

Contributor

@eoinfennessy: changing LGTM is restricted to collaborators

In response to this:

This is great, thanks @kramaranya!

I added a few minor suggestions and clarifying questions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Contributor

New changes are detected. LGTM label has been removed.

@google-oss-prow google-oss-prow bot removed the lgtm label Jul 21, 2025

![user_personas](/images/2025-07-09-introducing-trainer-v2/user-personas.drawio.svg)

- **Platform Engineers** define and manage **the infrastructure configurations** required for training jobs using `TrainingRuntimes` or `ClusterTrainingRuntimes`.
Author

Hm, @andreyvelich should it actually be Platform Administrators?

Member

Yes, let's keep the persona name consistent please.

Author

thanks, updated in 4280688

Signed-off-by: kramaranya <kramaranya15@gmail.com>
Successfully merging this pull request may close these issues.

Create Blog Post Introducing Kubeflow Trainer V2