Distributed ML Training and Fine-Tuning on Kubernetes
APACHE-2.0 License
Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, TensorFlow, HuggingFace, JAX, DeepSpeed, XGBoost, PaddlePaddle and others.
You can run high-performance computing (HPC) tasks with the Training Operator and MPIJob, since it supports running Message Passing Interface (MPI) on Kubernetes, which is heavily used for HPC.
The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version, please follow this guide to install MPI Operator V2.
The Training Operator lets you use Kubernetes workloads to train large models effectively, either through the Kubernetes Custom Resource APIs or through the Training Operator Python SDK.
Please check the official Kubeflow documentation for prerequisites to install the Training Operator.
Please follow the Kubeflow Training Operator guide for the detailed instructions on how to install Training Operator.
Run the following command to install the latest stable release of the Training Operator control plane, v1.8.0:
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.0"
Run the following command to install the latest changes of the Training Operator control plane:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
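When the install is scripted, it can help to derive the pinned command from a single version variable. The sketch below only assembles the same command string as the stable-release command above; the helper name `install_command` is hypothetical and not part of the Training Operator.

```python
# Repository path taken from the pinned install command above.
BASE = "github.com/kubeflow/training-operator.git/manifests/overlays/standalone"


def install_command(ref: str) -> str:
    """Return the kubectl command that installs the standalone overlay at `ref`.

    `ref` is a Git tag or branch name, for example "v1.8.0".
    """
    if not ref:
        raise ValueError("ref must be a non-empty tag or branch name")
    return f'kubectl apply -k "{BASE}?ref={ref}"'


print(install_command("v1.8.0"))
```

Keeping the version in one place makes it easier to upgrade the control plane and any related tooling together.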
The Training Operator implements a Python SDK to simplify creation of distributed training and fine-tuning jobs for Data Scientists.
Run the following command to install the latest stable release of the Training SDK:
pip install -U kubeflow-training
Please refer to the getting started guide to quickly create your first distributed training job using the Python SDK.
If you want to work directly with Kubernetes Custom Resources provided by Training Operator, follow the PyTorchJob MNIST guide.
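To give a feel for what such a custom resource looks like, here is a minimal sketch that builds a PyTorchJob manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The field names reflect the `kubeflow.org/v1` PyTorchJob layout as commonly documented, so check them against the API reference before relying on them. The helper name, job name, and image are placeholders, not values from this project.

```python
import json


def pytorchjob_manifest(name: str, image: str, workers: int) -> dict:
    """Build a PyTorchJob with one master and `workers` worker replicas."""

    def replica(count: int) -> dict:
        # Each replica type carries a replica count and a regular pod template.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    # Assumption: the training container is named "pytorch".
                    "containers": [{"name": "pytorch", "image": image}]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Placeholder image: replace with an image that contains your training code.
manifest = pytorchjob_manifest(
    "pytorch-example", "example.com/my-training-image:latest", workers=2
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` would submit the job, assuming the Training Operator is already installed in the cluster.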
The following links provide information on how to get involved in the community:
#kubeflow-training Slack channel.
This is a part of Kubeflow, so please see the README in kubeflow/kubeflow to get in touch with the community.
Please refer to the CONTRIBUTING guide.
Please refer to the CHANGELOG.
The following table lists the most recent few versions of the operator.
Operator Version | API Version | Kubernetes Version |
---|---|---|
v1.4.x | v1 | 1.23+ |
v1.5.x | v1 | 1.23+ |
v1.6.x | v1 | 1.23+ |
v1.7.x | v1 | 1.25+ |
v1.8.x | v1 | 1.27+ |
latest (master HEAD) | v1 | 1.27+ |
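The version table can be turned into a small pre-install check. The following is a minimal sketch that copies the minimum Kubernetes versions from the table and compares them against a cluster version string; the function names are hypothetical, and the "latest (master HEAD)" row is left out because it has no fixed version number.

```python
import re

# Minimum Kubernetes (major, minor) per operator release line, from the table above.
MIN_KUBERNETES = {
    "v1.4": (1, 23),
    "v1.5": (1, 23),
    "v1.6": (1, 23),
    "v1.7": (1, 25),
    "v1.8": (1, 27),
}


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as 'v1.8.0' or '1.27.3'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(operator_version: str, kubernetes_version: str) -> bool:
    """Return True if the cluster meets the minimum for the operator release."""
    major, minor = parse_major_minor(operator_version)
    release_line = f"v{major}.{minor}"
    if release_line not in MIN_KUBERNETES:
        raise KeyError(f"operator release {release_line} is not in the table")
    return parse_major_minor(kubernetes_version) >= MIN_KUBERNETES[release_line]


print(is_supported("v1.8.0", "1.26.3"))
print(is_supported("v1.7.0", "v1.25.0"))
```

The first call reports an unsupported combination (Kubernetes 1.26 is below the 1.27 minimum for v1.8.x), while the second meets the 1.25 minimum for v1.7.x.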
For a complete reference of the custom resource definitions, please refer to the API Definition.
For details on the Training Operator custom resource APIs, refer to the API documentation.
This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow Training Operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.