training-operator | Python Ecosystem Directory

training-operator - v1.8.0-rc.0 release (Latest Release)

Published by johnugeorge 6 months ago

New features

Train/Fine-tune API Proposal for LLMs #1945 (deepanker13)
Adding Training image needed for train api #1963 (deepanker13)
[SDK] Train API #1962 (deepanker13)
Train api dataset download changes #1959 (deepanker13)
Train api init container creation #1958 (deepanker13)
Publish trainer hugging face image #1985 (deepanker13)
Support arm64 for Hugging Face trainer #2028 (tariq-hasan)
Modify LLM Trainer to support BERT and Tiny LLaMA #2031 (andreyvelich)
Implement webhook validations for the PyTorchJob #2035 (tenzen-y)
Implement webhook validations for the XGBoostJob #2052 (tenzen-y)
Implement webhook validation for the TFJob #2051 (tenzen-y)
Implement webhook warnings for the MXJob #2058 (tenzen-y)
Implement webhook validations for the PaddleJob #2057 (tenzen-y)
Fail job for non-retryable exit codes #2071 (kellyaa)
Adding fine tune example with s3 as the dataset store #2006 (deepanker13)

Bug fixes

fix nproc env in elastic mode for pytorchjob #1948 (kuizhiqing)
IsMasterRole fix in pytorchjob controller #1969 (deepanker13)
fix: volcano podgroup should has a non-empty queue name #1977 (lowang-bh)
Fix Master Label for PyTorchJob #1974 (andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob #1988 (andreyvelich)
Fix import for HuggingFace Dataset Provider #2085 (andreyvelich)
Upgrade controller-gen to v0.14.0 #2026 (champon1020)
Fix Distributed Data Samplers in PyTorch Examples #2012 (andreyvelich)
Fix URL in python SDK setup.py #2011 (garymm)

Misc

Adding parallel support for coveralls #1956 (johnugeorge)
torchrun example with cpu version pytorch #1965 (kuizhiqing)
[SDK] Get Kubernetes Events for Job #1975 (andreyvelich)
[SDK] Add information about TrainingClient logging #1973 (andreyvelich)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode #2067 (tenzen-y)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 #2066 (tenzen-y)
Test: Simplify and Identify pod-controller envtest #2084 (tenzen-y)
E2E: Replace outdated images with latest ones #2083 (tenzen-y)
Upgrade scheduler-plugins to v0.28.9 #2065 (tenzen-y)

training-operator - v1.7.0 release

Published by johnugeorge 12 months ago

Breaking Changes

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Upgrade the kubernetes dependencies to v1.27 https://github.com/kubeflow/training-operator/pull/1834 (tenzen-y)

New features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

training-operator - v1.7.0-rc.0 release

Published by johnugeorge about 1 year ago

Breaking Changes

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Upgrade the kubernetes dependencies to v1.27 https://github.com/kubeflow/training-operator/pull/1834 (tenzen-y)

New features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

training-operator - v1.6.0 release

Published by johnugeorge over 1 year ago

Note: Since scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1773

Note: The latest Python SDK (version 1.6) does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702
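
The minimum-version requirement in the note above can be checked before using the SDK. The following is a minimal sketch, not part of the SDK: the helper names `parse_version` and `operator_is_supported` are hypothetical, the tag format (`vX.Y.Z` with an optional `-rc.N` suffix) is taken from the release names on this page, and treating a release candidate as older than its final release is an assumption.

```python
import re

# Minimum operator version stated in the note above.
MIN_OPERATOR_VERSION = "v1.6.0"


def parse_version(tag: str):
    """Turn a tag such as 'v1.6.0-rc.1' into a tuple that sorts correctly."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-rc\.(\d+))?", tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    major, minor, patch, rc = match.groups()
    # Assumption: a final release sorts after all of its release candidates.
    rc_key = (1, 0) if rc is None else (0, int(rc))
    return (int(major), int(minor), int(patch)) + rc_key


def operator_is_supported(operator_tag: str, minimum: str = MIN_OPERATOR_VERSION) -> bool:
    """Return True if the operator tag is at least the minimum version."""
    return parse_version(operator_tag) >= parse_version(minimum)
```

For example, `operator_is_supported("v1.5.0")` is false, while `operator_is_supported("v1.7.0-rc.0")` is true under the ordering assumed here.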

New Features

Support for k8s v1.25 in CI #1684 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
Adopting coscheduling plugin #1724 (tenzen-y)
Support for Paddlepaddle #1675 (kuizhiqing)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
[SDK] Create Unify Training Client #1719 (andreyvelich)

Bug fixes

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Fix XGBoost conditions bug #1737 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Fix status lost #1697 (ggaaooppeenngg)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)

Misc

Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
Fix Python installation in CI #1759 (tenzen-y)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Update join Slack link #1750 (Syulin7)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
Add Yuki to reviewer group #1739 (johnugeorge)
Trim down CRD descriptions #1735 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Removing deprecated Job Labels #1702 (johnugeorge)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Update deployment.yaml #1668 (OmriShiv)
Upgrade Go version to v1.19 #1663 (tenzen-y)
Upgrade kubernetes version for test #1667 (tenzen-y)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
Update the cmd to support MPI operator in ReadME #1656 (denkensk)
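
The DNS1035 validation listed in this section (#1748) concerns job names that must be valid DNS-1035 labels. As a rough illustration of that naming rule only, here is a minimal sketch of the label check as Kubernetes defines it (lowercase alphanumerics or '-', starting with a letter, ending with an alphanumeric, at most 63 characters). It is not the operator's Go implementation, and the function name `is_dns1035_label` is hypothetical.

```python
import re

# DNS-1035 label rule: starts with a lowercase letter, continues with
# lowercase alphanumerics or '-', and ends with an alphanumeric.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name: str) -> bool:
    """Return True if name is a syntactically valid DNS-1035 label."""
    if len(name) > DNS1035_MAX_LENGTH:
        return False
    return DNS1035_LABEL.fullmatch(name) is not None
```

A name such as `pytorch-mnist` passes this check, while `1-job` (starts with a digit) and `My_Job` (uppercase letters and an underscore) do not.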

Closed issues:

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
*job API(master) cannot compatible with old job #1725
Support coscheduling plugin #1722
Number of worker threads used by the controller can't be configured #1706
Conformance: Training tests #1698
PyTorch and MPI Operator pulls hardcoded initContainer #1696
PaddlePaddle Training: why can't find pods #1694
Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
[SDK] Create unify client for all Training Job types #1691
Support Kubernetes v1.25 #1682
panic happened when add podgroup watch #1679
OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
Change Kubernetes version for test #1665
Support for multiplatform container image (amd64 and arm64) #1664
Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
After setting hostNetwork to true, mpi does not work #1657
What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
When will MPIJob support v2beta1 version? #1653
Kubernetes HPA doesn't work with elastic PytorchJob #1645
training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
Training operator fails to create HPA for TorchElastic jobs #1626
Release v1.5.0 tracking #1622
upgrade client-go #1599
training-operator may need to monitor PodGroup #1574
Error: invalid memory address or nil pointer dereference #1553
The pytorchJob training is slow #1532
pytorch elastic scheduler error #1504

training-operator - v1.6.0-rc.1 release

Published by johnugeorge over 1 year ago

Note: Since scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

Merged pull requests:

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Fix Python installation in CI #1759 (tenzen-y)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Update join Slack link #1750 (Syulin7)
Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
Add Yuki to reviewer group #1739 (johnugeorge)
Fix XGBoost conditions bug #1737 (tenzen-y)
Add E2E test for gang-scheduling #1736 (tenzen-y)
Trim down CRD descriptions #1735 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
Adopting coscheduling plugin #1724 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
[SDK] Create Unify Training Client #1719 (andreyvelich)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
Removing deprecated Job Labels #1702 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Fix status lost #1697 (ggaaooppeenngg)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Support for k8s v1.25 in CI #1684 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
[PaddlePaddle] support paddlejob #1675 (kuizhiqing)
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Update deployment.yaml #1668 (OmriShiv)
Upgrade kubernetes version for test #1667 (tenzen-y)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
Upgrade Go version to v1.19 #1663 (tenzen-y)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
Update the cmd to support MPI operator in ReadME #1656 (denkensk)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)

Closed issues:

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
*job API(master) cannot compatible with old job #1725
Support coscheduling plugin #1722
Number of worker threads used by the controller can't be configured #1706
Conformance: Training tests #1698
PyTorch and MPI Operator pulls hardcoded initContainer #1696
PaddlePaddle Training: why can't find pods #1694
Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
[SDK] Create unify client for all Training Job types #1691
Support Kubernetes v1.25 #1682
panic happened when add podgroup watch #1679
OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
Change Kubernetes version for test #1665
Support for multiplatform container image (amd64 and arm64) #1664
Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
After setting hostNetwork to true, mpi does not work #1657
What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
When will MPIJob support v2beta1 version? #1653
Kubernetes HPA doesn't work with elastic PytorchJob #1645
training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
Training operator fails to create HPA for TorchElastic jobs #1626
Release v1.5.0 tracking #1622
upgrade client-go #1599
training-operator may need to monitor PodGroup #1574
Error: invalid memory address or nil pointer dereference #1553
The pytorchJob training is slow #1532
pytorch elastic scheduler error #1504

training-operator - v1.6.0-rc.0 release

Published by johnugeorge over 1 year ago

v1.6.0-rc.0 release

training-operator - v1.5.0 release

Published by johnugeorge about 2 years ago

New Features

Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob #1610 (tenzen-y)
Add all generation tools to Makefile #1609 (johnugeorge)
Adding MPI python sdk #1608 (johnugeorge)
Adding XGboost Python sdk #1607 (johnugeorge)
Generating MPI python sdk #1606 (johnugeorge)
Update k8s dependencies to v0.24.1 #1604 (johnugeorge)
Migrate test framework to GHA #1603 (johnugeorge)
Add mpi in update-codegen.sh #1600 (ggaaooppeenngg)
MXNet SDK with Status check fix #1618 (johnugeorge)

Bug Fixes

fix: MPIJob worker still running when NotEnoughResources #1621 (hackerboy01)
fix comments for pytorch-controller #1620 (hackerboy01)
fix: requeue when expire time is not up yet #1614 (Garrybest)
Look for fully-qualified job role label in Python sdk #1588 (person142)
fix torch env typo #1573 (kuizhiqing)
Restart job on failure for Always,OnFailure Policy #1572 (georgkaleido)
Increase success threshold #1568 (haoxins)
update status.startTime for pytorchjob and xgboostjob #1567 (cheimu)
fix: add mpijobs to kubeflow training role #1565 (henrysecond1)
fix PyTorchJob status inaccuracy when task replica scale down #1593 (PeterChg)
fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set #1557 (cheimu)
fix api reader issue #1551 (zw0610)
fix label and CleanPodPolicy for mpi-controller #1550 (zw0610)
fix UpdateJobStatusInApiServer when gang-scheduling is enabled #1549 (zw0610)
fix: add namespace filtering when getting pods/services for jobs #1545 (henrysecond1)
fix: set mpijob runPolicy.cleanPodPolicy to default none #1554 (cheimu)

Misc

Update training controller image to latest #1625 (johnugeorge)
Update SDK version to 1.5.0 #1624 (johnugeorge)
Upgrade common to v0.4.3 #1623 (johnugeorge)
Adding GHA for automatic image build and push #1615 (johnugeorge)
Remove presubmit test depending on optional-test-infra #1596 (aws-kf-ci-bot)
chore: stop action on first fail #1595 (jasonliu747)
update img url in design doc #1591 (zw0610)
Remove uncalled mpi-controller DeletePodsAndServices() #1558 (cheimu)
Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy #1556 (cheimu)
Remove table-logger dependency #1544 (person142)
Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf_operator #1542 (dependabot[bot])

training-operator - v1.5.0-rc.0 release

Published by johnugeorge over 2 years ago

Closed issues:

MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617
unable to fetch TFJob when I use client.go run tfjob #1612
Pytorchjob dist-mnist no training logs #1601
kubectl get tfjob -o yaml, but not status output #1598
missing image in tf_job_design_doc.md #1590
Labels in Python client are out of date #1587
PyTorchJob Pods "Not Ready" After Completing Training #1577
cannot use "github.com/go-openapi/spec".Schema{...} (type "github.com/go-openapi/spec".Schema) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value #1576
PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570
pytorchjob doesn't have status.startTIme. #1566
Optional-test-infra Deprecation Notice - Training #1561
Should we update MPIJob unit test CleanPodPolicy field? #1555
--enable-gang-scheduling=true doesn't work for MPIJob #1548
PyTorchJob fails when creating a task with a different namespace but the same name #1543
Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: "null" after enable-gang-scheduling #1538
Job TTLs not working #1533
Support PodGroup in scheduler-plugins/coscheduling #1518
support elastic training #1515
Modified the configuration of RootLogger #1514
Add checking import order in CI #1510
Scale down of pytorchJob cause workers pod to restart #1509
Support label selector based success/failure conditions #1507
[feat] Support SuccessPolicy in PyTorchJob #1505
pytorch elastic scheduler error #1504
Could you add the example of MPIJob in this repository #1502
[Feature] Create a Informer/ClientSet for PyTorch Jobs #1499
[feature] Make init container injection logic available to all jobs #1498
Roadmaps for 1.4 release #1496
[bug] (MpiJob)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. #1494
Reconcile PyTorch Job error Operation cannot be fulfilled on pytrochjobs.kubeflow.org #1492
Python PytorchJob: no attribute openapi_types for example code #1481
PyTorch DistributedDataParallel training with multi nodes #1475
Installing kubeflow-training breaks import for other kubeflow packages (katib, fairing, etc.) #1471
Deprecate ksonnet and use python/golang to submit jobs #1468
Help Wanted in ParameterServerStrategy Example. #1459
Bug: SomeTimes Coredumped using tfjob #1456
[question] PyTorchJob MNIST example training speed #1454
tfjob status not match when EnableDynamicWorker set true #1452
training-operator set scheduler error #1447
[sdk]: Replace TableLogger component in the SDK for better support with ipykernel>=6.x #1446
SDK: wait_for_job reports typeError #1445
Update prometheus monitoring doc #1443
Master branch should provide a nightly image #1433
Clean up test folder before testing #1429
Clean up TF specific docs #1424
[feature] Support SchedulingPolicy in PyTorchJob #1414
Hyperlinks in the "Overview" section is incorrect/not found #1411
add workqueue metric #1407
Validation fails for MXJob Tune example #1402
Rate exceeded for aws ecr image #1400
change layout to follow the standard of kubebuilder? #1397
[example] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist #1393
Update kubeflow/website for 1.4 release #1392
Cut beta release of tf-operator for 1.4 release #1385
"invalid memory address or nil pointer dereference" #1382
some questions about job sync #1379
Provides a default Grafana dashboard #1376
[feature] Support different PS/worker types #1369
Need to copy all (mainly pytorch) framework's example dir to tf-operator/examples #1366
Add more CRD validations markers to block invalid job on client apply #1363
Update presubmit and post submit job triggers #1354
Optimize post submit jobs flow #1353
Enable leader election in controller manager using controllermanagerconfig #1350
Support mpi jobs in universal operator #1345
post-submit job failure in master branch #1343
Improve observability of universal operator #1340
Best practice to organize main.go and Dockerfile? #1333
Should training operator keep clientset in the same repository? #1332
Test image has incorrect tag? #1329
Prepare e2e tests for all frameworks #1323
Reduce e2e replica-restart-policy-tests running time #1319
Improve logs structure by consolidating libs from controller runtime and controllers #1313
Enable tests for all frameworks #1311
[bug] The pod will be recreated until the expectation expires #1306
Upgrade CRDs to apiextensions.k8s.io/v1 #1304
Add role details as new columns to kubectl get jobs output for CRD. #1301
How to handle long pending pods in a TF-job? #1282
Could you release a new version of Python SDK #1279
Update swagger.json schema for TFJobSpec to include RunPolicy #1278
Not able to pass environment variable from tfjob to pod #1273
v1_time.py is not generated by hack/python-sdk/gen-sdk.sh #1271
Add a step to upload artifact #1258
[feature] Support multi port in TFJob #1251
[feat] Add scale subresource #1220
Pod get re-created after it exited and get garbage collected #1186
Clean up vendor dependencies #1162

Merged pull requests:

Update training controller image to latest #1625 (johnugeorge)
Update SDK version to 1.5.0 #1624 (johnugeorge)
Upgrade common to v0.4.3 #1623 (johnugeorge)
fix: MPIJob worker still running when NotEnoughResources #1621 (hackerboy01)
fix comments for pytorch-controller #1620 (hackerboy01)
MXNet SDK with Status check fix #1618 (johnugeorge)
Adding GHA for automatic image build and push #1615 (johnugeorge)
fix: requeue when expire time is not up yet #1614 (Garrybest)
Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob #1610 (tenzen-y)
Add all generation tools to Makefile #1609 (johnugeorge)
Adding MPI python sdk #1608 (johnugeorge)
Adding XGboost Python sdk #1607 (johnugeorge)
Generating MPI python sdk #1606 (johnugeorge)
Update k8s dependencies to v0.24.1 #1604 (johnugeorge)
Migrate test framework to GHA #1603 (johnugeorge)
Add mpi in update-codegen.sh #1600 (ggaaooppeenngg)
Remove presubmit test depending on optional-test-infra #1596 (aws-kf-ci-bot)
chore: stop action on first fail #1595 (jasonliu747)
fix PyTorchJob status inaccuracy when task replica scale down #1593 (PeterChg)
update img url in design doc #1591 (zw0610)
Look for fully-qualified job role label in Python sdk #1588 (person142)
fix torch env typo #1573 (kuizhiqing)
Restart job on failure for Always,OnFailure Policy #1572 (georgkaleido)
Increase success threshold #1568 (haoxins)
update status.startTime for pytorchjob and xgboostjob #1567 (cheimu)
fix: add mpijobs to kubeflow training role #1565 (henrysecond1)
Remove uncalled mpi-controller DeletePodsAndServices() #1558 (cheimu)
fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set #1557 (cheimu)
Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy #1556 (cheimu)
fix: set mpijob runPolicy.cleanPodPolicy to default none #1554 (cheimu)
fix api reader issue #1551 (zw0610)
fix label and CleanPodPolicy for mpi-controller #1550 (zw0610)
fix UpdateJobStatusInApiServer when gang-scheduling is enabled #1549 (zw0610)
fix: add namespace filtering when getting pods/services for jobs #1545 (henrysecond1)
Remove table-logger dependency #1544 (person142)
Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf_operator #1542 (dependabot[bot])
Release Python SDK 1.4.0 #1541 (alembiewski)
mod: Upgrade ginkgo to v2 #1537 (haoxins)
docs: Fix broken links in quick-start-v1.md #1536 (nakamasato)
extends path in __init__.py for SDK correctly #1531 (cakeislife100)
chore: Update changelog for v1.4.0-rc.0 release #1528 (terrytangyuan)

training-operator - v1.4.0

Published by johnugeorge over 2 years ago

Merged pull requests:

extends path in __init__.py for SDK correctly #1531 (cakeislife100)
Update manifests with latest image tag #1527 (johnugeorge)
add option for mpi kubectl delivery #1525 (zw0610)
restore option namespace in launch arguments #1524 (zw0610)
remove unused scripts #1521 (zw0610)
remove ChanYiLin from approvers #1513 (ChanYiLin)
add StacktraceLevel for zapr #1512 (qiankunli)
add unit tests for tensorflow controller #1511 (zw0610)
add the example of MPIJob #1508 (hackerboy01)
Added 2022 roadmap and migrated previous roadmap from kubeflow/common #1500 (terrytangyuan)
Fix a typo in mpi controller log #1495 (LuBingtan)
feat(pytorch): Add init container config to avoid DNS lookup failure #1493 (gaocegege)
chore: Fix GitHub Actions script #1491 (tenzen-y)
chore: Fix missspell in tfjob #1490 (tenzen-y)
chore: Update OWNERS #1489 (gaocegege)
Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator #1487 (dependabot[bot])
fix comments for mpi-controller #1485 (hackerboy01)
add expectation-related functions for other resources used in mpi-controller #1484 (zw0610)
Add MPI job to README now that it's supported #1480 (terrytangyuan)
add mpi doc #1477 (zw0610)
Set Go version of base image to 1.17 #1476 (tenzen-y)
update label for tf-controller #1474 (zw0610)
Add Akuity to the list of adopters #1473 (terrytangyuan)
Add PR template with doc checklist #1470 (andreyvelich)
Add e2e failure debugging guidance #1469 (Jeffwan)
chore: Add .gitattributes to ignore Jsonnet test code for linguist #1463 (terrytangyuan)
Migrate additional examples from xgboost-operator #1461 (terrytangyuan)
Minor edits to README.md #1460 (terrytangyuan)
add mpi-operator(v1) to the unified operator #1457 (hackerboy01)
fix tfjob status when enableDynamicWorker set true #1455 (zw0610)
feat(pytorch): Support elastic training #1453 (gaocegege)
fix: generate printer columns for job crds #1451 (henrysecond1)
Fix README typo #1450 (davidxia)
consistent naming for better readability #1449 (pramodrj07)
Fix set scheduler error #1448 (qiankunli)
Add CI to run the tests for Go #1440 (tenzen-y)
fix: Add missing retrying package that failed the import #1439 (terrytangyuan)
Generate a single swagger.json file for all frameworks #1437 (alembiewski)
Update links and files with the new URL #1434 (andreyvelich)
chore: update CHANGELOG.md #1432 (Jeffwan)
Add acknowledgement section in README to credit all contributors #1422 (terrytangyuan)
Add Cisco to Adopters List #1421 (andreyvelich)
Add Python SDK for Kubeflow Training Operator #1420 (alembiewski)
docs: Move myself to approvers #1419 (terrytangyuan)
fix hyperlinks in the 'overview' section #1418 (pramodrj07)
docs: Migrate adopters of all operators to this repo #1417 (terrytangyuan)
Feature/support pytorchjob set queue of volcano #1415 (qiankunli)
Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1409 (Jeffwan)
Update scripts to generate sdk for all frameworks #1389 (Jeffwan)

Closed issues:

Question: What is the recommended way for Data Scientists to run a distributed training job #1535
Restore KUBEFLOW_NAMESPACE options #1522
Improve test coverage #1497
swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
[bug] Missing init container in PyTorchJob #1482
PytorchJob DDP training will stop if I delete a worker pod #1478
Write down e2e failure debug process #1467
How can i add the Priorityclass to the TFjob？ #1466
github.com/go-logr/zapr.(*zapLogger).Error #1444
Display coverage % in GitHub actions list #1442
Add Go test to CI #1436
Podgroup is constantly created and deleted after tfjob is success or failure #1426
Cut official release of 1.3.0 #1425
Add "not maintained" notice to other operator repos #1423
Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381
Python SDK for Kubeflow Training Operator #1380
Rename this repo #1348
Universal Operator Phase III: Graduate operator to production grade #1318

training-operator - v1.4.0-rc.0 release

Published by johnugeorge over 2 years ago

Features and improvements:

Display coverage % in GitHub actions list #1442
Add Go test to CI #1436

Fixed bugs:

[bug] Missing init container in PyTorchJob #1482
Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381

Closed issues:

Restore KUBEFLOW_NAMESPACE options #1522
Improve test coverage #1497
swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
PytorchJob DDP training will stop if I delete a worker pod #1478
Write down e2e failure debug process #1467
How can i add the Priorityclass to the TFjob？ #1466
github.com/go-logr/zapr.(*zapLogger).Error #1444
Podgroup is constantly created and deleted after tfjob is success or failure #1426
Cut official release of 1.3.0 #1425
Add "not maintained" notice to other operator repos #1423
Python SDK for Kubeflow Training Operator #1380

Merged pull requests:

Update manifests with latest image tag #1527 (johnugeorge)
add option for mpi kubectl delivery #1525 (zw0610)
restore option namespace in launch arguments #1524 (zw0610)
remove unused scripts #1521 (zw0610)
remove ChanYiLin from approvers #1513 (ChanYiLin)
add StacktraceLevel for zapr #1512 (qiankunli)
add unit tests for tensorflow controller #1511 (zw0610)
add the example of MPIJob #1508 (hackerboy01)
Added 2022 roadmap and migrated previous roadmap from kubeflow/common #1500 (terrytangyuan)
Fix a typo in mpi controller log #1495 (LuBingtan)
feat(pytorch): Add init container config to avoid DNS lookup failure #1493 (gaocegege)
chore: Fix GitHub Actions script #1491 (tenzen-y)
chore: Fix missspell in tfjob #1490 (tenzen-y)
chore: Update OWNERS #1489 (gaocegege)
Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator #1487 (dependabot[bot])
fix comments for mpi-controller #1485 (hackerboy01)
add expectation-related functions for other resources used in mpi-controller #1484 (zw0610)
Add MPI job to README now that it's supported #1480 (terrytangyuan)
add mpi doc #1477 (zw0610)
Set Go version of base image to 1.17 #1476 (tenzen-y)
update label for tf-controller #1474 (zw0610)
Add Akuity to the list of adopters #1473 (terrytangyuan)
Add PR template with doc checklist #1470 (andreyvelich)
Add e2e failure debugging guidance #1469 (Jeffwan)
chore: Add .gitattributes to ignore Jsonnet test code for linguist #1463 (terrytangyuan)
Migrate additional examples from xgboost-operator #1461 (terrytangyuan)
Minor edits to README.md #1460 (terrytangyuan)
add mpi-operator(v1) to the unified operator #1457 (hackerboy01)
fix tfjob status when enableDynamicWorker set true #1455 (zw0610)
feat(pytorch): Support elastic training #1453 (gaocegege)
fix: generate printer columns for job crds #1451 (henrysecond1)
Fix README typo #1450 (davidxia)
consistent naming for better readability #1449 (pramodrj07)
Fix set scheduler error #1448 (qiankunli)
Add CI to run the tests for Go #1440 (tenzen-y)
fix: Add missing retrying package that failed the import #1439 (terrytangyuan)
Generate a single swagger.json file for all frameworks #1437 (alembiewski)
Update links and files with the new URL #1434 (andreyvelich)
chore: update CHANGELOG.md #1432 (Jeffwan)
Add acknowledgement section in README to credit all contributors #1422 (terrytangyuan)
Add Cisco to Adopters List #1421 (andreyvelich)
Add Python SDK for Kubeflow Training Operator #1420 (alembiewski)
docs: Move myself to approvers #1419 (terrytangyuan)
fix hyperlinks in the 'overview' section #1418 (pramodrj07)
docs: Migrate adopters of all operators to this repo #1417 (terrytangyuan)
Feature/support pytorchjob set queue of volcano #1415 (qiankunli)
Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1409 (Jeffwan)
Update scripts to generate sdk for all frameworks #1389 (Jeffwan)

training-operator - v1.3.0

Published by Jeffwan about 3 years ago

v1.3.0 (2021-10-03)

Full Changelog

Features

Feature/support pytorchjob set queue of volcano (#1415, @qiankunli)
Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta (#1409, @Jeffwan)
add api-doc for all frameworks (#1370, @DeliangFan)
Update training operator release process (#1347, @Jeffwan)
add option to enable schemes (#1342, @zw0610)
Don't use inline for runPolicy and register defaulter function (#1330, @Jeffwan)
Use low-level controller and handlers in SetupWithManager (#1315, @Jeffwan)
Reorganize controllers code structure (#1302, @Jeffwan)
Unify code structure of training job api (#1300, @Jeffwan)
Extract reusable codes to common utility (#1297, @Jeffwan)
Add MXJob v1 api and controller (#1296, @Jeffwan)
add tf reconciler to the unified operator (#1295, @zw0610)
Add XGBoost controller (#1293, @Jeffwan)
add pytorch API and controller (#1294, @zw0610)
Generate tfjob 1.19.x compatible clientset (#1290, @Jeffwan)
add XGBoostJob api (#1286, @zw0610)

Bug fixes

fix hyperlinks in the 'overview' section (#1418, @pramodrj07)
2010: fix to expose correct monitoring port (#1405, @deepak-muley)
Fix 1399: added pod matching label in service selector (#1404, @deepak-muley)
fix: runPolicy validation error in the examples (#1401, @Jeffwan)
fix: volcano pod group creation issue (#1390, @Jeffwan)
Fix 1340 prometheus counters (#1375, @deepak-muley)
fix makefile to store crds in a separate folder (#1368, @deepak-muley)
Fix copyright header for some files (#1371, @DeliangFan)
fix incorrect torch env population (#1361, @Jeffwan)
fix: Resolve scheme registration issue for defaulters (#1360, @Jeffwan)
Fix XGBoost container name in log message (#1362, @andreyvelich)
Fix postsubmit job using PULL_BASE_SHA (#1344, @Jeffwan)
Fix all client request that needs contexts (#1292, @Jeffwan)

Misc

Update manifests with latest image tag (#1406, @johnugeorge)
Add simple verification jobs (#1391, @Jeffwan)
chore: Bump kubeflow/common version to 0.3.7 (#1388, @Jeffwan)
chore(doc): Update README.md (#1387, @Jeffwan)
Remove lagacy tf-operator from the codebase (#1378, @thunderboltsid)
chore: Update manifest image tag (#1364, @Jeffwan)
add example for mxnet and pytorch (#1373, @DeliangFan)
docs: Update to use kubectl kustomize in instructions (#1356, @terrytangyuan)
Clean up manifests and remove unused files (#1349, @Jeffwan)
1322: Modified manifests to use all-in-one training-operator (#1346, @deepak-muley)
chore: Update changelog.md (#1339, @Jeffwan)
Move root level docs to docs folder (#1338, @Jeffwan)
Temporarily add Jeffwan@ to OWNERS (#1287, @Jeffwan)

Testing

Update include dirs in prow config (#1374, @andreyvelich)
Enable e2e test against universal operator (#1336, @Jeffwan)
Use PULL_PULL_SHA for image (#1334, @PatrickXYS)

training-operator - v1.3.0-rc.2

Published by Jeffwan about 3 years ago

v1.3.0-rc.2 (2021-09-20)

Full Changelog

Closed issues:

[bug] Unable to specify pod template metadata for TFJob #1403

Merged pull requests:

Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1410 (Jeffwan)
chore: Update image tags in manifests #1412 (Jeffwan)

training-operator - v1.3.0-rc.1

Published by johnugeorge about 3 years ago

v1.3.0-rc.1 (2021-09-15)

Full Changelog

Closed issues:

[bug] Reconcilation fails when upgrading common to 0.3.6 #1394

Merged pull requests:

Update manifests with latest image tag #1406 (johnugeorge)
2010: fix to expose correct monitoring port #1405 (deepak-muley)
Fix 1399: added pod matching label in service selector #1404 (deepak-muley)
fix: runPolicy validation error in the examples #1401 (Jeffwan)

training-operator - v1.3.0-rc.0

Published by Jeffwan about 3 years ago

v1.3.0-rc.0 (2021-08-31)

Full Changelog

Features

add api-doc for all frameworks (#1370, @DeliangFan)
Update training operator release process (#1347, @Jeffwan)
add option to enable schemes (#1342, @zw0610)
Don't use inline for runPolicy and register defaulter function (#1330, @Jeffwan)
Use low-level controller and handlers in SetupWithManager (#1315, @Jeffwan)
Reorganize controllers code structure (#1302, @Jeffwan)
Unify code structure of training job api (#1300, @Jeffwan)
Extract reusable codes to common utility (#1297, @Jeffwan)
Add MXJob v1 api and controller (#1296, @Jeffwan)
add tf reconciler to the unified operator (#1295, @zw0610)
Add XGBoost controller (#1293, @Jeffwan)
add pytorch API and controller (#1294, @zw0610)
Generate tfjob 1.19.x compatible clientset (#1290, @Jeffwan)
add XGBoostJob api (#1286, @zw0610)

Bug fixes

fix: volcano pod group creation issue (#1390, @Jeffwan)
Fix 1340 prometheus counters (#1375, @deepak-muley)
fix makefile to store crds in a separate folder (#1368, @deepak-muley)
Fix copyright header for some files (#1371, @DeliangFan)
fix incorrect torch env population (#1361, @Jeffwan)
fix: Resolve scheme registration issue for defaulters (#1360, @Jeffwan)
Fix XGBoost container name in log message (#1362, @andreyvelich)
Fix postsubmit job using PULL_BASE_SHA (#1344, @Jeffwan)
Fix all client request that needs contexts (#1292, @Jeffwan)

Misc

Add simple verification jobs (#1391, @Jeffwan)
chore: Bump kubeflow/common version to 0.3.7 (#1388, @Jeffwan)
chore(doc): Update README.md (#1387, @Jeffwan)
Remove lagacy tf-operator from the codebase (#1378, @thunderboltsid)
chore: Update manifest image tag (#1364, @Jeffwan)
add example for mxnet and pytorch (#1373, @DeliangFan)
docs: Update to use kubectl kustomize in instructions (#1356, @terrytangyuan)
Clean up manifests and remove unused files (#1349, @Jeffwan)
1322: Modified manifests to use all-in-one training-operator (#1346, @deepak-muley)
chore: Update changelog.md (#1339, @Jeffwan)
Move root level docs to docs folder (#1338, @Jeffwan)
Temporarily add Jeffwan@ to OWNERS (#1287, @Jeffwan)

Testing

Update include dirs in prow config (#1374, @andreyvelich)
Enable e2e test against universal operator (#1336, @Jeffwan)
Use PULL_PULL_SHA for image (#1334, @PatrickXYS)

training-operator - v1.2.1

Published by gaocegege about 3 years ago

v1.2.1 (2021-08-27)

Full Changelog

Closed issues:

volcano scheduler with customized queue #1377
[chore] Delete the all-in-one branch #1372
Fix scheme registration issue #1359
Distributive Gloo PyTorchJob example doesn't work #1358
Enable CI pipeline against v1.2-branch #1355
Generate API Documentation for all frameworks #1341
make test failed due to invalid crd schema #1324
Cut 1.2.0 tag and release a stable version of master #1321
Copyright header is not correctly generated #1309

Merged pull requests:

fix(init): Fix crash problem when enabling gang scheduling #1384 (gaocegege)

training-operator - v1.3.0-alpha.3

Published by Jeffwan about 3 years ago

v1.3.0 will be the first release to support TensorFlow, PyTorch, MXNet, and XGBoost distributed training jobs.
More background can be found in the design doc, All-in-one Kubeflow Training Operator.

Install the Kubeflow Training Operator by running:

 kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.3.0-alpha.3"

Requires kubectl >= 1.21.x.
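
The kubectl version requirement can be checked before applying the manifests. The following is a minimal sketch and not part of the release: the helper names parse_version and kubectl_is_supported are hypothetical, and it assumes the version is available as a string such as "v1.21.3" (for example, copied from the output of kubectl version).

```python
import re

# Minimum (major, minor) taken from the release note: kubectl >= 1.21.x
MIN_KUBECTL = (1, 21)


def parse_version(text):
    """Return (major, minor) from a string such as 'v1.21.3' or '1.22.0'."""
    match = re.search(r"v?(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def kubectl_is_supported(version_text, minimum=MIN_KUBECTL):
    """Compare as integer tuples so that 1.9 sorts below 1.21."""
    return parse_version(version_text) >= minimum
```

Comparing integer tuples rather than raw strings avoids treating 1.9 as newer than 1.21.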

training-operator - v1.3.0-alpha.2

Published by Jeffwan about 3 years ago

v1.3.0 will be the first release to support TensorFlow, PyTorch, MXNet, and XGBoost distributed training jobs.
More background can be found in the design doc, All-in-one Kubeflow Training Operator.

Install the Kubeflow Training Operator by running:

 kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.3.0-alpha.2"

Requires kubectl >= 1.21.x.

training-operator - v1.3.0-alpha.1

Published by Jeffwan about 3 years ago

v1.3.0 will be the first release to support TensorFlow, PyTorch, MXNet, and XGBoost distributed training jobs.
More background can be found in the design doc, All-in-one Kubeflow Training Operator.

Install the Kubeflow Training Operator by running:

 kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.3.0-alpha.2"

Requires kubectl >= 1.21.x.

training-operator - v1.2.0 release

Published by Jeffwan about 3 years ago

v1.2.0 (2021-08-03)

Full Changelog

Features

Add job namespace to tf_operator_jobs_* counters (#1283, @alembiewski)
feat: upgrade kubeflow common and volcano version (#1276, @shinytang6)
Add task type annotation for pods when EnableGangScheduling is true. (#1268, @jiangkaihua)

Bug fixes

Fix invalid pointer when tfjob is deleted (#1285, @johnugeorge)
fix get_logs pod_names type and iteration blocking (#1280, @Windfarer)
fix calling custom_api.delete_namespaced_custom_object args error (#1281, @Windfarer)
fix: Remove the dup comment tag (#1274, @gaocegege)
Fix: Remove Github CD workflow (#1263, @PatrickXYS)
Fix: the "follow" of TFJobClient.get_logs (#1254, @Windfarer)

Misc

Update container image for v1.1.1 (#1328, @Jeffwan)
add a specific version of tensorflow_datasets (#1305, @jazzsir)
Remove vendor folder (#1288, @Jeffwan)
add podgroups rule in cluster-role.yaml (#1272, @huone1)
Use remote Kustomize build option in standalone installation instructions (#1266, @verult)

training-operator - v1.1.0 release

Published by Jeffwan over 3 years ago

This is a large official release following v0.5.3. Feedback is welcome. Thanks to all contributors.

Features

feat: Remove k8s.io/kubernetes (#1235, @gaocegege)
Migrate to public ECR (#1256, @PatrickXYS)
feat: Add API Documentation WIP (#1249, @gaocegege)
feat: Update developers guide and readme (#1244, @gaocegege)
Move TF Operator e2e tests to AWS Prow (#1204, @ChanYiLin)
crd definition support multiple evaluator (#1240, @oikomi)
support multiple evaluators (#1239, @oikomi)
feat: Change the message for running condition (#1230, @gaocegege)
feat(server): Use apiextension client to check if crd exists (#1228, @gaocegege)
checkCRDExists func return true when k8s cluster is not connected (#1207, @oikomi)
feat: Add CD using GitHub Actions (#1196, @gaocegege)
Migrate controller implementation to kubeflow/common fashion (#1171, @ChanYiLin)
Support success policy for TFJob (#1165, @terrytangyuan)
add distributed training example of using TF 2.1 Strategy API (#1164, @jazzsir)
Set completion time when job exceed specified deadline. (#1150, @SimonCqk)
Support ClusterSpec Propagation Feature in TF 1.14 (#1149, @zhujl1991)
Add watch function for TFJob python Client API (#1122, @jinchihe)
Enhance tfjobs sdk docs (#1114, @jinchihe)
Generate TFJob Python SDK (#1103, @jinchihe)
feat: Support pprof when monitoring is specified (#1102, @gaocegege)
feat: Use kubeflow/common (#1088, @gaocegege)
Add support for aarch64 (#1098, @MrXinWang)
feat: Do not set TF_CONFIG for local training (#1080, @gaocegege)
feat: Replace gometalinter with golangci-lint (#1081, @gaocegege)
Add controller-name label for Pod and service (#1067, @hougangliu)
Add qps and burst options (#1063, @ScorpioCPH)
Avoid unnecessary update when tfjob is complete (#1051, @cheyang)
set annotation automatically when EnableGangScheduling is set to true (#1032, @ChanYiLin)
feat(pod): Support custom gang scheduler via CLI argument (#1050, @gaocegege)

Bug fixes

Fix kubeflow overlay (#1260, @PatrickXYS)
fix: Do not validate evaluator (#1238, @gaocegege)
fix: Remove default resync period (#1237, @gaocegege)
fix: Observe the creation when failed to create the pod (#1236, @gaocegege)
fix: Remove vendor cp command (#1232, @gaocegege)
Fix completion time setting bug (#1226, @shaowei-su)
feat(deploy): Add standalone deployment yaml (#1218, @gaocegege)
Fix updateStatus no worker Crashoff (#1215, @kuikuikuizzZ)
fix: Fix the log message (#1203, @gaocegege)
Fix the typo (#1178, @pingsutw)
Fix setup cluster issue and Pylint issue in CI tests (#1179, @jinchihe)
Fix the link to run_e2e_workflow.py script (#1154, @terrytangyuan)
Fix evaluator runconfig (#1146, @richardsliu)
Fix sdk test issue that's caused by kubenertes Client bug. (#1143, @jinchihe)
fix(controller): calculate satisfied with && instead of || (#1120, @GuoHaiqing)
fix comment, add +optional flag to comment. (#1137, @EDGsheryl)
fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured (#1118, @leileiwan)
fix the reconcile flow (#1111, @ChanYiLin)
Fix example Mnist With Summaries (#1073, @andreyvelich)
fix bug: When executing tf-operator.v1 -version, GitSHA is always 'not provided' (#1046, @asdfsx)
fix(UI): show correct namespace and name when deleting job through dashboard (#1044, @gbin10533)
Minor fix to add CoreV1 to scheme (#1037, @johnugeorge)
fix(docs): Fix link for simple_TFJob_test (#1038, @gaocegege)
fix: Remove dup code (#1022, @gaocegege)

Chores

tf-operator: Consolidate manifests (#1255, @yanniszark)
TFJob Operator: Move manifests development upstream (#1247, @yanniszark)
Update vendor as kubeflow/common is updated. (#1252, @jiangkaihua)
docs: Add Ant Group to ADOPTERS.md (#1243, @terrytangyuan)
chore: Add tencent cloud (#1234, @gaocegege)
add vip (#1233, @oikomi)
chore: Update changelog (#1227, @gaocegege)
Update kubeflow common to 0.3.2 (#1225, @shaowei-su)
chore: Remove useless expectation (#1217, @gaocegege)
chore: Update codegen (#1211, @gaocegege)
add Evaluator type for CRD example (#1209, @oikomi)
add err log for create client set failed and code minor optimization (#1210, @oikomi)
chore: Remove the kanban update workflow (#1201, @gaocegege)
chore: Refactor cmd (#1199, @gaocegege)
bugfix for multi_worker_strategy-with-keras.py (#1198, @jiaqianjing)
Fix error when conditions is empty. (#1185, @Corea)
b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language (#1190, @sculd)
chore: Update OWNERS (#1177, @gaocegege)
Update developer_guide.md (#1176, @pingsutw)
Update swagger-codegen-cli URL (#1172, @jinchihe)
Use go mod (#1144, @xychu)
Make tf_operator use static compilation in container (#1160, @MrXinWang)
Update tf_job_client.py remove unused variable. (#1157, @NikeNano)
Update e2e_testing.md (#1155, @NikeNano)
Disable istio sidecar injection in simple tfjob test (#1148, @Bobgy)
OWNERS: Add ChanYiLin as approver (#1147, @ChanYiLin)
Remove unused function arg (#1145, @zhujl1991)
docs: Add roadmap (#1140, @gaocegege)
simple_tfjob_tests py3 version (#1134, @gabrielwen)
add tf-operator test in py3 (#1133, @gabrielwen)
Distroless image for TF operator (#1124, @krishnadurai)
SDK support getting the TFJob training logs (#1130, @jinchihe)
Copy third party vendor source code to Docker image (#1128, @richardsliu)
Add third party licenses (#1127, @richardsliu)
remove tfjob dashboard (#1119, @ChanYiLin)
Update checking status API name (#1117, @jinchihe)
Add more APIs for TFJob done (#1116, @jinchihe)
feat: Add adopters in README (#1092, @gaocegege)
Support for ppc64le (#1082, @zoyun)
use multi-stage build to build tf-operator image (#1072, @hmtai)
add ppc64le support for the example dist-mnist (#1084, @alongzhi)
add the dockerfile for ppc64le (#1083, @alongzhi)
Updating issue bot configs (#1074, @rbrishabh)
Delete v1beta2 api (#1075, @johnugeorge)
add ldflag verion (#1052, @yeya24)
Add verify-codegen in travis CI (#1070, @ohmystack)
Set tfjob defaults in test utils (#1071, @ohmystack)
Update codegen (#1069, @ohmystack)
rewrite dockerfile (#1062, @hmtai)
Renaming labels to common types (#1064, @johnugeorge)
add total suffix in counter metrics (#1055, @yeya24)
Update k8s libraries to 1.12.3 (#1054, @johnugeorge)
add flag kubeconfig (#1049, @yeya24)
Easily detect the GOPATH in current development environment. (#1047, @xauthulei)
Update gang scheduler name (#1028, @goodluckbot)
Set worker 0 completed if pod's phase goto succeeded (#1042, @ScorpioCPH)
Removing unnecessary Rbac authorization (#1036, @johnugeorge)
refactor: add GenPodGroupName method to extract podGroupName in diffe… (#1034, @zlcnju)
update release script (#1040, @kunmingg)
Update image base to UBI8 GA (#1023, @pdmack)