training-operator

Distributed ML Training and Fine-Tuning on Kubernetes

APACHE-2.0 License

Downloads: 15.6K
Stars: 1.5K
Committers: 179

training-operator - v1.8.0-rc.0 release Latest Release

Published by johnugeorge 6 months ago

New features

Bug fixes

Misc

training-operator - v1.7.0 release

Published by johnugeorge 12 months ago

Breaking Changes

New features

Bug fixes

  • Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
  • Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
  • Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

training-operator - v1.7.0-rc.0 release

Published by johnugeorge about 1 year ago

Breaking Changes

New features

Bug fixes

  • Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
  • Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
  • Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

training-operator - v1.6.0 release

Published by johnugeorge over 1 year ago

Note: Because scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1773

Note: The latest Python SDK (version 1.6) does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702
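The minimum-version rule in the note above can be checked before submitting jobs. The following is a minimal sketch, not part of the SDK: the helper names `parse_tag` and `operator_supported` are hypothetical, and it assumes operator versions follow the `vMAJOR.MINOR.PATCH[-suffix]` tag format used on this page. It also assumes that a release candidate of v1.6.0 does not count as meeting the minimum, which the note does not state explicitly.

```python
import re

# Minimum operator version, taken from the note above.
MIN_OPERATOR_VERSION = "v1.6.0"

# Matches tags such as "v1.6.0" or "v1.6.0-rc.1" (hypothetical helper pattern).
_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-(.+))?$")


def parse_tag(tag):
    """Turn a release tag into a tuple that sorts in release order."""
    m = _TAG.match(tag)
    if m is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    # A tag without a suffix is a final release and sorts after any
    # pre-release (rc, alpha) of the same version. Pre-releases of the
    # same version are not ordered among themselves in this sketch.
    is_final = m.group(4) is None
    return (major, minor, patch, is_final)


def operator_supported(operator_tag, minimum=MIN_OPERATOR_VERSION):
    """Return True if the operator tag is at or above the minimum."""
    return parse_tag(operator_tag) >= parse_tag(minimum)
```

For example, `operator_supported("v1.5.0")` returns False, while `operator_supported("v1.7.0-rc.0")` returns True because its minor version is already above the minimum.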

New Features

Bug fixes

Misc

Closed issues:

  • The default value for CleanPodPolicy is inconsistent. #1753
  • HPA support for PyTorch Elastic #1751
  • Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
  • paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
  • *job API(master) cannot compatible with old job #1725
  • Support coscheduling plugin #1722
  • Number of worker threads used by the controller can't be configured #1706
  • Conformance: Training tests #1698
  • PyTorch and MPI Operator pulls hardcoded initContainer #1696
  • PaddlePaddle Training: why can't find pods #1694
  • Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
  • [SDK] Create unify client for all Training Job types #1691
  • Support Kubernetes v1.25 #1682
  • panic happened when add podgroup watch #1679
  • OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
  • There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
  • Change Kubernetes version for test #1665
  • Support for multiplatform container image (amd64 and arm64) #1664
  • Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
  • After setting hostNetwork to true, mpi does not work #1657
  • What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
  • When will MPIJob support v2beta1 version? #1653
  • Kubernetes HPA doesn't work with elastic PytorchJob #1645
  • training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
  • Training operator fails to create HPA for TorchElastic jobs #1626
  • Release v1.5.0 tracking #1622
  • upgrade client-go #1599
  • trainning-operator may need to monitor PodGroup #1574
  • Error: invalid memory address or nil pointer dereference #1553
  • The pytorchJob training is slow #1532
  • pytorch elastic scheduler error #1504

training-operator - v1.6.0-rc.1 release

Published by johnugeorge over 1 year ago

Note: Because scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

Merged pull requests:

Closed issues:

  • The default value for CleanPodPolicy is inconsistent. #1753
  • HPA support for PyTorch Elastic #1751
  • Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
  • paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
  • *job API(master) cannot compatible with old job #1725
  • Support coscheduling plugin #1722
  • Number of worker threads used by the controller can't be configured #1706
  • Conformance: Training tests #1698
  • PyTorch and MPI Operator pulls hardcoded initContainer #1696
  • PaddlePaddle Training: why can't find pods #1694
  • Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
  • [SDK] Create unify client for all Training Job types #1691
  • Support Kubernetes v1.25 #1682
  • panic happened when add podgroup watch #1679
  • OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
  • There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
  • Change Kubernetes version for test #1665
  • Support for multiplatform container image (amd64 and arm64) #1664
  • Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
  • After setting hostNetwork to true, mpi does not work #1657
  • What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
  • When will MPIJob support v2beta1 version? #1653
  • Kubernetes HPA doesn't work with elastic PytorchJob #1645
  • training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
  • Training operator fails to create HPA for TorchElastic jobs #1626
  • Release v1.5.0 tracking #1622
  • upgrade client-go #1599
  • trainning-operator may need to monitor PodGroup #1574
  • Error: invalid memory address or nil pointer dereference #1553
  • The pytorchJob training is slow #1532
  • pytorch elastic scheduler error #1504

training-operator - v1.6.0-rc.0 release

Published by johnugeorge over 1 year ago

v1.6.0-rc.0 release

training-operator - v1.5.0 release

Published by johnugeorge about 2 years ago

Full Changelog

New Features

Bug Fixes

Misc

training-operator - v1.5.0-rc.0 release

Published by johnugeorge over 2 years ago

Full Changelog

Closed issues:

  • MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617
  • unable to fetch TFJob when I use client.go run tfjob #1612
  • Pytorchjob dist-mnist no training logs #1601
  • kubectl get tfjob -o yaml, but not status output #1598
  • missing image in tf_job_design_doc.md #1590
  • Labels in Python client are out of date #1587
  • PyTorchJob Pods "Not Ready" After Completing Training #1577
  • cannot use "github.com/go-openapi/spec".Schema{...} (type "github.com/go-openapi/spec".Schema) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value #1576
  • PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570
  • pytorchjob doesn't have status.startTIme. #1566
  • Optional-test-infra Deprecation Notice - Training #1561
  • Should we update MPIJob unit test CleanPodPolicy field? #1555
  • --enable-gang-scheduling=true doesn't work for MPIJob #1548
  • PyTorchJob fails when creating a task with a different namespace but the same name #1543
  • Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: "null" after enable-gang-scheduling #1538
  • Job TTLs not working #1533
  • Support PodGroup in scheduler-plugins/coscheduling #1518
  • support elastic training #1515
  • Modified the configuration of RootLogger #1514
  • Add checking import order in CI #1510
  • Scale down of pytorchJob cause workers pod to restart #1509
  • Support label selector based success/failure conditions #1507
  • [feat] Support SuccessPolicy in PyTorchJob #1505
  • pytorch elastic scheduler error #1504
  • Could you add the example of MPIJob in this repository #1502
  • [Feature] Create a Informer/ClientSet for PyTorch Jobs #1499
  • [feature] Make init container injection logic availabel to all jobs #1498
  • Roadmaps for 1.4 release #1496
  • [bug] (MpiJob)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. #1494
  • Reconcile PyTorch Job error Operation cannot be fulfilled on pytrochjobs.kubeflow.org #1492
  • Python PytorchJob: no attribute openapi_types for example code #1481
  • PyTorch DistributedDataParallel training with multi nodes #1475
  • Installing kubeflow-training breaks import for other kubeflow packages (katib, fairing, etc.) #1471
  • Deprecate ksonnet and use python/golang to submit jobs #1468
  • Help Wanted in ParameterServerStrategy Example. #1459
  • Bug: SomeTimes Coredumped using tfjob #1456
  • [question] PyTorchJob MNIST example training speed #1454
  • tfjob status not match when EnableDynamicWorker set true #1452
  • training-operator set scheduler error #1447
  • [sdk]: Replace TableLogger component in the SDK for better support with ipykernel>=6.x #1446
  • SDK: wait_for_job reports typeError #1445
  • Update prometheus monitoring doc #1443
  • Master branch should provide a nightly image #1433
  • Clean up test folder before testing #1429
  • Clean up TF specific docs #1424
  • [feature] Support SchedulingPolicy in PyTorchJob #1414
  • Hyperlinks in the "Overview" section is incorrect/not found #1411
  • add workqueue metric #1407
  • Validation fails for MXJob Tune example #1402
  • Rate exceeded for aws ecr image #1400
  • change layout to follow the standard of kubebuilder? #1397
  • [example] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist #1393
  • Update kubeflow/website for 1.4 release #1392
  • Cut beta release of tf-operator for 1.4 release #1385
  • "invalid memory address or nil pointer dereference" #1382
  • some questions about job sync #1379
  • Provides a default Grafana dashboard #1376
  • [feature] Support different PS/worker types #1369
  • Need to copy all (mainly pytorch) framework's example dir to tf-operator/examples #1366
  • Add more CRD validations markers to block invalid job on client apply #1363
  • Update presubmit and post submit job triggers #1354
  • Optimize post submit jobs flow #1353
  • Enable leader election in controller manager using controllermanagerconfig #1350
  • Support mpi jobs in universal operator #1345
  • post-submit job failure in master branch #1343
  • Improve observability of universal operator #1340
  • Best practice to organize main.go and Dockerfile? #1333
  • Should training operator keep clientset in the same repository? #1332
  • Test image has incorrect tag? #1329
  • Prepare e2e tests for all frameworks #1323
  • Reduce e2e replica-restart-policy-tests running time #1319
  • Improve logs structure by consolidating libs from controller runtime and controllers #1313
  • Enable tests for all frameworks #1311
  • [bug] The pod wil be recreated until the expectation expires #1306
  • Upgrade CRDs to apiextensions.k8s.io/v1 #1304
  • Add role details as new columns to kubectl get jobs output for CRD. #1301
  • How to handle long pending pods in a TF-job? #1282
  • Could you release a new version of Python SDK #1279
  • Update swagger.json schema for TFJobSpec to include RunPolicy #1278
  • Not able to pass environment variable from tfjob to pod #1273
  • v1_time.py is not generated by hack/python-sdk/gen-sdk.sh #1271
  • Add a step to upload artifact #1258
  • [feature] Support multi port in TFJob #1251
  • [feat] Add scale subresource #1220
  • Pod get re-created after it exited and get garbage collected #1186
  • Clean up vendor dependencies #1162

Merged pull requests:

training-operator - v1.4.0

Published by johnugeorge over 2 years ago

Full Changelog

Merged pull requests:

Closed issues:

  • Question: What is the recommended way for Data Scientists to run a distributed training job #1535
  • Restore KUBEFLOW_NAMESPACE options #1522
  • Improve test coverage #1497
  • swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
  • [bug] Missing init container in PyTorchJob #1482
  • PytorchJob DDP training will stop if I delete a worker pod #1478
  • Write down e2e failure debug process #1467
  • How can i add the Priorityclass to the TFjob? #1466
  • github.com/go-logr/zapr.(*zapLogger).Error #1444
  • Display coverage % in GitHub actions list #1442
  • Add Go test to CI #1436
  • Podgroup is constantly created and deleted after tfjob is success or failure #1426
  • Cut official release of 1.3.0 #1425
  • Add "not maintained" notice to other operator repos #1423
  • Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381
  • Python SDK for Kubeflow Training Operator #1380
  • Rename this repo #1348
  • Universal Operator Phase III: Graduate operator to production grade #1318

training-operator - v1.4.0-rc.0 release

Published by johnugeorge over 2 years ago

Full Changelog

Features and improvements:

  • Display coverage % in GitHub actions list #1442
  • Add Go test to CI #1436

Fixed bugs:

  • [bug] Missing init container in PyTorchJob #1482
  • Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381

Closed issues:

  • Restore KUBEFLOW_NAMESPACE options #1522
  • Improve test coverage #1497
  • swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
  • PytorchJob DDP training will stop if I delete a worker pod #1478
  • Write down e2e failure debug process #1467
  • How can i add the Priorityclass to the TFjob? #1466
  • github.com/go-logr/zapr.(*zapLogger).Error #1444
  • Podgroup is constantly created and deleted after tfjob is success or failure #1426
  • Cut official release of 1.3.0 #1425
  • Add "not maintained" notice to other operator repos #1423
  • Python SDK for Kubeflow Training Operator #1380

Merged pull requests:

training-operator - v1.3.0

Published by Jeffwan about 3 years ago

v1.3.0 (2021-10-03)

Full Changelog

Features

  • Feature/support pytorchjob set queue of volcano (#1415, @qiankunli)
  • Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta (#1409, @Jeffwan)
  • add api-doc for all frameworks (#1370, @DeliangFan)
  • Update training operator release process (#1347, @Jeffwan)
  • add option to enable schemes (#1342, @zw0610)
  • Don't use inline for runPolicy and register defaulter function (#1330, @Jeffwan)
  • Use low-level controller and handlers in SetupWithManager (#1315, @Jeffwan)
  • Reorganize controllers code structure (#1302, @Jeffwan)
  • Unify code structure of training job api (#1300, @Jeffwan)
  • Extract reusable codes to common utility (#1297, @Jeffwan)
  • Add MXJob v1 api and controller (#1296, @Jeffwan)
  • add tf reconciler to the unified operator (#1295, @zw0610)
  • Add XGBoost controller (#1293, @Jeffwan)
  • add pytorch API and controller (#1294, @zw0610)
  • Generate tfjob 1.19.x compatible clientset (#1290, @Jeffwan)
  • add XGBoostJob api (#1286, @zw0610)

Bug fixes

  • fix hyperlinks in the 'overview' section (#1418, @pramodrj07)
  • 2010: fix to expose correct monitoring port (#1405, @deepak-muley)
  • Fix 1399: added pod matching label in service selector (#1404, @deepak-muley)
  • fix: runPolicy validation error in the examples (#1401, @Jeffwan)
  • fix: volcano pod group creation issue (#1390, @Jeffwan)
  • Fix 1340 prometheus counters (#1375, @deepak-muley)
  • fix makefile to store crds in a separate folder (#1368, @deepak-muley)
  • Fix copyright header for some files (#1371, @DeliangFan)
  • fix incorrect torch env population (#1361, @Jeffwan)
  • fix: Resolve scheme registration issue for defaulters (#1360, @Jeffwan)
  • Fix XGBoost container name in log message (#1362, @andreyvelich)
  • Fix postsubmit job using PULL_BASE_SHA (#1344, @Jeffwan)
  • Fix all client request that needs contexts (#1292, @Jeffwan)

Misc

  • Update manifests with latest image tag (#1406, @johnugeorge)
  • Add simple verification jobs (#1391, @Jeffwan)
  • chore: Bump kubeflow/common version to 0.3.7 (#1388, @Jeffwan)
  • chore(doc): Update README.md (#1387, @Jeffwan)
  • Remove legacy tf-operator from the codebase (#1378, @thunderboltsid)
  • chore: Update manifest image tag (#1364, @Jeffwan)
  • add example for mxnet and pytorch (#1373, @DeliangFan)
  • docs: Update to use kubectl kustomize in instructions (#1356, @terrytangyuan)
  • Clean up manifests and remove unused files (#1349, @Jeffwan)
  • 1322: Modified manifests to use all-in-one training-operator (#1346, @deepak-muley)
  • chore: Update changelog.md (#1339, @Jeffwan)
  • Move root level docs to docs folder (#1338, @Jeffwan)
  • Temporarily add Jeffwan@ to OWNERS (#1287, @Jeffwan)

Testing

  • Update include dirs in prow config (#1374, @andreyvelich)
  • Enable e2e test against universal operator (#1336, @Jeffwan)
  • Use PULL_PULL_SHA for image (#1334, @PatrickXYS)

training-operator - v1.3.0-rc.2

Published by Jeffwan about 3 years ago

v1.3.0-rc.2 (2021-09-20)

Full Changelog

Closed issues:

  • [bug] Unable to specify pod template metadata for TFJob #1403

Merged pull requests:

  • Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1410 (Jeffwan)
  • chore: Update image tags in manifests #1412 (Jeffwan)

training-operator - v1.3.0-rc.1

Published by johnugeorge about 3 years ago

v1.3.0-rc.1 (2021-09-15)

Full Changelog

Closed issues:

  • [bug] Reconcilation fails when upgrading common to 0.3.6 #1394

Merged pull requests:

training-operator - v1.3.0-rc.0

Published by Jeffwan about 3 years ago

v1.3.0-rc.0 (2021-08-31)

Full Changelog

Features

  • add api-doc for all frameworks (#1370, @DeliangFan)
  • Update training operator release process (#1347, @Jeffwan)
  • add option to enable schemes (#1342, @zw0610)
  • Don't use inline for runPolicy and register defaulter function (#1330, @Jeffwan)
  • Use low-level controller and handlers in SetupWithManager (#1315, @Jeffwan)
  • Reorganize controllers code structure (#1302, @Jeffwan)
  • Unify code structure of training job api (#1300, @Jeffwan)
  • Extract reusable codes to common utility (#1297, @Jeffwan)
  • Add MXJob v1 api and controller (#1296, @Jeffwan)
  • add tf reconciler to the unified operator (#1295, @zw0610)
  • Add XGBoost controller (#1293, @Jeffwan)
  • add pytorch API and controller (#1294, @zw0610)
  • Generate tfjob 1.19.x compatible clientset (#1290, @Jeffwan)
  • add XGBoostJob api (#1286, @zw0610)

Bug fixes

  • fix: volcano pod group creation issue (#1390, @Jeffwan)
  • Fix 1340 prometheus counters (#1375, @deepak-muley)
  • fix makefile to store crds in a separate folder (#1368, @deepak-muley)
  • Fix copyright header for some files (#1371, @DeliangFan)
  • fix incorrect torch env population (#1361, @Jeffwan)
  • fix: Resolve scheme registration issue for defaulters (#1360, @Jeffwan)
  • Fix XGBoost container name in log message (#1362, @andreyvelich)
  • Fix postsubmit job using PULL_BASE_SHA (#1344, @Jeffwan)
  • Fix all client request that needs contexts (#1292, @Jeffwan)

Misc

  • Add simple verification jobs (#1391, @Jeffwan)
  • chore: Bump kubeflow/common version to 0.3.7 (#1388, @Jeffwan)
  • chore(doc): Update README.md (#1387, @Jeffwan)
  • Remove legacy tf-operator from the codebase (#1378, @thunderboltsid)
  • chore: Update manifest image tag (#1364, @Jeffwan)
  • add example for mxnet and pytorch (#1373, @DeliangFan)
  • docs: Update to use kubectl kustomize in instructions (#1356, @terrytangyuan)
  • Clean up manifests and remove unused files (#1349, @Jeffwan)
  • 1322: Modified manifests to use all-in-one training-operator (#1346, @deepak-muley)
  • chore: Update changelog.md (#1339, @Jeffwan)
  • Move root level docs to docs folder (#1338, @Jeffwan)
  • Temporarily add Jeffwan@ to OWNERS (#1287, @Jeffwan)

Testing

  • Update include dirs in prow config (#1374, @andreyvelich)
  • Enable e2e test against universal operator (#1336, @Jeffwan)
  • Use PULL_PULL_SHA for image (#1334, @PatrickXYS)

training-operator - v1.2.1

Published by gaocegege about 3 years ago

v1.2.1 (2021-08-27)

Full Changelog

Closed issues:

  • volcano scheduler with customized queue #1377
  • [chore] Delete the all-in-one branch #1372
  • Fix scheme registration issue #1359
  • Distributive Gloo PyTorchJob example doesn't work #1358
  • Enable CI pipeline against v1.2-branch #1355
  • Generate API Documentation for all frameworks #1341
  • make test failed due to invalid crd schema #1324
  • Cut 1.2.0 tag and release a stable version of master #1321
  • Copyright header is not correctly generated #1309

Merged pull requests:

  • fix(init): Fix crash problem when enabling gang scheduling #1384 (gaocegege)

training-operator - v1.3.0-alpha.3

Published by Jeffwan about 3 years ago

v1.3.0 will be the first release to support TensorFlow, PyTorch, MXNet, and XGBoost distributed training jobs.
More background can be found in the design doc, All-in-one Kubeflow Training Operator.

Install the Kubeflow training operator by running:

 kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.3.0-alpha.3"

Requires kubectl >= 1.21.x.
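As a rough illustration of the install step above, the sketch below builds the same `kubectl apply -k` command for a given release tag and checks a kubectl version string against the stated 1.21 minimum. The function names `install_command` and `kubectl_meets_minimum` are hypothetical helpers, not part of any Kubeflow tooling, and the version string is assumed to be supplied by the caller (for example, copied from the output of `kubectl version`).

```python
import re

# Kustomize overlay path, copied from the install command in these notes.
REPO_OVERLAY = "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone"

# Minimum kubectl version stated in these notes (major, minor).
MIN_KUBECTL = (1, 21)


def install_command(ref):
    """Build the kubectl command that installs the operator at a given tag."""
    return f'kubectl apply -k "{REPO_OVERLAY}?ref={ref}"'


def kubectl_meets_minimum(version, minimum=MIN_KUBECTL):
    """Return True if a kubectl version such as "v1.21.3" is new enough."""
    m = re.match(r"^v?(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized kubectl version: {version!r}")
    return (int(m.group(1)), int(m.group(2))) >= minimum


if __name__ == "__main__":
    # Print the command only when the (caller-supplied) version qualifies.
    if kubectl_meets_minimum("v1.21.0"):
        print(install_command("v1.3.0-alpha.3"))
```

Only the major and minor components are compared, since the requirement is given as 1.21.x.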

training-operator - v1.3.0-alpha.2

Published by Jeffwan about 3 years ago

v1.3.0 will be the first release to support TensorFlow, PyTorch, MXNet, and XGBoost distributed training jobs.
More background can be found in the design doc, All-in-one Kubeflow Training Operator.

Install the Kubeflow training operator by running:

 kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.3.0-alpha.2"

Requires kubectl >= 1.21.x.

training-operator - v1.3.0-alpha.1

Published by Jeffwan about 3 years ago

v1.3.0 will be the first release to support TensorFlow, PyTorch, MXNet, and XGBoost distributed training jobs.
More background can be found in the design doc, All-in-one Kubeflow Training Operator.

Install Kubeflow training operator by running:

 kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.3.0-alpha.1"

Requires kubectl >= 1.21.x.

training-operator - v1.2.0 release

Published by Jeffwan about 3 years ago

v1.2.0 (2021-08-03)

Full Changelog

Features

  • Add job namespace to tf_operator_jobs_* counters (#1283, @alembiewski)
  • feat: upgrade kubeflow common and volcano version (#1276, @shinytang6)
  • Add task type annotation for pods when EnableGangScheduling is true. (#1268, @jiangkaihua)

Bug fixes

  • Fix invalid pointer when tfjob is deleted (#1285, @johnugeorge)
  • fix get_logs pod_names type and iteration blocking (#1280, @Windfarer)
  • fix calling custom_api.delete_namespaced_custom_object args error (#1281, @Windfarer)
  • fix: Remove the dup comment tag (#1274, @gaocegege)
  • Fix: Remove Github CD workflow (#1263, @PatrickXYS)
  • Fix: the "follow" of TFJobClient.get_logs (#1254, @Windfarer)

Misc

  • Update container image for v1.1.1 (#1328, @Jeffwan)
  • add a specific version of tensorflow_datasets (#1305, @jazzsir)
  • Remove vendor folder (#1288, @Jeffwan)
  • add podgroups rule in cluster-role.yaml (#1272, @huone1)
  • Use remote Kustomize build option in standalone installation instructions (#1266, @verult)

training-operator - v1.1.0 release

Published by Jeffwan over 3 years ago

This is a large official release following v0.5.3. Please share your feedback. Thanks to all contributors.

Features

  • feat: Remove k8s.io/kubernetes (#1235, @gaocegege)
  • Migrate to public ECR (#1256, @PatrickXYS)
  • feat: Add API Documentation WIP (#1249, @gaocegege)
  • feat: Update developers guide and readme (#1244, @gaocegege)
  • Move TF Operator e2e tests to AWS Prow (#1204, @ChanYiLin)
  • crd definition support multiple evaluator (#1240, @oikomi)
  • support multiple evaluators (#1239, @oikomi)
  • feat: Change the message for running condition (#1230, @gaocegege)
  • feat(server): Use apiextension client to check if crd exists (#1228, @gaocegege)
  • checkCRDExists func return true when k8s cluster is not connected (#1207, @oikomi)
  • feat: Add CD using GitHub Actions (#1196, @gaocegege)
  • Migrate controller implementation to kubeflow/common fashion (#1171, @ChanYiLin)
  • Support success policy for TFJob (#1165, @terrytangyuan)
  • add distributed training example of using TF 2.1 Strategy API (#1164, @jazzsir)
  • Set completion time when job exceed specified deadline. (#1150, @SimonCqk)
  • Support ClusterSpec Propagation Feature in TF 1.14 (#1149, @zhujl1991)
  • Add watch function for TFJob python Client API (#1122, @jinchihe)
  • Enhance tfjobs sdk docs (#1114, @jinchihe)
  • Generate TFJob Python SDK (#1103, @jinchihe)
  • feat: Support pprof when monitoring is specified (#1102, @gaocegege)
  • feat: Use kubeflow/common (#1088, @gaocegege)
  • Add support for aarch64 (#1098, @MrXinWang)
  • feat: Do not set TF_CONFIG for local training (#1080, @gaocegege)
  • feat: Replace gometalinter with golangci-lint (#1081, @gaocegege)
  • Add controller-name label for Pod and service (#1067, @hougangliu)
  • Add qps and burst options (#1063, @ScorpioCPH)
  • Avoid unnecessary update when tfjob is complete (#1051, @cheyang)
  • set annotation automatically when EnableGangScheduling is set to true (#1032, @ChanYiLin)
  • feat(pod): Support custom gang scheduler via CLI argument (#1050, @gaocegege)

Bug fixes

  • Fix kubeflow overlay (#1260, @PatrickXYS)
  • fix: Do not validate evaluator (#1238, @gaocegege)
  • fix: Remove default resync period (#1237, @gaocegege)
  • fix: Observe the creation when failed to create the pod (#1236, @gaocegege)
  • fix: Remove vendor cp command (#1232, @gaocegege)
  • Fix completion time setting bug (#1226, @shaowei-su)
  • feat(deploy): Add standalone deployment yaml (#1218, @gaocegege)
  • Fix updateStatus no worker Crashoff (#1215, @kuikuikuizzZ)
  • fix: Fix the log message (#1203, @gaocegege)
  • Fix the typo (#1178, @pingsutw)
  • Fix setup cluster issue and Pylint issue in CI tests (#1179, @jinchihe)
  • Fix the link to run_e2e_workflow.py script (#1154, @terrytangyuan)
  • Fix evaluator runconfig (#1146, @richardsliu)
  • Fix sdk test issue that's caused by kubenertes Client bug. (#1143, @jinchihe)
  • fix(controller): calculate satisfied with && instead of || (#1120, @GuoHaiqing)
  • fix comment, add +optional flag to comment. (#1137, @EDGsheryl)
  • fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured (#1118, @leileiwan)
  • fix the reconcile flow (#1111, @ChanYiLin)
  • Fix example Mnist With Summaries (#1073, @andreyvelich)
  • fix bug: When executing tf-operator.v1 -version, GitSHA is always 'not provided' (#1046, @asdfsx)
  • fix(UI): show correct namespace and name when deleting job through dashboard (#1044, @gbin10533)
  • Minor fix to add CoreV1 to scheme (#1037, @johnugeorge)
  • fix(docs): Fix link for simple_TFJob_test (#1038, @gaocegege)
  • fix: Remove dup code (#1022, @gaocegege)

Chores

  • tf-operator: Consolidate manifests (#1255, @yanniszark)
  • TFJob Operator: Move manifests development upstream (#1247, @yanniszark)
  • Update vendor as kubeflow/common is updated. (#1252, @jiangkaihua)
  • docs: Add Ant Group to ADOPTERS.md (#1243, @terrytangyuan)
  • chore: Add tencent cloud (#1234, @gaocegege)
  • add vip (#1233, @oikomi)
  • chore: Update changelog (#1227, @gaocegege)
  • Update kubeflow common to 0.3.2 (#1225, @shaowei-su)
  • chore: Remove useless expectation (#1217, @gaocegege)
  • chore: Update codegen (#1211, @gaocegege)
  • add Evaluator type for CRD example (#1209, @oikomi)
  • add err log for create client set failed and code minor optimization (#1210, @oikomi)
  • chore: Remove the kanban update workflow (#1201, @gaocegege)
  • chore: Refactor cmd (#1199, @gaocegege)
  • bugfix for multi_worker_strategy-with-keras.py (#1198, @jiaqianjing)
  • Fix error when conditions is empty. (#1185, @Corea)
  • b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language (#1190, @sculd)
  • chore: Update OWNERS (#1177, @gaocegege)
  • Update developer_guide.md (#1176, @pingsutw)
  • Update swagger-codegen-cli URL (#1172, @jinchihe)
  • Use go mod (#1144, @xychu)
  • Make tf_operator use static compilation in container (#1160, @MrXinWang)
  • Update tf_job_client.py remove unused variable. (#1157, @NikeNano)
  • Update e2e_testing.md (#1155, @NikeNano)
  • Disable istio sidecar injection in simple tfjob test (#1148, @Bobgy)
  • OWNERS: Add ChanYiLin as approver (#1147, @ChanYiLin)
  • Remove unused function arg (#1145, @zhujl1991)
  • docs: Add roadmap (#1140, @gaocegege)
  • simple_tfjob_tests py3 version (#1134, @gabrielwen)
  • add tf-operator test in py3 (#1133, @gabrielwen)
  • Distroless image for TF operator (#1124, @krishnadurai)
  • SDK support getting the TFJob training logs (#1130, @jinchihe)
  • Copy third party vendor source code to Docker image (#1128, @richardsliu)
  • Add third party licenses (#1127, @richardsliu)
  • remove tfjob dashboard (#1119, @ChanYiLin)
  • Update checking status API name (#1117, @jinchihe)
  • Add more APIs for TFJob done (#1116, @jinchihe)
  • feat: Add adopters in README (#1092, @gaocegege)
  • Support for ppc64le (#1082, @zoyun)
  • use multi-stage build to build tf-operator image (#1072, @hmtai)
  • add ppc64le support for the example dist-mnist (#1084, @alongzhi)
  • add the dockerfile for ppc64le (#1083, @alongzhi)
  • Updating issue bot configs (#1074, @rbrishabh)
  • Delete v1beta2 api (#1075, @johnugeorge)
  • add ldflag verion (#1052, @yeya24)
  • Add verify-codegen in travis CI (#1070, @ohmystack)
  • Set tfjob defaults in test utils (#1071, @ohmystack)
  • Update codegen (#1069, @ohmystack)
  • rewrite dockerfile (#1062, @hmtai)
  • Renaming labels to common types (#1064, @johnugeorge)
  • add total suffix in counter metrics (#1055, @yeya24)
  • Update k8s libraries to 1.12.3 (#1054, @johnugeorge)
  • add flag kubeconfig (#1049, @yeya24)
  • Easily detect the GOPATH in current development environment. (#1047, @xauthulei)
  • Update gang scheduler name (#1028, @goodluckbot)
  • Set worker 0 completed if pod's phase goto succeeded (#1042, @ScorpioCPH)
  • Removing unnecessary Rbac authorization (#1036, @johnugeorge)
  • refactor: add GenPodGroupName method to extract podGroupName in diffe… (#1034, @zlcnju)
  • update release script (#1040, @kunmingg)
  • Update image base to UBI8 GA (#1023, @pdmack)