training-operator | Python Ecosystem Directory

training-operator - v1.0.1-rc.5

Published by gaocegege over 3 years ago

training-operator - v1.0.1-rc.4

Published by gaocegege over 3 years ago

v1.0.1-rc.4 (2021-02-04)

Full Changelog

Closed issues:

I have some questions about the function createNewPod in pkg/controller.v1/tensorflow/pod.go #1221

Merged pull requests:

fix: Remove default resync period #1237 (gaocegege)
fix: Observe the creation when failed to create the pod #1236 (gaocegege)
feat: Remove k8s.io/kubernetes #1235 (gaocegege)
chore: Add tencent cloud #1234 (gaocegege)
add vip #1233 (oikomi)
fix: Remove vendor cp command #1232 (gaocegege)
feat: Change the message for running condition #1230 (gaocegege)
chore: Update changelog #1227 (gaocegege)

training-operator - v1.0.1-rc.3

Published by gaocegege over 3 years ago

v1.0.1-rc.3 (2021-01-27)

Full Changelog

Closed issues:

Error with release tag v1.0.1 "invalid memory address or nil pointer dereference" #1223

Merged pull requests:

feat(server): Use apiextension client to check if crd exists #1228 (gaocegege)

training-operator - v1.0.1-rc.2

Published by gaocegege over 3 years ago

v1.0.1-rc.2 (2021-01-27)

Full Changelog

Merged pull requests:

Fix completion time setting bug #1226 (shaowei-su)
Update kubeflow common to 0.3.2 #1225 (shaowei-su)

training-operator - v1.0.1-rc.1

Published by gaocegege over 3 years ago

v1.0.1-rc.1 (2021-01-18)

Full Changelog

Closed issues:

checkCRDExists func return true when k8s cluster is not connected #1206
How to install it without kubeflow #1195
Pod get re-created after it exited and get garbage collected #1186
Surface Pod and other Errors that Prevent TFJob from starting #1131
Jobs failing when a node is preempted #999

Merged pull requests:

feat(deploy): Add standalone deployment yaml #1218 (gaocegege)
chore: Remove useless expectation #1217 (gaocegege)
Fix updateStatus no worker Crashoff #1215 (kuikuikuizzZ)
chore: Update codegen #1211 (gaocegege)
add err log for create client set failed and code minor optimization #1210 (oikomi)
add Evaluator type for CRD example #1209 (oikomi)
checkCRDExists func return true when k8s cluster is not connected #1207 (oikomi)
fix: Fix the log message #1203 (gaocegege)

training-operator - v1.0.1-rc.0

Published by gaocegege almost 4 years ago

v1.0.1-rc.0 (2020-12-22)

Full Changelog

Closed issues:

tf-operator panic without worker role #1192
TFJob completion with active services/endpoints resources #1191
Having trouble viewing logs using Kubernetes dashboard #1189
[feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
TFJob cannot utilize GPUs in the node. #1184
[bug] With Python SDK, TFJob won't stop running #1183
[bug] [Python SDK] tfjob_client.get_logs broken #1182
How to create a python sdk for mxnet-operator #1181
[feature] python sdk should report errors in created TFJobs #1180
Could not introduce k8s.io/kube-openapi@master #1174
can tf-operator used in distribute scene, such as Multi-node #1173
Multi-worker training with Keras only use one GPU #1169
NCCL WARN Failed to open libibverbs.so[.1] #1168
tf-job-operator pod restarts #1167
swagger-codegen-cli-2.4.6.jar not found #1166
Cut release for tf-operator project #1163
Replace reconciler implementation with kubeflow/common JobController #1161
Error while replicating mnist_with_summaries #1159
Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
TFjob pods hang without explanation #1156
[Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
Is there any case to run the different command in tfReplicaSpecs? #1138
should gpu resource be released when tfjob failed because of image pull problem? #1136
tf-job-operator CrashLoopBackOff #1135
How to change the log level of tf-job-operator #1132
Support getting the training process via Python SDK #1129
Popgroup is not created automatically. #1121
TFConfig should be demonstrated more specifically. #1115
[chore] Remove tfjob dashboard #1113
read TF_CONFIG env from configMap #1112
Long job names result in jobs stuck forever #1101
[Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
can i install tf-operator alone without kubeflow? #1096
c #1095
TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
TFJob tests should use pytest #1093
Multiple Evaluator replicas gives InvalidTFJobSpec #1091
Java client for current version of TFjob #1090
[enhancement] Replace common with kubeflow/common #1087
Lack of documents for deployment #1086
Performance problem about pod informer #1079
[bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
Separate cluster scoped and namespace scoped resources #1077
TFJob 1.0 #1076
[bug] Keep tf-job-role as deprecated label in this version #1068
GenLabels may select wrong Pods #1066
Can I create a tf-operator pod without using GO? #1065
tf-job-dashboard cannot work #1060
[discussion] Should We Add CleanPodPolicy PS? #1059
Refactor dockerfile #1058
remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
Invalid value: "v1beta1": must appear in spec.versions #1056
Example on EKS: Device or resource busy #1053
can we add PriorityClassName when we create TF-job Podgroup? #1048
TFjob still running while chief pod is completed #1045
Is there any document for how to run TFJob in AllReduce Strategy #1039
tf-operator version conflicts #1035
Add E2E test for gang-scheduling #1033
gang schedule annotation #1031
[feature] Can we use one headless service for one job? #1030
Will tf-operator upgrading k8s to 1.13? #1029
no error log for create tfjob fail #1026
Creating tfjob in dashboard usability issues #1024
Deleting tf-job through the dashboard is not working #1019
Create common CRD validate and mutating webhook for all operator #1016
error with kubeflow installation #996
Shall we consider upgrading k8s to 1.11.3 #985
TFJob Dashboard is not support pvc #980
ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
Create CRD conversion webhook #967
Performance issue when there is a lot of completed jobs #965
Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
Proposal for a Common Operator #960
Delete pod with unknown status in reconcilePods #956
Create distributed training example for TF 2.0 #953
Consider using KubeBuilder to reduce boilerplate code #925
e2e test for dashboard/backend/handler/api_handler.go #921
Use pod group instead of PDB for gang scheduling #916
shareProcessNamespace not working with TFJob #902
[feasibility-research] Handle machine failure #900
Should limit the size of logs of tf_operator container #888
Log message severity isn't properly reported in stackdriver #864
E2E test for invalid spec errors #810
[v1alpha2] Delete resources according to cleanuppolicy exactly once #804
refactor the code of TFJobController for unittest #757
e2e test for cleanupTFJob #756
[build] Replace Python with Make or Bazel #739
Export TF/Tensorboard/TF Summaries to prometheus #722
[discussion] Maintain Helm Chart #716
[discussion] Capacity planning #708
[v1alpha2] Generate CRD validation in Kubernetes 1.11 #622
Set labels and annotations for svc created by tf_operator #609
mnist test isn't part of CI #597
[v1alpha2] Push the example docker image to google or dockerhub registry #590
feat: use fake client-set and informer add controller unittest. #540
Run submit_release_job.sh in CI #519
Add environment name in ControllerConfig #450
[dashboard] How to handle storage? #449
[dashboard] GPU limits are not taken into account #448
[dashboard] Ability to create a TensorBoard instance #447
[examples] Add termination policy in examples/tf_job.yaml #438
add boilerplate header #430
[logging] Extra flag problem #427
[CI] Add hack/verify-codegen.sh in Travis CI #426
E2E workflows should ignore failures #423
[enhancement] Add OWNERS in subdirectories #415
[enhancement] Fix the warnings reported by goreportcard.com #394
[discussion] Separate the operator and UI dashboard #389
[enhancement] Separate release image and test image #385
[enhancement][CI] Replace Travis CI with Prow #382
use Python3 for all python code? #377
What to do about example TFJob YAML specs? #375
E2E test for non-default namespace #170
OpenAPI Client Generation for Java, Python #167
Prevent scheduling deadlocks #165
TfDebugger support #132
Refactor code in py into a proper python package #114
Update instructions and code to work with Kubernetes 1.8 #108
Build sample container as part of release process #81
Run lint (Python, Go) as a presubmit test #53
Optimize scheduling of TF Processes #35
E2E test that verifies invalid jobs are failed #30
E2E test(s) to verify that permanent and retryable errors are handled correctly. #29

Merged pull requests:

chore: Remove the kanban update workflow #1201 (gaocegege)
chore: Refactor cmd #1199 (gaocegege)
bugfix for multi_worker_strategy-with-keras.py #1198 (jiaqianjing)
feat: Add CD using GitHub Actions #1196 (gaocegege)
b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language #1190 (sculd)
Fix error when conditions is empty. #1185 (Corea)
Fix setup cluster issue and Pylint issue in CI tests #1179 (jinchihe)
Fix the typo #1178 (pingsutw)
chore: Update OWNERS #1177 (gaocegege)
Update developer_guide.md #1176 (pingsutw)
Update swagger-codegen-cli URL #1172 (jinchihe)
Migrate controller implementation to kubeflow/common fashion #1171 (ChanYiLin)
Support success policy for TFJob #1165 (terrytangyuan)
add distributed training example of using TF 2.1 Strategy API #1164 (jazzsir)
Make tf_operator use static compilation in container #1160 (MrXinWang)
Update tf_job_client.py remove unused variable. #1157 (NikeNano)
Update e2e_testing.md #1155 (NikeNano)
Fix the link to run_e2e_workflow.py script #1154 (terrytangyuan)
Set completion time when job exceed specified deadline. #1150 (SimonCqk)
Support ClusterSpec Propagation Feature in TF 1.14 #1149 (zhujl1991)
Disable istio sidecar injection in simple tfjob test #1148 (Bobgy)
OWNERS: Add ChanYiLin as approver #1147 (ChanYiLin)
Fix evaluator runconfig #1146 (richardsliu)
Remove unused function arg #1145 (zhujl1991)
Use go mod #1144 (xychu)
Fix sdk test issue that's caused by Kubernetes Client bug. #1143 (jinchihe)
docs: Add roadmap #1140 (gaocegege)
fix comment, add +optional flag to comment. #1137 (EDGsheryl)
simple_tfjob_tests py3 version #1134 (gabrielwen)
add tf-operator test in py3 #1133 (gabrielwen)
SDK support getting the TFJob training logs #1130 (jinchihe)
Copy third party vendor source code to Docker image #1128 (richardsliu)
Add third party licenses #1127 (richardsliu)
Distroless image for TF operator #1124 (krishnadurai)
Add watch function for TFJob python Client API #1122 (jinchihe)
fix(controller): calculate satisfied with && instead of || #1120 (GuoHaiqing)
remove tfjob dashboard #1119 (ChanYiLin)
fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured #1118 (leileiwan)
Update checking status API name #1117 (jinchihe)
Add more APIs for TFJob done #1116 (jinchihe)
Enhance tfjobs sdk docs #1114 (jinchihe)
fix the reconcile flow #1111 (ChanYiLin)
Generate TFJob Python SDK #1103 (jinchihe)
feat: Support pprof when monitoring is specified #1102 (gaocegege)
Add support for aarch64 #1098 (MrXinWang)
feat: Add adopters in README #1092 (gaocegege)
feat: Use kubeflow/common #1088 (gaocegege)
add ppc64le support for the example dist-mnist #1084 (alongzhi)
add the dockerfile for ppc64le #1083 (alongzhi)
Support for ppc64le #1082 (zoyun)
feat: Replace gometalinter with golangci-lint #1081 (gaocegege)
feat: Do not set TF_CONFIG for local training #1080 (gaocegege)
Delete v1beta2 api #1075 (johnugeorge)
Updating issue bot configs #1074 (rbrishabh)
Fix example Mnist With Summaries #1073 (andreyvelich)
use multi-stage build to build tf-operator image #1072 (hmtai)
Set tfjob defaults in test utils #1071 (ohmystack)
Add verify-codegen in travis CI #1070 (ohmystack)
Update codegen #1069 (ohmystack)
Add controller-name label for Pod and service #1067 (hougangliu)
Renaming labels to common types #1064 (johnugeorge)
Add qps and burst options #1063 (ScorpioCPH)
rewrite dockerfile #1062 (hmtai)
add total suffix in counter metrics #1055 (yeya24)
Update k8s libraries to 1.12.3 #1054 (johnugeorge)
add ldflag version #1052 (yeya24)
Avoid unnecessary update when tfjob is complete #1051 (cheyang)
feat(pod): Support custom gang scheduler via CLI argument #1050 (gaocegege)
add flag kubeconfig #1049 (yeya24)
Easily detect the GOPATH in current development environment. #1047 (xauthulei)
fix bug: When executing tf-operator.v1 -version, GitSHA is always 'not provided' #1046 (asdfsx)
fix(UI): show correct namespace and name when deleting job through dashboard #1044 (gbin10533)
Set worker 0 completed if pod's phase goto succeeded #1042 (ScorpioCPH)
update release script #1040 (kunmingg)
fix(docs): Fix link for simple_TFJob_test #1038 (gaocegege)
Minor fix to add CoreV1 to scheme #1037 (johnugeorge)
Removing unnecessary Rbac authorization #1036 (johnugeorge)
refactor: add GenPodGroupName method to extract podGroupName in diffe… #1034 (zlcnju)
Update gang scheduler name #1028 (goodluckbot)

training-operator - v1.0.0-rc.0

Published by kunmingg over 5 years ago

tf-operator pre-graduation

training-operator - v0.5.3

Published by richardsliu over 5 years ago

training-operator - v0.5.2

Published by richardsliu over 5 years ago

training-operator - v0.5.1

Published by richardsliu over 5 years ago

training-operator - v0.5.0

Published by richardsliu over 5 years ago

training-operator -

Published by richardsliu over 5 years ago

training-operator - v0.4.0-rc.1

Published by richardsliu almost 6 years ago

training-operator - v0.4.0-rc.0

Published by richardsliu almost 6 years ago

Initial version of 0.4.0

TFJob v1beta1 API

training-operator - v0.3.0 Release

Published by jlewi about 6 years ago

The v0.3.0 release of the TFJob operator.

training-operator - v0.2.0-rc1

Published by kunmingg over 6 years ago

tf-operator release v0.2.0, part of Kubeflow release v0.2.0.

Features and improvements:

[v1alpha2] Set event for tfjob when spec is not valid #620
[enhancement] Fix the gofmt support #586
[go] Use dep instead of glide to reduce the size of vendor #556
[v1alpha2] Enhance the logic about sync #547
[v1alpha2] Use structured log #537
[log] investigate zap #534
[v1alpha2] Try to not to always claim pods #533
[v1alpha2] Support customized port #532
[v1alpha2] start using kubeconfig #522
v1alpha2 integration #521
TFJob operator surface queue metrics #503
[api] Remove pending pods from active pods #484
[enhancement] Set StartTime for TFJob status #475
[Feature] Support "eval" worker in tf-operator #444
Add appropriate logging fields to the tf-operator log messages #424
[enhancement] Refactor docs #379
Deprecate TfPort and set default port for users #327
[enhancement] Add e2e test cases for recorder #317
Make the TfJob controller more event driven #314
Potential data race, maybe #302
Don't leave pods running just to get logs #128
Add hyperparameter tuning? #112
Use headless services for Training jobs #40
More validation of TfJob #25

Fixed bugs:

[v1alpha2] RealServiceControl does not set owner reference #616
TfJob operator stops working on invalid spec #561
[v1alpha2]tfjob restartPolicy for Never #555
[v1alpha2] Potential bugs when there is one worker succeeded #538
[v1alpha2][test] Avoid potential data race problem #530
Phase is wrong unexpected TfJob phase: Done #110

Closed issues:

[v1alpha2] Make restart policy a pointer #692
[v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
[v1alpha2] add pod label with job name (without namespace) #672
[v1alpha2] Pods not deleted when job finishes #671
[v1alpha2] conditions not updated #668
[v1alpha2] Move control interface to separate package #665
[v1alpha2] Move test util to separate package #664
Speedup E2E test by running build and setup cluster in parallel #659
In TFjob, when the workers Completed, i want the ps Completed too, how can i do? #657
[v1alpha2] service names are prefixed with namespace #654
[v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
dep ensure give warning on k8s.io/apiserver #647
[v1alpha2] pod names don't include random salt #644
[v1alpha2]Unable to create pod #641
GPU tests failing; ks env doesn't exist #640
TFJob not marked as success when master exits but not workers #634
v1alpha2 - pod names don't include replica type #633
tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
tf_job_client blocks forever #606
[v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
[v1alpha2] Need ksonnet package #599
Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
[v1alpha2] Remove controller_utils.go #591
[v1alpha2] Add CI test #589
[question] dist_mnist example failed to run #588
can not set labels #580
v1alpha2 should use headless services #574
TFJob operator should pass through annotations to the pod #573
[test] Test failed because of ImagePullBackOff #567
Servable not found for request: Latest(mnist) #552
[v1alpha2] The state of distributed model training. #544
[test] copy labels and annotations to pod from tfjob #543
Unable to deploy the example TfJob in the user guide #535
[v1alpha2] Do not set default to always for restartpolicy #524
E2E test steps should exit with non zero exit code if test fails #514
[v1alpha2] Sync commits with v1alpha1 #490
Use OpenAPI validation for CRDs in k8s 1.9 #437
default install of kubeflow no longer install tf-job-dashboard #435
Use DAG functionality of Argo in our E2E tests #422
Post submits are failing with Argo #370
tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
Refactor TFJobStatus in CRD API #333
Deprecate the TfImage field #330
[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
Does TfJob controller need to do master election? #263
Setup Prow PR Dashboard #255
API: some comments about API changes from PR #215 review #249
e2e test for the case that the chief is not master #235
Use conditions instead of phase #223
Submitted tfjobs cease to start running under unknown conditions #203
Tutorials #195
Copy chart to kubernetes/charts #93
Create a web page to list releases #70
tensorflow 1.4 and estimator support #61
Set a default value for restartPolicy #55

Merged pull requests:

*: Add cleanpod policy for v1alpha2 #691 (gaocegege)
status: Fail the TFJob if PS is failed #690 (gaocegege)
Use tf_job_name not tf_job_key as the label name. #689 (jlewi)
pkg: Delete pods and services after finished #686 (gaocegege)
informer: Add comments and TODO #684 (gaocegege)
and some safety check #683 (u2takey)
Remove code that is no longer used. #681 (jlewi)
change comment with more related link #679 (u2takey)
return err if the spec area is nil after unmarshal for tfjob v1alpha2 #678 (jiaxuanzhou)
fix typo #677 (u2takey)
fix restart policy with comment #676 (u2takey)
label: Remove namespace from labels #675 (gaocegege)
controller: Move control interface to control package #670 (gaocegege)
defaults: Rename the type #669 (gaocegege)
Enable the E2E tests for v1alpha2. #667 (jlewi)
*: Move test util to separate package #666 (gaocegege)
Update dep and vendor #663 (xychu)
server: Make threadiness configurable #662 (gaocegege)
dist-mnist: Move to examples #660 (gaocegege)
tfjob: Add test for copy labels and annotations #658 (gaocegege)
*: Remove namespace from service name #656 (gaocegege)
pod: Add test for exit code #652 (gaocegege)
[v1alpha2] Estimator support - Do not include evaluator in cluster spec #650 (xychu)
pods: Add cluster spec test #649 (gaocegege)
pod: Submit an event when the user specifies the restartpolicy for pod template #648 (gaocegege)
v1alpha2 E2E tests for termination policy #646 (jlewi)
status: Add test cases for failure #643 (gaocegege)
Add proper error handling for deploying the tests. #642 (jlewi)
pods: Add restart policy #638 (gaocegege)
status: Support chief #637 (gaocegege)
*: Set name for the pod tfjob.name-type-index #636 (gaocegege)
Modify presubmits to support testing with v1alpha2 #632 (jlewi)
Updates to enable e2e test for v1alpha2 #629 (ankushagarwal)
Use logging.exception to capture stack traces in logs #627 (ankushagarwal)
Pass TFJob API version instead of hardcoding it #626 (ankushagarwal)
[v1alpha2] Add distributed state management #625 (yph152)
pkg: Send events when receive invalid spec #623 (gaocegege)
pkg: Support customized port #621 (gaocegege)
Add apiVersion parameter to simple_tfjob component #619 (ankushagarwal)
api_handler: Fix import order #618 (gaocegege)
service_control: Set owner ref for service and add test cases #617 (gaocegege)
test: Add test cases for service ref manager and control interface #615 (gaocegege)
controller: Refactor and add test cases for helper #614 (gaocegege)
[dashboard] Upgrade to v1alpha2 #613 (wbuchwalter)
controller: Improve coding styles #612 (gaocegege)
Informer: Use unstructured #610 (gaocegege)
TFJob client should not block forever trying to get the namespace object #607 (jlewi)
crd: Add validation using OpenAPI 3.0 #605 (gaocegege)
dist_mnist: Add unused_argv #604 (gaocegege)
service: Refactor to the slice structure #603 (gaocegege)
Delete the old releaser code which is no longer used. #602 (jlewi)
Add tf-operator.v2 to release.py so that we build a Docker image containing the v1alpha2 controller #601 (jlewi)
Add a new command-line argument for release.py #595 (chaoleili)
controller: Remove dup code and use k8s.io/kubernetes/controller #594 (gaocegege)
test: Fix data race problem #593 (gaocegege)
.travis.yml: Fix cmd errors #592 (gaocegege)
Fix the gometalinter support #587 (wgliang)
Format go code and fix spelling errors #585 (wgliang)
docs: Add quick start for v1alpha2 #584 (gaocegege)
mnist: Add corresponding yaml config #583 (gaocegege)
pod: Add update logic #582 (gaocegege)
.pylinrc: Add dist_mnist #581 (gaocegege)
.travis.yml: Add failure notification in GitHub #579 (gaocegege)
controller_status: Remove pending pods from active pods #578 (gaocegege)
api: OpenAPI support #577 (gaocegege)
controller_service: Headless service #576 (gaocegege)
replace glide with dep in the developer guide #572 (ChanYiLin)
[v1alpha2]fix bug int to string for index #571 (yph152)
Fix missing string for logging placeholder #570 (zacharyzhao)
Update py_lint and py_test #569 (ankushagarwal)
Update test worker image to kubeflow-ci #568 (ankushagarwal)
chart: Remove #566 (gaocegege)
add OwnerReferences to pdb #565 (ChanYiLin)
Correct typos in README #559 (ntenenz)
vendor: Use dep instead of glide and prune it #557 (gaocegege)
set completion time on success #554 (u2takey)
README: Add tf-operator v1alpha2 design doc #553 (gaocegege)
Add dist mnist model for e2e test #549 (ScorpioCPH)
controller: Refactor controller_pod #548 (gaocegege)
Replace kubeflow-images-staging with kubeflow-images-public #546 (ankushagarwal)
copy labels and annotations to pod from pod template #542 (u2takey)
add workqueue and reflect metrics #541 (zjj2wry)
fix the bug of keeping creating new pdb #539 (ChanYiLin)
signals: Add #531 (gaocegege)
OWNERS: Add @ddysher and @willb as reviewers #529 (gaocegege)
developer_guide: Add instructions for v1alpha2 #528 (gaocegege)
tests: Fix #527 (gaocegege)
v1alpha2: Add implementation #526 (gaocegege)
v1alpha2: update flag kubeconfig #525 (yph152)
v1alpha2: Add API and codegen #523 (gaocegege)
Reenable cluster teardown. #520 (jlewi)
Only identify specific exit codes as retryable error #518 (0olwzo0)
update OWNERS #516 (mitake)
Create a script to release the TFJob operator image #515 (jlewi)
Fix output on test failure #511 (jose5918)
Adds gcloudignore #510 (jose5918)
RFC: Add a new command for generating example TFjobs #509 (mitake)
Use a CentOS 7 base image for the tf-operator image #469 (tmckayus)

training-operator - Initial release of the TFJob operator

Published by jlewi over 6 years ago

gcr.io/kubeflow-images-staging/tf_operator@sha256:1a3d1a2ee90f0108fff3e29023228fc686afbfa311752e8b3bf71859d488b435

v0.1.0 (2018-03-29)

Closed issues:

[v1alpha2] Implement condition update #502
E2E tests timing out; job appears to remain in running state even though job is done. #500
[v1alpha2] TF_CONFIG should be configurable by user #499
[test] All log is 404 in argo #496
Presubmit shows succeeded, but some test actually failed. #479
Waiting pods start too long #461
[test] Add unit test for pkg/controller #455
Create a suitable OWNERS file in /dashboard #443
Tide is misconfigured for this repository. #433
CI failed to setup the cluster #420
[docs] Add dashboard readme #411
Make coverall results advisory and not report as failure #406
Presubmits failing due to lint #404
[enhancement] Fix go vet errors which not caught by the compilers #395
User facing website for Kubeflow that details how to choose a stack #371
[discussion] How to set clusterspec #369
[enhancement] Rename the cmd/tf_operator to cmd/tf-operator #363
Local releaser fails due to version_tag #360
Helm test failure not reported to gubernator #355
[discussion] Whether to create CRD in helm charts #353
Should resourcelock be in the same namespace as controller? #352
Helm test tf-job does not pass validation #351
Move tensorflow/k8s to kubeflow/tf-operator #350
Get rid of TensorBoard replica #347
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs #346
Deprecate the ENV MY_POD_NAMESPACE and MY_POD_NAME #341
[feature] Does tfJob support setting different label/envVar for each worker(replicas >1)? #340
[Discussion] Time to start tagging releases for the TF operator? #339
[discussion] Should group name be tensorflow.org or kubeflow.io or kubeflow.org? #337
dashboard silent error during calling non-existent tfjob #335
in dashboard, silent error when nonexistent namespace is specified #334
Deprecate the IsDefaultPS field #329
[Convention] Replace Tf with TF in CRD #328
Standardise labels for issues and PRs #326
Manage Pods directly instead of using Job controllers #325
TfJobs dashboard not showing jobs #324
TfJobs dashboard doesn't work with K8s API server proxy or envoy proxy #323
Recreating a failed/successful job with same name doesn't work #322
Releaser incorrectly tags images as "dirty" #321
Reenable the releaser #320
E2E tests are not isolated #318
Need to mark prow job as failed if any tests fail #315
Remove outdated branch wbuchwalter-patch-1 #311
E2E test delete and recreate job with same name #310
TrainingJob.reconcile not called periodically #309
rename master to chief #306
Assign resource quota for TensorBoard #304
Jobs evicted for lack of memory, potentially add resource field to tf-job prototype #301
[Discussion] Operators vs. controller pattern #300
[bug] Add a default pod template for PS #297
Bunch of pylint error messages #294
Fix Head #293
Operator deployment fails post-v20180108-190394d #292
Promote last known good release #290
[bug] metadata.ownerReferences.apiVersion is not set #288
fail to run example job. invalid job spec: tfReplicaSpec.TfPort can't be nil #284
[bug] Build log 404 in https://prow.k8s.io/?repo=tensorflow%2Fk8s #282
[feature] Separate the CRD and controller #281
Gaps in test coverage #280
Regression in flag name: controller-config-file #279
[bug] glog before flag.Parse() #275
build new code to new image and find some problem #274
Fix the releaser so we can build new images #270
deploy.py gives gcloud api error '... Version "1.8.1-gke.1" is invalid.' #268
Pods terminated without waiting #267
Attach appropriate header (copyright) to go files #266
suppose i've install the tfjob in my k8s cluster #265
what's the folder pkg for? #264
Build failing because of lint issues #256
what's the main change between version 0.2 and version 0.3? #247
SetupCluster failures unexpected keyword argument 'client_configuration' #242
GPU test marked as succeeded but airflow step is failing #240
Use Kubeflow & ksonnet to install TfJob #239
tf_smoke.py distributed computing doesn't work on minikube #238
example-job can not work in private k8s cluster #233
Test failures aren't properly reported in Gubernator #229
[CRD] Request for input and output dirs in TFJobSpec #224
TfJob should be marked as failed if setup fails #218
panic: runtime error: invalid memory address or nil pointer dereference can not run in k8s 1.8.5 #212
Rethink the TFJob CRD #209
ksonnet configs for deploying the TfJob CRD & Controller #208
Make default TfImage configurable by users #207
refactor the TfJob to use Informer and Controller #206
Use Argo workflow engine for CI/CD or releases #205
Potential issue with Tensorboard / value of simple best-practices example with tboard #202
Investigate using buildah to build our images #201
E2E tests pre & postsubmits are failing #196
Publishing a client to pypi #193
Don't require a master or chief #192
Make cloning the repo and building the artifacts separate commands in py/release.py #189
Handle the case where grpcServerFilePath is the empty string #188
Make Airflow logs accessible #185
Complement docs for Python 3rd party dependencies #181
Helm Test fails because grpcServerFilePath is the empty string #179
Helm should only set --controller_config_file conditionally #175
Troubleshooting Guide: no matches for tensorflow.org/, Kind=TfJob #174
no matches for tensorflow.org/, Kind=TfJob #173
Failed to build TFOperator #171
E2E test for GPUs #164
TfJob doesn't work on minikube #160
Deleted jobs re-starting #156
Use coveralls.io to report and check code coverage #155
Clarify scope of tensorflow/k8s #150
After init helm, install chart failed #149
Helm test; insufficient permissions on RBAC clusters #135
Need to trim trailing slash of host string in TfJobRestClient.Watch() #130
results of lint test aren't reported in junit file used by gubernator #126
Collaborators need to be K8s members to trigger tests #122
Extend Test Infrastructure to run multiple E2E tests in parallel #120
initResource() failed; findAllTfJobs returned error: #118
Latest tag on gcr.io is not up to date #116
duplicate #115
postsubmit results aren't showing up in testgrid #113
TensorBoard replica set not deleted when job deleted. #107
helm permission issue on 1.8.1 #106
Run python unittests as part of pre/post/periodic tests #101
E2E tests are failing #96
E2E Test log should capture output from helm-test #95
Rename TfJob kind to remove mlkube.io #89
Setup travis for tensorflow/k8s #88
Update repo to use its new location tensorflow/k8s #86
mlkube.io -> tensorflow/k8s #85
Update prow to use repo tensorflow/k8s #84
periodic test is failing #83
runner.py needs to create build-log.txt with stdout/stderr of test #82
E2E tests leaking GKE clusters #80
No results show up if you click on mlkube-build-periodic #76
No results show up in prow test grid for presubmit jobs #75
Include TfJob name in labels #72
Simplify/Clarify Accelerators config #71
Clean up examples; don't require cloning the repo #68
How to create TF Jobs from the user side? #67
Change version from beta -> alpha #65
API Review #64
Setup release process for CRD #63
Post submit jobs don't correctly upload artifacts to GCS #62
presubmit test(bootstrap.py) doesn't properly check out PRs #59
E2E Test for default PS server #58
UI / Kubernetes Dashboard Integration #57
E2E test for GPUs #54
Integrate with Prow for Continuous Testing #46
Consider how we manage replicas (stateful sets, managing pods directly) #45
Use K8s Garbage Collection #42
func c.findAllTfJobs() in controller.go will never reach #41
Rename project #34
Structured (Json) logging for Tf Processes #32
Permanent errors don't cause job failure #28
If handling Add event fails, TfJob should be marked as failed with appropriate error #26
Structured Logging For the operator #24
Operator Log Spam; replicas.go:287] No container named: tensorflow found for pod; assuming POD is running #23
Provide a default value for TfPort, replicas, and tfReplicaType #22
Setup continuous build of containers #19
Should this be converted to a Custom Resource Definition (CRD) in anticipation of 1.7 #17
Run TensorFlow server for parameter servers by default #16
TensorBoard Integration #13
Dependency management #7
Better GPU support #6
TfJobRestClient.Create doesn't set kind appropriately #5
Add a creationTimestamp #4

Merged pull requests:

Fix outdated information about GPUs in README #513 (mindprince)
Don't leave pods running when a job completes. #512 (jlewi)
Check running status more gracefully #507 (ScorpioCPH)
test: Add test cases for condition #506 (gaocegege)
test: Fix failed case because of update status #505 (gaocegege)
Add condition logic code #504 (ScorpioCPH)
Fix bug with jobs not being marked as completed. #501 (jlewi)
release: Fix style #498 (gaocegege)
pkg: Fix the code changed in #486 #497 (gaocegege)
Set JSONLogFormat to false by default #495 (ScorpioCPH)
Fix env append issue #494 (ScorpioCPH)
Add dist-mnist for e2e test #493 (ScorpioCPH)
Set restart policy #491 (ScorpioCPH)
test: Add test cases #488 (gaocegege)
Add sleep and random exit image for e2e test #487 (ScorpioCPH)
fixed some golint warning #486 (AK-ayush)
Support testing on minikube. #485 (jlewi)
controller: Add defaulter #483 (gaocegege)
controller: Add check for service and fix service #482 (gaocegege)
controller: Separate ps and worker pods #481 (gaocegege)
controller: Add internal state test #480 (gaocegege)
*: Fix some errors in Travis CI #477 (gaocegege)
controller: Update status in time #476 (gaocegege)
add LabelsByIndex method to eliminate code duplication #474 (rc-zhang)
Make RestartPolicy a property of the ReplicaSpec #473 (ScorpioCPH)
Update tfjob status #472 (ScorpioCPH)
Use headless services for Training jobs #471 (rc-zhang)
Append labels instead of rewriting #468 (ScorpioCPH)
test: Add unit test for controller #467 (gaocegege)
linter: Fix linter ignore file #466 (gaocegege)
Fix field selectors in controller #465 (wbuchwalter)
Run ks upgrade #464 (lluunn)
Import v1alpha2 logic code #463 (ScorpioCPH)
Fix owners file id #462 (lluunn)
Remove deprecated package retryutil #460 (ScorpioCPH)
Change test cluster to kubeflow-ci #459 (lluunn)
Update API to v1alpha2 #457 (ScorpioCPH)
*: Remove APIExtension clientset #454 (gaocegege)
travis: Ignore generated code #453 (gaocegege)
Create PDB of TFReplicaSet for gang scheduling by kube-arbitrator #452 (mitake)
Add OWNERS file for dashboard #446 (wbuchwalter)
Make local release cross-platform + fix #445 (wbuchwalter)
Add proxying to front-end development server. #442 (wbuchwalter)
Fix dashboard + proxy incompatibility #441 (wbuchwalter)
change kubeflow.io to kubeflow.org #440 (Jimexist)
Remove unreachable code #434 (ScorpioCPH)
*: Remove type ContainerName #432 (gaocegege)
add boilerplate header for go file #431 (wackxu)
format the python files with yapf #429 (mitake)
clientset: Fix code which is changed manually #428 (gaocegege)
Delete Dockerfile to build a docker image to use for prow. #425 (jlewi)
Fix setup_cluster. #421 (jlewi)
Add ScorpioCPH as approver/reviewer #419 (ScorpioCPH)
Create resources (Services/Jobs) only once #418 (ScorpioCPH)
Dashboard: Dev Guide #417 (wbuchwalter)
Use logrus for structured logging #416 (ankushagarwal)
Create an initial OWNERS file. #414 (jlewi)
Docs should refer to Kubeflow user guide for deploying the TFJob operator #412 (jlewi)
Run glide update to update glide.lock #410 (ankushagarwal)
Fix typo in Makefile #409 (ankushagarwal)
Add a field SchedulerName to TFJob for specifying a scheduler #408 (mitake)
Fix lint issues with python3 and a bug in lint script #405 (jlewi)
Support using our E2E workflow to build a Docker image for releases. #403 (jlewi)
add go 1.10 support in travis #402 (Jimexist)
use yapf to format python code #401 (Jimexist)
Fix bug with jobs not working if you recreate a job with same name as previous job #399 (jlewi)
Fixes go vet errors #397 (swiftdiaries)
Fixed-363: Rename cmd/tf_operator -> cmd/tf-operator #393 (AK-ayush)
README: Add community section and quick links #392 (gaocegege)
Remove TensorBoard related code in operator #391 (gaocegege)
Fix something after move to kubeflow/tf-operator #390 (sdf611097)
Add a prow_config.yaml file to configure our prow jobs. #388 (jlewi)
fix a typo in the README file. #387 (ChanYiLin)
*: Replace the repo name #386 (gaocegege)
travis: Add go build command #383 (gaocegege)
config.sh: Remove #381 (gaocegege)
Use ksonnet to easily define TFJobs to be run as tests #374 (jlewi)
Fix repo name env #372 (jose5918)
controller.go: Fix a glog typo #368 (gaocegege)
fix -version option: print version #367 (caogj)
*: Add copyright owner in go files #364 (gaocegege)
Fix local releaser #361 (jose5918)
nit: try to simplify e2e main.go #359 (Jimexist)
Use Argo rather than Airflow to run our E2E tests #358 (jlewi)
Add an option to release.py to specify the tag for the image to use. #357 (jlewi)
Fix helm test #356 (jose5918)
feat(group): Update CRD group to kubeflow.org #354 (gaocegege)
Deprecate the ENV MY_POD_NAME and use default namespace #348 (ScorpioCPH)
feat(crd): Separate CRD and controller #345 (gaocegege)
Create Pod instead of Job #344 (ScorpioCPH)
Deprecate IsDefaultPS in TFJob CRD API #343 (ScorpioCPH)
Update documentation #342 (jose5918)
feat(dashboard): Namespace handling #338 (wbuchwalter)
feat(dashboard): better error handling in dashboard code #336 (Jimexist)
Rename Tf to TF #332 (ScorpioCPH)
Delete binary file #331 (ScorpioCPH)
Take test failures into account when setting prow job status #319 (jlewi)
remove unused file rename.sh #316 (caogj)
add UpdateFunc to handle update events #313 (mqliang)
pkg: Add recorder support #312 (gaocegege)
Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308 (jlewi)
replace TPR with CRD #307 (mqliang)
fix broken link #305 (caogj)
Fix python lint checks #303 (jlewi)
Fix setting defaults. #299 (jlewi)
Add service account name to dashboard if RBAC. #298 (ConnorDoyle)
The flag should be --controller-config-file. #295 (jlewi)
Fix the junit XML file format. #291 (jlewi)
*: Fix API Version #289 (gaocegege)
*: Implement the List interface for TfJobList #278 (gaocegege)
cmd: Fix the flag error caused by pflag #277 (gaocegege)
types.go: Fix CRDKind #276 (gaocegege)
Move around due to new directories layout #273 (ScorpioCPH)
bugfix: set faliures=true if failed deleting configmap #272 (mqliang)
Fix our continuous release process #271 (jlewi)
update initialClusterVersion to 1.7.11-gke.1 #269 (cwbeitel)
Misc Cleanup. #262 (jlewi)
Add proposed directories layout #261 (ScorpioCPH)
record event when tf_operator failover #260 (zjj2wry)
follow kubernetes flag convention #259 (zjj2wry)
refactor dashboard backend, use versioned tfjob clientset #258 (zjj2wry)
apply goimports -w to generated files #257 (Jimexist)
add gometalinter into travis build #254 (Jimexist)
fix(no-dup): reduce dup code in printVersion #253 (Jimexist)
Improve utilities for E2E tests. #251 (jlewi)
Fix leaking of clusters in E2E tests #80 #250 (jlewi)
feat(pipenv): Use pipenv to lock down python dependencies #248 (Jimexist)
fix(lint): add prop types and fix all eslint errors #246 (Jimexist)
refactor code and format imported package #245 (zjj2wry)
feat(lint): apply prettier to format frontend src/ code #244 (Jimexist)
feature(lint): use prettier and lint-staged for frontend javascript code #243 (Jimexist)
Fix issues with tf_job_gpu test #241 (jlewi)
Use the release/test python scripts pulled from the repo. #237 (jlewi)
Don't run glide install in travis builds. #236 (jlewi)
refactor the controller logic #234 (wackxu)
feat(coverage): add coveralls support #232 (Jimexist)
use glide install --strip-vendor remove subpackage vendor #231 (zjj2wry)
update k8s dependency to stable version #230 (wackxu)
let tfJob image configurable #228 (zjj2wry)
remove todo, add gitSHA into version information #227 (zjj2wry)
controller.go: Fix a print error #226 (gaocegege)
replace tf-job-operator-config configmap when it already exist #225 (zjj2wry)
Add the vendor directory to the repository. #222 (zjj2wry)
allow using WORKER:0 as chief #221 (lluunn)
Fix issue with handling of json errors. #220 (jlewi)
Set state to failed if there is a problem initializing job #219 (jlewi)
On GKE mounting volumes should no longer be required for GPUs. #217 (jlewi)
update developer guide #216 (ddysher)
Refactor the TfJob to use K8s libraries #215 (wackxu)
Add a basic GPU job test as part of our E2E tests. #213 (jlewi)
minor spelling porxy => proxy #211 (cbockman)
Add terminationPolicy to TfJobSpec #204 (lluunn)
Split cloning the repo and building the images into two steps in our airflow pipeline #200 (jlewi)
Create separate commands to clone and build the repo #199 (jlewi)
Install yarn and nodejs inside the Airflow container. #198 (jlewi)
Update the Airflow deployment to use Docker images built from a clean tree #197 (jlewi)
Fix some cuda issues on Azure #194 (wbuchwalter)
Fixing front page documentation to have grpcServerFilePath #190 (hyperbolic2346)
Add an option to build Docker images with GCB. #187 (jlewi)
replace deprecated tf.initialize_all_variables #184 (DjangoPeng)
build_and_push.py: Support python3 #183 (gaocegege)
tf_job_design_doc: Fix the apiVersion #182 (gaocegege)
py: Add requirements.txt #180 (gaocegege)
resolve a merge conflict imported by commit ae8c31 #178 (DjangoPeng)
tf_job_design_doc.md: Fix a typo #177 (gaocegege)
Fix helm templates so that we don't require a configmap. #176 (jlewi)
replace Google and Golang repos with corresponding github repos #172 (DjangoPeng)
Stop hardcoding namespace for TfJob config map #169 (haitch)
Tooling to make it easier to run a bunch of TfJob tests. #168 (jlewi)
Run python lint and unittests as part of our E2E test pipeline #166 (jlewi)
A binary to run pylint and python unittests #163 (jlewi)
fix dev guide #162 (lluunn)
Integrate Airflow with Prow #158 (jlewi)
rename jlewi/mlkube.io in glide.yaml #153 (moon03432)
add Create(), Delete() in TfJobClient interface #152 (moon03432)
change jobname from task-runtimeid-index to jobname-task-runtimeid-index #151 (moon03432)
Create binaries to run steps in an E2E test pipeline. #148 (jlewi)
Fix a typo in the command line help. #147 (jlewi)
ignore too-many-locals. #146 (jlewi)
On RBAC clusters, test needs a service account with appropriate permissions #145 (jlewi)
Airflow pipeline to run our tests #144 (jlewi)
fix(*): amend the number of worker and ps in example yaml spec for a distributed job #142 (lienhua34)
fix a log issue #141 (moon03432)
rename clus to tfjob in controller.go #138 (moon03432)
rename InClusterConfig() to GetClusterConfig() #137 (moon03432)
Remove trailing slash of host #134 (ScorpioCPH)
Turn release.py into a binary to build the artifacts for all the different contexts #133 (jlewi)
Minor fix typo and redundancy #131 (ScorpioCPH)
Update developer_guide.md #129 (Jimexist)
Use K8s Garbage Collection #127 (jlewi)
Dashboard V1 #125 (wbuchwalter)
More verbose logging of resource deletion #124 (jlewi)
Fix rbac settings in chart. #123 (jlewi)
Fix issue in tpr_util.Delete() #121 (wbuchwalter)
Tag docker images with "latest". #119 (jlewi)
Update API group in the chart #117 (sozercan)
Helm instructions #111 (jlewi)
Name label #105 (jlewi)
Update helm install syntax in readme #104 (sozercan)
Change group to tensorflow.org and version to v1alpha1. #103 (jlewi)
[WIP] Notebook demonstrating use of TfJob on GKE #102 (jlewi)
Fix bugs in the release script. #100 (jlewi)
Fix bugs in the release script. #99 (jlewi)
Update release.py so we can run it continuously. #98 (jlewi)
Fix the E2E test by specifying cloud when deploying the helm package. #97 (jlewi)
Need to set environment to enable Estimators with TF <=1.3 #94 (jlewi)
Update README.md #92 (Jimexist)
Add python lint check to travis and fix python lint issues #91 (jlewi)
#71 Simplify accelerators config #90 (wbuchwalter)
Update test infrastructure to use repo tensorflow/k8s #87 (jlewi)
Create symbolic links in GCS to output of presubmit results. #79 (jlewi)
Fix periodic results (#76) #78 (jlewi)
Another attempt to fix periodic jobs. #77 (jlewi)
Fix location of the post submit results. #74 (jlewi)
Overhaul the documentation #73 (jlewi)
Release scripts #69 (jlewi)
Record latest green from postsubmit #66 (jlewi)
Fix presubmit jobs and periodic jobs #60 (jlewi)
Fix periodic test #56 (jlewi)
Updated chart with batch.jobs and extensions.deployments cluster roles #52 (sozercan)
Added RBAC support for tf-operator chart #51 (sozercan)
PR to test Prow presubmit integration. #50 (jlewi)
E2E test for the CRD #49 (jlewi)
Create configs for setting up Prow for continuous testing. #47 (jlewi)
Fix bug that prevents permanent errors from causing job failure. #44 (jlewi)
Always check for existing TfJobs and instantiate controllers for them. #43 (jlewi)
support multi namespaces #39 (loadwiki)

Use Jinja templates and a Python script to build example Docker images for examples #37 (jlewi)

Parameter Server: Run TF server by default #36 (wbuchwalter)
Set default values for Replicas, TfPort, TfReplicaType. #31 (jlewi)
Fix a couple bugs. #27 (jlewi)
[WIP] Update to CustomResourceDefinition instead of ThirdPartyResource. #20 (jlewi)
Update glide config. #18 (jlewi)
Add TensorBoard Integration #15 (wbuchwalter)
Changes to support CI using Travis. #14 (jlewi)
Add Environment Variables in Controller Config #12 (wbuchwalter)
Fix tests #11 (wbuchwalter)
Helm charts renaming #10 (wbuchwalter)
Simplify GPU configuration process. #9 (jlewi)
Fix build, add Glide for dependency management. #8 (wbuchwalter)
Update links in README.md #3 (wbuchwalter)
A more thorough E2E test. #2 (jlewi)
Create a helm chart for deploying the TfJob operator #1 (jlewi)