🎱 A demonstration of existing machine learning toolkits on Kubernetes
This repo serves as an example to demonstrate the typical machine learning workflow and how to leverage existing machine learning toolkits for Kubernetes to enhance the development and operations lifecycle.
Here is what you will find in this repo:
This example is based on the Tensorflow Image Retraining example.
We have modified the example to retrain inception v3 model to identify a particular celebrity. Using the retrained model, we can get predictions like the following:
Train with Inception v3 and Automate with containers
# build
docker build -t ritazh/image-retrain-kubecon:1.9-gpu -f train/Dockerfile.gpu ./train
# push
docker push ritazh/image-retrain-kubecon:1.9-gpu
# run
docker run --rm -v $PWD/tf-output:/tf-output ritazh/image-retrain-kubecon:1.9-gpu "--how_many_training_steps=4000" "--learning_rate=0.01" "--bottleneck_dir=/tf-output/bottlenecks" "--model_dir=/tf-output/inception" "--summaries_dir=/tf-output/training_summaries/baseline" "--output_graph=/tf-output/retrained_graph.pb" "--output_labels=/tf-output/retrained_labels.txt" "--image_dir=images" "--saved_model_dir=/tf-output/saved_models/1"
Visualize with Tensorboard
# build
docker build -t ritazh/tensorboard:1.9 -f ./train/Dockerfile.tensorboard ./train
# push
docker push ritazh/tensorboard:1.9
# run
docker run -d --name tensorboard -p 80:6006 --rm -v $PWD/tf-output:/tf-output ritazh/tensorboard:1.9 "--logdir" "/tf-output/training_summaries"
Serve Trained Model for Inference with TF Serving
docker run -d --rm --name serving_base tensorflow/serving
docker cp tf-output/saved_models serving_base:/models/inception
docker commit --change "ENV MODEL_NAME inception" serving_base $USER/inception_serving
docker kill serving_base
docker run -p 8500:8500 -t $USER/inception_serving &
python serving/inception_client.py --server localhost:8500 --image test/fbb1.jpeg
You should get an output as follows:
outputs {
key: "classes"
value {
dtype: DT_STRING
tensor_shape {
dim {
size: 2
}
}
string_val: "fbb"
string_val: "notfbb"
}
}
outputs {
key: "prediction"
value {
dtype: DT_FLOAT
tensor_shape {
dim {
size: 1
}
dim {
size: 2
}
}
float_val: 0.95451271534
float_val: 0.0454873144627
}
}
model_spec {
name: "inception"
version {
value: 1
}
signature_name: "serving_default"
}
Using acs-engine with kubernetes v1.11.4
Install ksonnet version 0.13.1, or you can download a prebuilt binary for your OS.
# install ks v0.13.1
export KS_VER=0.13.1
export KS_PKG=ks_${KS_VER}_linux_amd64
wget -O /tmp/${KS_PKG}.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz --no-check-certificate
mkdir -p ${HOME}/bin
tar -xvf /tmp/$KS_PKG.tar.gz -C ${HOME}/bin
echo "PATH=$PATH:${HOME}/bin/$KS_PKG" >> ~/.bashrc
source ~/.bashrc
Install argo CLI
Run the following commands to deploy Kubeflow components in your Kubernetes cluster:
NOTE: This demo has been updated to use kfctl.sh.
# download kubeflow
KUBEFLOW_SOURCE=kubeflow # directory to download the kubeflow source
mkdir ${KUBEFLOW_SOURCE}
cd ${KUBEFLOW_SOURCE}
export KUBEFLOW_TAG=v0.4.1 # tag corresponding to kubeflow version
curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
# init kubeflow app
KFAPP=mykubeflowapp
${KUBEFLOW_SOURCE}/scripts/kfctl.sh init ${KFAPP} --platform none
# generate kubeflow app
cd ${KFAPP}
${KUBEFLOW_SOURCE}/scripts/kfctl.sh generate k8s
# apply kubeflow to cluster
${KUBEFLOW_SOURCE}/scripts/kfctl.sh apply k8s
# view created components
kubectl get po -n kubeflow
kubectl get crd
kubectl get svc -n kubeflow
Setup PVC components to persist data in pods. https://docs.microsoft.com/en-us/azure/aks/azure-disks-dynamic-pv
export RG_NAME=kubecon
export STORAGE=kubeconstorage
az storage account create --resource-group $RG_NAME --name $STORAGE --sku Standard_LRS
# kubectl create -f ./deployments/azure-pvc-roles.yaml
kubectl create -f ./deployments/azure-file-sc.yaml
kubectl create -f ./deployments/azure-file-pvc.yaml
Check status
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
azurefile Bound pvc-d686be3e-bc75-11e8-a08d-000d3a4f8d49 5Gi RWX azurefile 4h
[OPTIONAL] If you want to use a static Azure files instead of creating PVCs,
kubectl create secret generic azure-file-secret --from-literal=azurestorageaccountname=$STORAGE_ACCOUNT_NAME --from-literal=azurestorageaccountkey=$STORAGE_KEY
volumes:
- name: azure-files
azureFile:
secretName: azure-files-secret
shareName: data
readOnly: false
Deploy TFJob and tensorboard
Run training job with TFJob
kubectl create -f ./deployments/tfjob-retrain.yaml
# check created tfjob
kubectl get tfjob
# check created pvc
kubectl get pvc
# check status of the training
kubectl logs tfjob-retrain-master-0
# after completed, clean up:
kubectl delete tfjob tfjob-retrain
Run Tensorboard (after the previous training is completed)
kubectl create -f ./deployments/tfjob-tensorboard.yaml
# Get public IP of tensorboard service
kubectl get svc
Clean up
kubectl delete -f ./deployments/tfjob-tensorboard.yaml
This step requires Azure Files mount to be available. Please refer to the Persist Data and Logs With Azure Storage section.
Ensure Helm and Virtual Kubelet are installed
helm version
Provide the following in hyperparameter/virtual-kubelet/values.yaml
targetAKS:
clientId:
clientKey:
tenantId:
subscriptionId:
aciRegion:
Then install the Virtual Kubelet chart in your cluster.
export VK_RELEASE=virtual-kubelet-latest
CHART_URL=https://github.com/virtual-kubelet/virtual-kubelet/raw/master/charts/$VK_RELEASE.tgz
helm install --name vk "$CHART_URL" -f ./hyperparameter/virtual-kubelet/values.yaml
kubectl get nodes
...
virtual-kubelet Ready agent 5s v1.11.2
Deploy hyperparameter chart to run our experiments in parallel on Azure Container Instance
helm install --name image-retrain-hyperparam ./hyperparameter/chart
az container list -g <ACI RESOURCE GROUP>
[OPTIONAL] If you do not want to use the minio component deployed as part of Kubeflow. Deploy your own Minio for S3 compatibility
- name: MINIO_ACCESS_KEY
value: "PUT AZURE STORAGE ACCOUNT NAME HERE"
- name: MINIO_SECRET_KEY
value: "PUT AZURE STORAGE ACCOUNT KEY HERE"
# Deploy Minio
kubectl create -f ./deployments/minio-azurestorage.yaml
# Get endpoint of Minio server
kubectl get service minio-service
Create a pipeline with Argo workflow
# namespace of all the kubeflow components
export NAMESPACE=kubeflow
# set to "minio" if using minio shipped as part of kubeflow
# set to "AZURE STORAGE ACCOUNT NAME" if minio is deployed as a proxy
# to Azure storage account from the previous step
export AZURE_STORAGEACCOUNT_NAME=minio
# set to "minio123" if using minio shipped as part of kubeflow
# set to "AZURE STORAGE ACCOUNT KEY" if minio is deployed as a proxy
# to Azure storage account from the previous step
export AZURE_STORAGEACCOUNT_KEY=minio123
MINIOIP=$(kubectl get svc minio-service -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')
MINIOPORT=$(kubectl get svc minio-service -n ${NAMESPACE} -o jsonpath='{.spec.ports[0].port}')
export S3_ENDPOINT=${MINIOIP}:$MINIOPORT
export AWS_ENDPOINT_URL=${S3_ENDPOINT}
export AWS_ACCESS_KEY_ID=$AZURE_STORAGEACCOUNT_NAME
export AWS_SECRET_ACCESS_KEY=$AZURE_STORAGEACCOUNT_KEY
export BUCKET_NAME=mybucket
export DOCKER_BASE_URL=docker.io/ritazh # Update this to fit your scenario
export S3_DATA_URL=s3://${BUCKET_NAME}/data/retrain/
export S3_TRAIN_BASE_URL=s3://${BUCKET_NAME}/models
export AWS_REGION=us-east-1
export JOB_NAME=myjob-$(uuidgen | cut -c -5 | tr '[:upper:]' '[:lower:]')
export TF_MODEL_IMAGE=${DOCKER_BASE_URL}/image-retrain-kubecon:1.9-gpu # Retrain image from previous step
export TF_WORKER=3
export MODEL_TRAIN_STEPS=200
# Create a secret for accessing the Minio server
kubectl create secret generic aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \
--from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY} -n ${NAMESPACE}
# Create a user for the workflow
kubectl apply -f workflow/tf-user.yaml -n ${NAMESPACE}
# Submit a workflow to argo
argo submit workflow/model-train-serve-workflow.yaml -n ${NAMESPACE} --serviceaccount tf-user \
-p aws-endpoint-url=${AWS_ENDPOINT_URL} \
-p s3-endpoint=${S3_ENDPOINT} \
-p aws-region=${AWS_REGION} \
-p tf-model-image=${TF_MODEL_IMAGE} \
-p s3-data-url=${S3_DATA_URL} \
-p s3-train-base-url=${S3_TRAIN_BASE_URL} \
-p job-name=${JOB_NAME} \
-p tf-worker=${TF_WORKER} \
-p model-train-steps=${MODEL_TRAIN_STEPS} \
-p namespace=${NAMESPACE} \
-p tf-tensorboard-image=tensorflow/tensorflow:1.7.0 \
-p s3-use-https=0 \
-p s3-verify-ssl=0
# Check status of the workflow
argo list -n ${NAMESPACE}
NAME STATUS AGE DURATION
tf-workflow-s8k24 Running 5m 5m
# Check pods that are created by the workflow
kubectl get pod -n ${NAMESPACE} -o wide -w
# Monitor training from tensorboard
PODNAME=$(kubectl get pod -n ${NAMESPACE} -l app=tensorboard-${JOB_NAME} -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} -n ${NAMESPACE} 6006:6006
# Get logs from the training pod(s)
kubectl logs ${JOB_NAME}-master-0 -n ${NAMESPACE}
# Get public ip of the serving service once we reach the last stage of the workflow
SERVINGIP=$(kubectl get svc inception-${JOB_NAME} -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Get detailed information of the workflow
argo get tf-workflow-s8k24 -n ${NAMESPACE}
Name: tf-workflow-s8k24
Namespace: tfworkflow
ServiceAccount: tf-user
Status: Succeeded
Created: Tue Oct 30 23:44:27 -0700 (26 minutes ago)
Started: Tue Oct 30 23:44:27 -0700 (26 minutes ago)
Finished: Tue Oct 30 23:53:56 -0700 (16 minutes ago)
Duration: 9 minutes 29 seconds
Parameters:
aws-endpoint-url: 40.76.11.177:9012
s3-endpoint: 40.76.11.177:9012
aws-region: us-east-1
tf-model-image: docker.io/ritazh/image-retrain-kubecon:1.9-gpu
s3-data-url: s3://mybucket/data/retrain/
s3-train-base-url: s3://mybucket/models
job-name: myjob-b5e18
tf-worker: 3
model-train-steps: 200
namespace: tfworkflow
tf-tensorboard-image: tensorflow/tensorflow:1.7.0
s3-use-https: 0
s3-verify-ssl: 0
tf-ps: 2
tf-serving-image: elsonrodriguez/model-server:1.6
ks-image: elsonrodriguez/ksonnet:0.10.1
model-name: inception
model-hidden-units: 100
model-batch-size: 100
model-learning-rate: 0.01
model-serving: true
model-serving-servicetype: LoadBalancer
model-serving-ks-url: github.com/kubeflow/kubeflow/tree/master/kubeflow
model-serving-ks-tag: 1f474f30
aws-secret: aws-creds
STEP PODNAME DURATION MESSAGE
tf-workflow-s8k24
--- get-workflow-info
--- tensorboard
--- train-model
--- serve-model tf-workflow-s8k24-3774882124 1m
onExit
--- cleanup tf-workflow-s8k24-1936931737 1m
Run test client against the Serving endpoint to get predictions Using the serving IP from the previous step, run the following:
python serving/inception_client.py --server ${SERVINGIP}:9000 --image test/fbb1.jpeg
# You will see the following output
outputs {
key: "classes"
value {
dtype: DT_STRING
tensor_shape {
dim {
size: 2
}
}
string_val: "fbb"
string_val: "notfbb"
}
}
outputs {
key: "prediction"
value {
dtype: DT_FLOAT
tensor_shape {
dim {
size: 1
}
dim {
size: 2
}
}
float_val: 0.96350902319
float_val: 0.0364910177886
}
}
model_spec {
name: "inception"
version {
value: 1
}
signature_name: "serving_default"
}
Ensure the JupyterHub component is running and the service has a public IP
kubectl get pod -n ${NAMESPACE} -l app=tf-hub
NAME READY STATUS RESTARTS AGE
tf-hub-0 1/1 Running 0 30m
kubectl get svc -n ${NAMESPACE} -l app=tf-hub-lb
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
tf-hub-lb LoadBalancer 10.0.91.30 137.xx.xx.xx 80:31137/TCP 1d
Create an OAuth app on GitHub. Use the external IP of tb-hub-lb.
By default JupyterHub uses dummy authenticator to use plaintext username and password for login. Edit the JupyterHub configmap to enable GitHub OAuth.
kubectl edit configmap jupyterhub-config -n ${NAMESPACE}
# Replace the dummyauthenticator with GitHubOAuthenticator
# c.JupyterHub.authenticator_class = 'dummyauthenticator.DummyAuthenticator'
c.JupyterHub.authenticator_class = GitHubOAuthenticator
c.GitHubOAuthenticator.oauth_callback_url = 'http://137.xx.xx.xx/hub/oauth_callback'
c.GitHubOAuthenticator.client_id = 'GET THIS FROM GITHUB DEVELOPER SETTING'
c.GitHubOAuthenticator.client_secret = 'GET THIS FROM GITHUB DEVELOPER SETTING'
Restart the tf-hub-0 pod
Launch JupyterHub from a browser using the external IP from the tf-hub-lb service. Everyone on the team can now create their own Jupyter Notebook instance after signing in with their github account and selecting the resources they need to create their own Jupyter Notebook instance.