This repo is for beginners who want to learn and use Submarine.
APACHE-2.0 License
Submarine is a new subproject of Apache Hadoop.
Submarine is a project that allows infrastructure engineers and data scientists to run unmodified TensorFlow or PyTorch programs on YARN or Kubernetes.
Goals of Submarine:
There is no complete, easy-to-understand example for beginners. Submarine supports many open-source infrastructures, and deploying each runtime environment is hard for engineers, not to mention data scientists.
This repo aims to let users easily deploy container orchestrators (such as Hadoop YARN and k8s) in Docker containers, and to provide a complete distributed deep learning example for each runtime along with a step-by-step tutorial for beginners.
A fast and easy way to deploy Submarine on your laptop.
With just a few clicks, you are ready to experiment and to run a complete Submarine experiment.
mini-submarine includes:
# Option 1: build the image yourself
docker build --tag hello-submarine ./mini-submarine
docker run -it -h submarine-dev --name mini-submarine --net=bridge --privileged -P hello-submarine /bin/bash
# Option 2: pull the prebuilt image from Docker Hub
docker pull pingsutw/hello-submarine
docker run -it -h submarine-dev --name mini-submarine --net=bridge --privileged -P pingsutw/hello-submarine /bin/bash
pwd # /home/yarn/submarine
# activate the Python virtual environment
. ./venv/bin/activate
# change to the tests directory
cd ..
cd tests
# run locally
python run_deepfm.py -conf deepfm.json -task train
python run_deepfm.py -conf deepfm.json -task evaluate
# Model metrics : {'auc': 0.64110434, 'loss': 0.4406755, 'global_step': 12}
# run in distributed mode on YARN
export SUBMARINE_VERSION=0.6.0-SNAPSHOT
export SUBMARINE_HADOOP_VERSION=2.9
export SUBMARINE_JAR=/opt/submarine-dist-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}/submarine-dist-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar
java -cp $(${HADOOP_COMMON_HOME}/bin/hadoop classpath --glob):${SUBMARINE_JAR}:${HADOOP_CONF_PATH} \
org.apache.submarine.client.cli.Cli job run --name deepfm-job-001 \
--framework tensorflow \
--verbose \
--input_path "" \
--num_workers 2 \
--worker_resources memory=2G,vcores=4 \
--num_ps 1 \
--ps_resources memory=2G,vcores=4 \
--worker_launch_cmd "myvenv.zip/venv/bin/python run_deepfm.py -conf=deepfm_distributed.json" \
--ps_launch_cmd "myvenv.zip/venv/bin/python run_deepfm.py -conf=deepfm_distributed.json" \
--insecure \
--conf tony.containers.resources=../submarine/myvenv.zip#archive,${SUBMARINE_JAR},deepfm_distributed.json,run_deepfm.py
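To scale the job or to check that `SUBMARINE_JAR` points at a real file, it helps to see how the pieces of the command fit together. The Python sketch below rebuilds the JAR path from the two version variables and assembles the same `Cli` arguments as the command above. The helper names (`submarine_jar_path`, `tony_resources`, `deepfm_job_args`) are hypothetical and not part of Submarine. It also assumes TonY's convention that an entry in `tony.containers.resources` ending in `#archive` is unpacked inside each container, which would explain why the launch commands run `myvenv.zip/venv/bin/python`.

```python
import os
import shlex
from typing import List


def submarine_jar_path(version: str, hadoop_version: str, root: str = "/opt") -> str:
    # Mirrors the SUBMARINE_JAR export: the dist directory name appears twice.
    dist = f"submarine-dist-{version}-hadoop-{hadoop_version}"
    jar = f"submarine-all-{version}-hadoop-{hadoop_version}.jar"
    return os.path.join(root, dist, dist, jar)


def tony_resources(archives: List[str], files: List[str]) -> str:
    # Comma-separated list of files shipped to each container.
    # Assumption: a "#archive" suffix asks TonY to unpack that entry.
    return ",".join([f"{a}#archive" for a in archives] + files)


def deepfm_job_args(name: str, num_workers: int, num_ps: int, jar: str) -> List[str]:
    # Arguments for org.apache.submarine.client.cli.Cli, copied from the README command.
    launch_cmd = "myvenv.zip/venv/bin/python run_deepfm.py -conf=deepfm_distributed.json"
    resources = tony_resources(
        archives=["../submarine/myvenv.zip"],
        files=[jar, "deepfm_distributed.json", "run_deepfm.py"],
    )
    return [
        "org.apache.submarine.client.cli.Cli", "job", "run",
        "--name", name,
        "--framework", "tensorflow",
        "--verbose",
        "--input_path", "",
        "--num_workers", str(num_workers),
        "--worker_resources", "memory=2G,vcores=4",
        "--num_ps", str(num_ps),
        "--ps_resources", "memory=2G,vcores=4",
        "--worker_launch_cmd", launch_cmd,
        "--ps_launch_cmd", launch_cmd,
        "--insecure",
        "--conf", f"tony.containers.resources={resources}",
    ]


if __name__ == "__main__":
    # Fall back to the versions exported in the README when the variables are unset.
    version = os.environ.get("SUBMARINE_VERSION", "0.6.0-SNAPSHOT")
    hadoop_version = os.environ.get("SUBMARINE_HADOOP_VERSION", "2.9")
    jar = submarine_jar_path(version, hadoop_version)
    print("JAR found:" if os.path.isfile(jar) else "JAR missing:", jar)
    # Shell-quoted arguments to paste after `java -cp ...` (here with 4 workers).
    print(" ".join(shlex.quote(arg) for arg in deepfm_job_args("deepfm-job-002", 4, 1, jar)))
```

The sketch only prints a command line; the `java -cp` class path still has to come from `hadoop classpath --glob` as in the command above.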
Deploy all components on K8s, including:
# install kind
curl -Lo ./kind "https://github.com/kubernetes-sigs/kind/releases/download/v0.7.0/kind-$(uname)-amd64"
chmod +x ./kind
mv ./kind /some-dir-in-your-PATH/kind
# install kubectl (Linux, amd64)
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
kubectl version --client
# create a Kubernetes cluster with kind
kind create cluster --image kindest/node:v1.15.6 --name k8s-submarine
kubectl create namespace submarine
# set submarine as the default namespace
kubectl config set-context --current --namespace=submarine
# install Helm (Debian/Ubuntu)
curl https://helm.baltorepo.com/organization/signing.asc | sudo apt-key add -
sudo apt-get install apt-transport-https --yes
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm
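# deploy Submarine with the local Helm chart
helm install submarine ./helm-charts/submarine
# forward the submarine-server service to your machine (runs in the foreground)
kubectl port-forward svc/submarine-server 8080:8080
# open the workbench at http://localhost:8080
# Account: admin
# Password: admin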
# submit an example TensorFlow MNIST experiment through the REST API
curl -X POST -H "Content-Type: application/json" -d '
{
  "meta": {
    "name": "tf-mnist-json",
    "namespace": "submarine",
    "framework": "TensorFlow",
    "cmd": "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150",
    "envVars": {
      "ENV_1": "ENV1"
    }
  },
  "environment": {
    "image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0"
  },
  "spec": {
    "Ps": {
      "replicas": 1,
      "resources": "cpu=1,memory=512M"
    },
    "Worker": {
      "replicas": 1,
      "resources": "cpu=1,memory=512M"
    }
  }
}
' http://127.0.0.1:32080/api/v1/experiment
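The same request can be sent from Python with only the standard library, which is handy when submitting several variants of the experiment. This is a sketch: `build_experiment_spec` and `submit_experiment` are hypothetical helpers, the endpoint and payload are copied from the curl call above, and it assumes the server answers with a JSON body.

```python
import json
import urllib.request

# Endpoint copied from the curl example; adjust host and port to your setup.
ENDPOINT = "http://127.0.0.1:32080/api/v1/experiment"


def build_experiment_spec(name: str, ps_replicas: int = 1, worker_replicas: int = 1) -> dict:
    """Return the tf-mnist experiment spec from the curl example with a configurable name and replica counts."""
    resources = "cpu=1,memory=512M"
    return {
        "meta": {
            "name": name,
            "namespace": "submarine",
            "framework": "TensorFlow",
            "cmd": "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log "
                   "--learning_rate=0.01 --batch_size=150",
            "envVars": {"ENV_1": "ENV1"},
        },
        "environment": {"image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0"},
        "spec": {
            "Ps": {"replicas": ps_replicas, "resources": resources},
            "Worker": {"replicas": worker_replicas, "resources": resources},
        },
    }


def submit_experiment(spec: dict, endpoint: str = ENDPOINT) -> dict:
    """POST the spec as JSON and decode the response (assumed to be JSON)."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Once the server is reachable, `submit_experiment(build_experiment_spec("tf-mnist-json"))` sends the same payload as the curl command; changing `name` or `worker_replicas` produces a variant of the experiment.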
TBD