datascience-tools

Various data science tools bundled in a single container: TensorFlow with GPU support, Jupyter, IPython, Scoop, h5py, pandas, scikit, TFLearn, plotly...

MIT License

Datascience tools container

This container was created to support various data science experiments, mainly in the context of Kaggle competitions.

Bundled tools:

  • Based on Ubuntu 16.04
  • Python 3
  • Jupyter
  • TensorFlow (CPU and GPU flavors)
  • Spark driver (set the SPARK_MASTER environment variable to point to your Spark master)
  • Scoop, h5py, pandas, scikit, TFLearn, plotly
  • pyexcel-ods, pydicom, textblob, wavio, trueskill, cytoolz, ImageHash...

Run container:

  • CPU only:

    • create docker-compose.yml
    version: "3"
    services:
      datascience-tools:
        image: flaviostutz/datascience-tools
        ports:
          - 8888:8888
          - 6006:6006
        volumes:
          - /notebooks:/notebooks
        environment:
          - JUPYTER_TOKEN=flaviostutz
    
    • docker-compose up
  • GPU support for TensorFlow:

    • Prepare the host machine with NVIDIA CUDA drivers
    • Install nvidia-docker and nvidia-docker-plugin
      • wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0/nvidia-docker_1.0.0-1_amd64.deb
      • sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
      • For more details, see https://github.com/NVIDIA/nvidia-docker
    • nvidia-docker run -d -v /root:/notebooks -v /root/input:/notebooks/input -v /root/output:/notebooks/output -p 8888:8888 -p 6006:6006 --name jupyter flaviostutz/datascience-tools:latest-gpu
  • If you wish this container to run automatically on host boot, add these lines to /etc/rc.local:

    • cd /root/datascience-tools/run && ./boot.sh >> /var/log/boot-script
    • Change "/root/datascience-tools" to the path where you cloned this repo
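
    For reference, a minimal /etc/rc.local wiring in the lines above might look like this sketch (it assumes the repo was cloned to /root/datascience-tools; adjust the path and log file to your setup):

      #!/bin/sh -e
      # /etc/rc.local -- runs as root at the end of boot; must finish with "exit 0"
      cd /root/datascience-tools/run && ./boot.sh >> /var/log/boot-script 2>&1
      exit 0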

Access:

  • Jupyter Notebook: http://localhost:8888 (enter the JUPYTER_TOKEN value if one was set)
  • TensorBoard: http://localhost:6006

Autorun script

  • When this container starts, it runs:
    • Jupyter Notebook server on port 8888
    • TensorBoard server on port 6006
    • A custom script located at /notebooks/autorun.sh
      • If autorun.sh doesn't exist, it is ignored
      • If it exists, it will be run once every time you start/restart the container
      • You can use this script when running large batch processes on servers that may shut down and restart at random (as happens with AWS Spot Instances), so that when the server comes back up the script can resume the previous work
      • Make sure your job handles partial saves and resuming so that compute time is not wasted
      • On the host OS, you have to run this docker container with "--restart=always" so that it will be started automatically during boot
      • This file can be edited with the Jupyter editor
      • Example script:
        #!/bin/bash
        python test.py
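
      A slightly fuller resume-style autorun.sh might look like the sketch below (train.py and the checkpoint directory are hypothetical; they only illustrate the save/resume idea):

        #!/bin/bash
        # Hypothetical sketch: resume a long-running job after the host restarts
        # (e.g. an AWS Spot Instance coming back up). train.py is assumed to save
        # and reload its own checkpoints from the directory passed to it.
        CHECKPOINT_DIR=/notebooks/output/checkpoints
        mkdir -p "$CHECKPOINT_DIR"
        python /notebooks/train.py --checkpoint-dir "$CHECKPOINT_DIR" >> /notebooks/output/train.log 2>&1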

Build instructions

  • docker build . -f Dockerfile
  • docker build . -f Dockerfile-gpu
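
  To tag the local builds with the image names used elsewhere in this README, -t can be added (a sketch; adjust the tags as needed):

    docker build -t flaviostutz/datascience-tools -f Dockerfile .
    docker build -t flaviostutz/datascience-tools:latest-gpu -f Dockerfile-gpu .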

Tips for development of your own Notebooks

  • A good practice is to store your notebook scripts in a git repository

  • Run the datascience-tools container and map the container volume "/notebooks" to the path where you cloned your git repository on your computer

  • You can edit/save/run the scripts from the web interface (http://localhost:8888) or directly with other tools on your computer. You can commit and push your code to the repository directly (no copying from/to the container is needed because the volume is mapped)

  Example docker-compose.yml mapping a local repository clone to /notebooks:

    version: "3"
    services:
      datascience-tools:
        image: flaviostutz/datascience-tools
        ports:
          - 8888:8888
          - 6006:6006
        volumes:
          - /Users/flaviostutz/Documents/development/flaviostutz/puzzler/notebooks:/notebooks
  • For running in production, create a new image with "FROM flaviostutz/datascience-tools" and add your script files to "/notebooks", so that when you run the container your custom scripts are embedded in it. No volume mapping is needed in this case. During container startup, the script /notebooks/autorun.sh will run if present (a sketch of this approach follows below).
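
  A minimal sketch of that approach (the image name, Dockerfile name and local notebooks/ directory are placeholders):

    # write a small Dockerfile that bakes your scripts into a derived image
    cat > Dockerfile.production <<'EOF'
    FROM flaviostutz/datascience-tools
    COPY notebooks/ /notebooks/
    EOF
    docker build -t myorg/my-notebooks -f Dockerfile.production .
    # no volume mapping needed; /notebooks/autorun.sh (if present) runs at startup
    docker run -d -p 8888:8888 -p 6006:6006 myorg/my-notebooks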

Environment variables

  • JUPYTER_TOKEN - token users must provide to open Jupyter. Defaults to '', so that no token or password is asked of the user

  • SPARK_MASTER - Spark master address. Used if you want to send jobs to an external Spark cluster and still control the whole job from Jupyter Notebook itself.
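
  For example, both variables can be passed with -e when starting the container (the Spark master address below is a placeholder):

    docker run -d -p 8888:8888 -p 6006:6006 \
      -e JUPYTER_TOKEN=mysecret \
      -e SPARK_MASTER=spark://spark-master.example.com:7077 \
      flaviostutz/datascience-tools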