OpenEmbedding is an open source framework for Tensorflow distributed training acceleration.
APACHE-2.0 License
English version | 中文版
OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.
Nowadays, many machine learning and deep learning applications are built based on parameter servers, which are used to efficiently store and update model weights. When a model has a large number of sparse features (e.g., Wide&Deep and DeepFM for CTR prediction), the number of weights easily runs into billions to trillions. In such a case, the tradition synchronization solutions (such as the Allreduce-based solution adopted by Horovod) are unable to achieve high-performance because of massive communication overhead introduced by a tremendous number of sparse features. In order to achieve efficiency for such sparse models, we develop OpenEmbedding, which enhances the parameter server especially for the sparse model training and inference.
Efficiency
Ease-of-use
Adaptability
For models that contain sparse features, it is difficult to speed up using the Allreduce-based framework Horovod. Using both OpenEmbedding and Horovod can get better acceleration effects. In the single 8 GPU scene, the speedup ratio is 3 to 8 times. Many models achieved 3 to 7 times the performance of Horovod.
You can install and run OpenEmbedding by the following steps. The examples show the whole process of training criteo data with OpenEmbedding and predicting with Tensorflow Serving.
NVIDIA docker is required to use GPU in image. The OpenEmbedding image can be obtained from Docker Hub.
# The script "criteo_deepctr_stanalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
docker run --rm --gpus all -v /tmp/criteo:/openembedding/tmp/criteo \
4pdosc/openembedding:latest examples/run/criteo_deepctr_standalone.sh
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v /tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait the model server start.
sleep 5
# Send requests and get predict results.
docker run --rm --network host 4pdosc/openembedding:latest examples/run/criteo_deepctr_restful.sh
# Clear docker.
docker stop serving-example
docker rm serving-example
# Install the dependencies required by OpenEmbedding.
apt update && apt install -y gcc-7 g++-7 python3 libpython3-dev python3-pip
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding
# Install the dependencies required by examples.
apt install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py
# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding
# The script "criteo_deepctr_stanalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait the model server start.
sleep 5
# Send requests and get predict results.
examples/run/criteo_deepctr_restful.sh
# Clear docker.
docker stop serving-example
docker rm serving-example
# Install the dependencies required by OpenEmbedding.
yum install -y centos-release-scl
yum install -y python3 python3-devel devtoolset-7
scl enable devtoolset-7 bash
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding
# Install the dependencies required by examples.
yum install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py
# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding
# The script "criteo_deepctr_stanalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait the model server start.
sleep 5
# Send requests and get predict results.
examples/run/criteo_deepctr_restful.sh
# Clear docker.
docker stop serving-example
docker rm serving-example
The installation usually requires g++ 7 or higher, or a compiler compatible with tf.version.COMPILER_VERSION
. The compiler can be specified by environment variable CC
and CXX
. Currently OpenEmbedding can only be installed on linux.
CC=gcc CXX=g++ pip3 install openembedding
If TensorFlow was updated, you need to reinstall OpenEmbedding.
pip3 uninstall openembedding && pip3 install --no-cache-dir openembedding
A sample program for common usage is as follows.
Create Model
and Optimizer
.
import tensorflow as tf
import deepctr.models import WDL
optimizer = tf.keras.optimizers.Adam()
model = WDL(feature_columns, feature_columns, task='binary')
Transform to distributed Model
and distributed Optimizer
. The Embedding
layer will be stored on the parameter server.
import horovod as hvd
import openembedding.tensorflow as embed
hvd.init()
optimizer = embed.distributed_optimizer(optimizer)
optimizer = hvd.DistributedOptimizer(optimizer)
model = embed.distributed_model(model)
Here, embed.distributed_optimizer
is used to convert the TensorFlow optimizer into an optimizer that supports the parameter server, so that the parameters on the parameter server can be updated. The function embed.distributed_model
is to replace the Embedding
layers in the model and override the methods to support saving and loading with parameter servers. Method Embedding.call
will pull the parameters from the parameter server and the backpropagation function was registered to push the gradients to the parameter server.
Data parallelism by Horovod.
model.compile(optimizer, "binary_crossentropy", metrics=['AUC'],
experimental_run_tf_function=False)
callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),
hvd.callbacks.MetricAverageCallback() ]
model.fit(dataset, epochs=10, verbose=2, callbacks=callbacks)
Export as a stand-alone SavedModel so that can be loaded by TensorFlow Serving.
if hvd.rank() == 0:
# Must specify include_optimizer=False explicitly
model.save_as_original_model('model_path', include_optimizer=False)
More examples as follows.
Embedding
layerdocker build -t 4pdosc/openembedding-base:0.1.0 -f docker/Dockerfile.base .
docker build -t 4pdosc/openembedding:0.0.0-build -f docker/Dockerfile.build .
The compiler needs to be compatible with tf.version.COMPILER_VERSION
(>= 7), and install all prpc dependencies to tools
or /usr/local
, and then run build.sh
to complete the compilation. The build.sh
will automatically install prpc (pico-core) and parameter-server (pico-ps) to the tools
directory.
git submodule update --init --checkout --recursive
pip3 install tensorflow
./build.sh clean && ./build.sh build
pip3 install ./build/openembedding-*.tar.gz
TensorFlow 2
dtype
: float32
, float64
.tensorflow.keras.initializers
RandomNormal
, RandomUniform
, Constant
, Zeros
, Ones
.seed
is currently ignored.tensorflow.keras.optimizers
Adadelta
, Adagrad
, Adam
, Adamax
, Ftrl
, RMSprop
, SGD
.decay
and LearningRateSchedule
are not supported.Adam(amsgrad=True)
is not supported.RMSProp(centered=True)
is not supported.Optimizer
with momentum.tensorflow.keras.layers.Embedding
input_dim
and hash table for unknown input_dim
(2**63 range).embeddings_regularizer
, embeddings_constraint
.tensorflow.keras.Model
Model
and automatically ignore or convert incompatible settings (such as embeddings_constraint
).save
, save_weights
, load_weights
and ModelCheckpoint
.Model
as a stand-alone SavedModel, which can be load by TensorFlow Serving.Model
s in one task.MirroredStrategy
or MultiWorkerMirroredStrategy
is experimental.tf.feature_column.embedding_column
embedding_regularizer
, LearningRateSchedule
and etc.Initializer
and Optimizer
Model
s in one taskCurrently, the interface for persistent memory is experimental. PMem-based OpenEmbedding provides a lightweight checkpointing scheme as well as the comparable performance with its DRAM version. For long-running deep learning recommendation model training, PMem-based OpenEmbedding provides not only an efficient but also a reliable training process.