Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Published by tgaddair over 4 years ago

- Fixed an issue that caused horovodrun to fail during startup. (#1974)

Published by tgaddair over 4 years ago

- Added jsrun support to horovodrun. (#1805)
- horovod.torch API. (#1923)
- horovodrun. (#1808)
- Added initial_lr parameter to LearningRateScheduleCallback, deprecated implicit initialization. (#1933)
- BatchDataLoader. (#1879)
- make_reader API. (#1804)
- Added verbose parameter to SparkBackend. (#1922)
- broadcast_parameters. (#1894)
- fit_on_parquet. (#1826)
- tf.keras optimizers when calling hvd.load_model. (#1935)
- Changed safe_shell_exec to use multiprocessing spawn instead of fork to prevent deadlocks. (#1915)
- KerasEstimator when num_proc is larger than 4. (#1945)
- TorchEstimator. (#1790)
- Removed torchvision from the pytorch extra. (#1899)

Published by tgaddair over 4 years ago

- Added _aggregate_gradients in DistributedOptimizer to support Keras in TensorFlow 2.2. (#1784)
- Added --tcp flag to horovodrun for TCP only communication. (#1744)
- Added HOROVOD_BUILD_ARCH_FLAGS to specify architecture-specific compiler flags. (#1751)
- Added mpi_args to horovod.run.run. (#1787)
- TorchEstimator. (#1750)
- data_type_to_str for SparseVector and DenseVector. (#1780)

Published by tgaddair almost 5 years ago
In version 0.19.0, Horovod adds tighter integration with Apache Spark, including a new high-level Horovod Spark Estimator framework and support for accelerator-aware task-level scheduling in the upcoming Spark 3.0 release. This release also contains experimental new features including a join operation for PyTorch and the ability to launch Horovod jobs programmatically from environments like notebooks using a new interactive run mode.
To bridge the gap between large-scale data processing in Spark and large-scale deep learning training with Horovod, we’re introducing a new open source API called Horovod Spark Estimators.
With Horovod Spark Estimators, you can train your deep neural network directly on your existing Spark DataFrame, leveraging Horovod’s ability to scale to hundreds of workers in parallel without any specialized code for distributed training:
from tensorflow import keras
import tensorflow as tf
import horovod.spark.keras as hvd
from horovod.spark.common.store import HDFSStore

model = keras.models.Sequential()
model.add(keras.layers.Dense(8, input_dim=2))
model.add(keras.layers.Activation('tanh'))
model.add(keras.layers.Dense(1))
model.add(keras.layers.Activation('sigmoid'))
# NOTE: unscaled learning rate
optimizer = keras.optimizers.SGD(lr=0.1)
loss = 'binary_crossentropy'
store = HDFSStore('/user/username/experiments')
keras_estimator = hvd.KerasEstimator(
    num_proc=4,
    store=store,
    model=model,
    optimizer=optimizer,
    loss=loss,
    feature_cols=['features'],
    label_cols=['y'],
    batch_size=32,
    epochs=10)

keras_model = keras_estimator.fit(train_df) \
    .setOutputCols(['predict'])
predict_df = keras_model.transform(test_df)
Horovod Spark Estimators provide a single abstraction — the Estimator — which hides the complexity of gluing Spark DataFrames to a deep learning training script, reading data into a format interpretable by the training framework, and distributing the training using Horovod. The user only needs to provide a model written in the deep learning framework of their choice, and the Estimator will do the work of fitting it to the DataFrame.
After training, the Estimator returns a Transformer representation of the trained model. The model transformer can be used like any Spark ML transformer to make predictions on an input DataFrame, writing them as new columns in the output DataFrame.
Estimators can be used to track experiment history through model checkpointing, hot start retraining, and metric logging (for Tensorboard) using the Estimator Store abstraction. Stores persist all training artifacts including intermediate representations of the training data. Horovod natively supports stores for HDFS and local filesystems.
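As a brief illustration of the Store abstraction, the sketch below constructs stores for both backends; Store.create and LocalStore are taken from Horovod's horovod.spark.common.store module, and the paths are placeholders rather than values from this release note.

from horovod.spark.common.store import Store, LocalStore

# Infer the store type from the path prefix (HDFS here); path is a placeholder.
store = Store.create('hdfs://namenode:8020/user/username/experiments')

# Or construct a local-filesystem store explicitly for single-machine experiments.
local_store = LocalStore('/tmp/horovod-experiments')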
Horovod Spark Estimators are available for Keras (both tf.keras and standalone Keras) and PyTorch, with more frameworks (including pure TensorFlow) coming soon.
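For the PyTorch case, a rough sketch of the analogous estimator is shown below. It reuses the store and train_df from the Keras example above; the input_shapes parameter and the callable loss are assumptions modeled on Horovod's PyTorch estimator examples, not details stated in this announcement, and may differ across versions.

import torch
import torch.nn as nn
import torch.nn.functional as F
import horovod.spark.torch as hvd_torch

torch_model = nn.Sequential(
    nn.Linear(2, 8),
    nn.Tanh(),
    nn.Linear(8, 1),
    nn.Sigmoid())

torch_estimator = hvd_torch.TorchEstimator(
    num_proc=4,
    store=store,                  # same Store object as in the Keras example
    model=torch_model,
    optimizer=torch.optim.SGD(torch_model.parameters(), lr=0.1),
    # Assumed convention: loss is a callable taking (output, label).
    loss=lambda output, target: F.binary_cross_entropy(output.squeeze(), target.float()),
    input_shapes=[[-1, 2]],       # assumed shape of the 'features' column
    feature_cols=['features'],
    label_cols=['y'],
    batch_size=32,
    epochs=10)

torch_transformer = torch_estimator.fit(train_df)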
Apache Spark 3.0 introduces a new accelerator-aware scheduling capability, allowing a production ETL job running on CPUs to hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline, breaking down the barriers between ETL and continuous model training.
Horovod users can now request GPU resources directly from their Spark application, without assuming which tasks should map to which GPUs:
import horovod.spark
import tensorflow as tf
from tensorflow.keras import backend as K

def train():
    from horovod.spark.task import get_available_devices
    import horovod.tensorflow.keras as hvd
    hvd.init()
    # Pin this worker to the GPU that Spark assigned to its task.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = get_available_devices()[0]
    K.set_session(tf.Session(config=config))
    ...

horovod.spark.run(train)
Check out the keras_spark3_rossmann.py script for a complete example.
Spark 3.0 is currently in preview release, with the full release forthcoming.
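To make the hand-off concrete, the Spark application itself requests GPUs through Spark 3.0's resource scheduling settings. The snippet below is a sketch using the standard configuration keys; the GPU amounts and the discovery-script path are placeholders for your cluster, not values from this release.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('horovod-spark-gpu')
         # Each executor advertises 4 GPUs (placeholder amount).
         .config('spark.executor.resource.gpu.amount', '4')
         # One GPU per task, so each Horovod worker gets a dedicated device.
         .config('spark.task.resource.gpu.amount', '1')
         # Script that reports the GPU addresses available on each executor host.
         .config('spark.executor.resource.gpu.discoveryScript', '/path/to/getGpusResources.sh')
         .getOrCreate())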
The ability for different workers to train on a different number of batches in each epoch has been one of the most requested features for Horovod. This problem frequently arises when a dataset doesn't split evenly among all workers, forcing the user to truncate the extra examples or risk deadlock during training.
With the new join operation, users no longer need to worry about how evenly their dataset divides when training. Just add a join step at the end of each epoch, and Horovod will train on any extra batches without causing the waiting workers to deadlock:
for epoch in range(epochs):
    for batch in dataset:
        ...
    hvd.join(device=hvd.local_rank())
The join operation is currently supported in Horovod for PyTorch, with support for TensorFlow and Apache MXNet coming soon.
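Since the generic loop above elides the training step, here is a more concrete PyTorch sketch. The tiny model, synthetic data, and per-rank batch counts are illustrative placeholders; only hvd.join(device=hvd.local_rank()) comes from the example above.

import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()

model = nn.Linear(10, 1)
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())
loss_fn = nn.MSELoss()

# Pretend each worker's shard yields a different number of batches.
num_batches = 8 + hvd.rank()
epochs = 2

for epoch in range(epochs):
    for _ in range(num_batches):
        data, target = torch.randn(32, 10), torch.randn(32, 1)
        optimizer.zero_grad()
        loss_fn(model(data), target).backward()
        optimizer.step()
    # Workers that ran out of batches wait here without deadlocking the others;
    # the device argument selects the GPU used during the join (as in the snippet above).
    hvd.join(device=hvd.local_rank())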
With horovod.spark.run, Horovod added support for launching training jobs programmatically by defining Python functions that are executed on Spark executors. With the new interactive run mode, we created a similar API that can launch training jobs on any visible hosts, much like the command-line horovodrun tool:
from horovod.run import run as hvdrun

def train():
    import horovod.tensorflow as hvd
    hvd.init()
    ...

results = hvdrun(train, np=2)
Interactive mode supports most of the functionality provided by horovodrun. See the API for a complete reference.
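As a usage sketch, the snippet below passes an argument to the training function and collects its return values; the args parameter and the list-of-per-worker-results behavior are assumptions about the interactive API, not details stated above.

from horovod.run import run as hvdrun

def train(learning_rate):
    import horovod.tensorflow as hvd
    hvd.init()
    # ... build and train the model, scaling learning_rate by hvd.size() ...
    return hvd.rank()

# Assumed behavior: args are forwarded to each worker and results collects
# each worker's return value (here, its rank).
results = hvdrun(train, args=(0.001,), np=2)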
- hvd.broadcast when building with HOROVOD_GPU_BROADCAST=NCCL. (#1579)
- Fixed hvd.allgather to work with CUDA tensors when building with HOROVOD_GPU_ALLGATHER=MPI. (#1480)
- tf.keras. (#1588)
- horovodrun. (#1607)