Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
License: Other
Published by maxhgerlach over 1 year ago
- Added `get_local_and_global_gradients` to `PartialDistributedGradientTape` to retrieve local and non-local gradients separately. (#3859)
- Ensured that `tf.logical_and` within allreduce `tf.cond` runs on CPU. (#3885)
- The `CUDA_VISIBLE_DEVICES` environment variable is no longer passed to remote nodes. (#3865)

Published by maxhgerlach over 1 year ago
- Added the `PartialDistributedOptimizer` API. (#3738)
- Added the `HOROVOD_SPARK_USE_LOCAL_RANK_GPU_INDEX` environment variable to ignore GPU device indices assigned by Spark and always use the local rank GPU device in Spark estimators. (#3737)
- Added support for `prescale_factor` and `postscale_factor` and moved averaging into the Horovod backend. (#3815)
- `MPI_GPUAllgather`. (#3727)
- `SyncBatchNormalization._moments()`. (#3775)
- Fixed handling of `tf.IndexedSlices` types when scaling local gradients. (#3786)
- `MEMCPY_IN_FUSION_BUFFER` timeline event for reducescatter. (#3808)
- `tf.IndexedSlices`. (#3813)
- Fixed `DistributedOptimizer` with Keras 2.11+. (#3822)

Published by EnricoMi about 2 years ago
- Added `register_local_var` functionality to distributed optimizers and local gradient aggregators. (#3695)
- `BroadcastGlobalVariablesCallback`. (#3703)
- Enabled use of the `ncclAvg` op for NCCL allreduces. (#3646)
- Added support for additional reduction operations for `allreduce` (min, max, product). (#3660)
- Added 2D torus `allreduce` using NCCL. (#3608)
- Added `int8` and `uint8` support for `allreduce` and `grouped_allreduce` in TensorFlow. (#3649)
- Added batched memory copies in `GPUAllgather`. (#3590)
- Added batched memory copies in `GPUReducescatter`. (#3621)
- Added `hvd.grouped_allgather()` and `hvd.grouped_reducescatter()` operations. (#3594)
- Added `register_local_source` and `use_generic_names` functionality to `DistributedGradientTape`. (#3628)
- Added the `PartialDistributedGradientTape()` API for model parallel use cases. (#3643)
- Added `reader_worker_count` and `reader_pool_type`. (#3612)
- Added `transformation_edit_fields` and `transformation_removed_fields` param for `EstimatorParams`. (#3651)
- `hvd.grouped_allreduce()`. (#3594)
- `alltoall`. (#3654)
- Changed the default Petastorm reader pool from `process` to `thread` for lower memory usage. (#3665)
- Use `gather` rather than `allgather`. (#3633)
- Use `packaging.version` instead of `distutils` version classes. (#3700)
- Deprecated `shuffle_buffer_size` from `EstimatorParams`; use `shuffle` to enable or disable shuffling. (#3665)
- `backward_passes_per_step > 1`. (#3631)
- Fixed `FuseResponses()` on `BATCHED_D2D_PADDING` edge cases for Reducescatter and/or ROCm. (#3621)
- Raise `HorovodInternalError` rather than `RuntimeError`. (#3594)
- Fixed finding `nvcc` (if not in `$PATH`) with older versions of CMake. (#3682)
- Fixed `reducescatter()` and `grouped_reducescatter()` to raise clean exceptions for scalar inputs. (#3699)
- `torch/` directory to be hipified. (#3588)
- `FindPytorch.cmake`. (#3593)

Published by EnricoMi over 2 years ago
- Added the `hvd.reducescatter()` operation with implementations in NCCL, MPI, and Gloo. (#3299, #3574)
- `multiprocessing`. (#3580)
- `op` API. (#3299)
- Use `zeros_like` to avoid an infinite gradient not being correctly cleared up. (#3505)
- Made `HorovodVersionMismatchError` subclass `ImportError` instead of just a standard `Exception`. (#3549)
- Exit (`--check-build`) via `sys.exit` to flush stdout. (#3272)
- Use `env` to set environment vars in remote shell. (#3489)
- Deprecated the `average` argument of allreduce functions. (#3299)
- Passing a comma-separated string to `--network-interface` is deprecated; pass `--network-interface` multiple times or use `--network-interfaces` instead. (#3506)
- `network_interface` with a comma-separated string is deprecated; use `network_interfaces` with `Iterable[str]` instead. (#3506)
- Fixed `transform_spec` for the Petastorm datamodule. (#3543)
- `save_hyperparameters()`. (#3527)
- `HostDiscoveryScript`. (#3490)
- `tensorflow_accelerator_device_info`. (#3513)
- `AggregationHelperEager`. (#3496)
- Fixed `NotFoundError` in TF `AggregationHelper`. (#3499)

Published by EnricoMi over 2 years ago
Published by tgaddair over 2 years ago
Published by EnricoMi over 2 years ago
- Fixed the build when `nvcc` is not in `$PATH`. (#3444)

Published by tgaddair over 2 years ago
- Moved the released Docker images `horovod` and `horovod-cpu` to Ubuntu 20.04 and Python 3.8. (#3393)
- Removed the `h5py<3` constraint, as it is no longer needed for TensorFlow > 2.5.0. (#3301)
- `tensorflow.python.keras` with tensorflow 2.6.0+. (#3403)
- `pytorch_lightning_mnist.py` example. (#3245, #3290)
- Fixed `hvd.barrier()` tensor queue management. (#3300)
- `limit_train_batches` and `limit_val_batches`. (#3237)
- Fixed `hvd.elastic.state` losing some indices of processed samples when nodes dropped. (#3143)
- `num_workers=0` and `num_hosts=None`. (#3210)
- Fixed `dirpath` typo. (#3204)

Published by tgaddair about 3 years ago
Added process sets to concurrently run collective operations on subsets of Horovod processes in TensorFlow, PyTorch, and MXNet. (#2839, #3042, #3043, #3054, #3083, #3090)
Added XLA support for Allreduce via `tf.function(jit_compile=True)`. (#3053) (see the sketch after this list)
Added fused buffer scaling and unpack/pack kernels on GPU. (#2973)
Added support for NCCL on CUDA 11.4. (#3182)
Added fp16 compression for MXNet. (#2987)
Added terminate_on_nan flag to Spark Lightning estimator. (#3088)
Added barrier() API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. (#3139)
Added params for customizing Tensorboard callback. (#3153)
Added `hvd.cross_rank()` for keras. (#3008)
Implemented more asynchronous dependency handling on GPU. (#2963)
Ray: RayExecutor will now use the current placement group instead of always creating a new one. (#3134)
Lightning: turned off shuffling for validation dataset. (#2974)
Extended `hvd.join()` to return the last rank that joined. (#3097)
Fixed Horovod develop/editable install mode and incremental builds. (#3074)
Estimator/Lightning: use lightning datamodule. (#3084)
Fixed Horovod Spark StringType and numpy type mapping issue. (#3146)
Fixed error in Keras LearningRateScheduler. (#3135)
Fixed bug in Lightning Profiler on Ray. (#3122)
Fixed torch op lazy release to prevent OOM in elastic training. (#3110)
Lightning: Fixed usage of the checkpoint callback. (#3186)
Fixed MPICH support to use Intel MPI's implementation. (#3148)
Fixed race condition in PyTorch async dataloader. (#3120)
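Two of the additions above are easiest to see in code. Below is a minimal sketch (not the library's own example), assuming a 4-process GPU TensorFlow job; per the docs, the XLA path additionally requires setting `HOROVOD_ENABLE_XLA_OPS=1`:

```python
# Minimal sketch: process sets + XLA-compiled allreduce (assumes 4 ranks).
import os
os.environ["HOROVOD_ENABLE_XLA_OPS"] = "1"  # opt in to Horovod's XLA ops

import tensorflow as tf
import horovod.tensorflow as hvd

# Process sets let collectives run concurrently on subsets of ranks.
even_set = hvd.ProcessSet([0, 2])
odd_set = hvd.ProcessSet([1, 3])
hvd.init(process_sets=[even_set, odd_set])

@tf.function(jit_compile=True)  # XLA-compiled Allreduce (#3053)
def xla_mean(x):
    return hvd.allreduce(x, op=hvd.Average)

x = tf.ones([4]) * hvd.rank()
print(xla_mean(x))  # averaged across all ranks

# Allreduce only among the even ranks.
if hvd.rank() % 2 == 0:
    y = hvd.allreduce(x, process_set=even_set)
```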
Published by tgaddair over 3 years ago
Estimator: added support for loading data from S3, GCS, ADLS, and other remote filesystems. (#2927)
Estimator: added custom Spark data loader interface. (#2938)
LightningEstimator: added support to supply a logger and associated parameter to control the frequency of logging. (#2926)
Estimator: added check to ensure all ranks have the same device type. (#2942)
Changed behavior to use TensorBoardLogger only as a fallback when a logger is not supplied. (#2926)
Ray: disabled capturing child tasks in placement group. (#2920)
Published by tgaddair over 3 years ago
Added pytorch_lightning spark estimator which enables training pytorch_lightning models. (#2713)
Added NVTX tracing hooks for profiling with Nsight Systems. (#2723)
Added a generic `num_workers` API for `RayExecutor`. (#2870)
Supports Ray Client without code changes. (#2882)
Supports in-memory cache option for Keras Estimator. (#2896)
Added FP16 support for GPU tensors in MXNet. (#2915)
Added response caching for allgather operations. (#2872)
Estimator: added petastorm `reader_pool_type` to the constructor. (#2903)
Changed `alltoall` to return the received splits as a second return value if non-uniform splits are sent. (#2631) (see the sketch after this list)
Changed `RayExecutor` to use Ray Placement Groups for worker colocation. (#2824)
Changed in-memory dataloader usage for Torch Estimator with the petastorm v0.11.0 release. (#2896)
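For the non-uniform case in the `alltoall` change above, a minimal sketch in PyTorch (assuming a job with exactly two workers; values chosen for illustration):

```python
# Minimal sketch: hvd.alltoall returns the received splits as a second
# value when non-uniform splits are sent (#2631). Assumes 2 workers.
import torch
import horovod.torch as hvd

hvd.init()

tensor = torch.arange(4) + 10 * hvd.rank()  # rank 0: [0..3], rank 1: [10..13]
splits = torch.tensor([1, 3])               # 1 element to rank 0, 3 to rank 1

gathered, received_splits = hvd.alltoall(tensor, splits=splits)
# rank 0: gathered = [0, 10],               received_splits = [1, 1]
# rank 1: gathered = [1, 2, 3, 11, 12, 13], received_splits = [3, 3]
```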
Published by tgaddair almost 4 years ago
Added support for `backward_passes_per_step > 1` for TF Keras graph mode. (#2346) (see the sketch after this list)
Added support for backward_passes_per_step > 1 for TF Keras eager execution. (#2371)
Added support for backward_passes_per_step > 1 for TF LegacyOptimizer in graph mode. (#2401)
Added grouped allreduce to enable more efficient tensor fusion and deterministic training. (#2453)
Added support for specifying `op` and `compression` in `horovod.tensorflow.keras.allreduce()`. (#2423)
Added support for batched D2D memcopy kernel on GPU. (#2435)
Added schema inference in Spark Estimator without sampling. (#2373)
Added Store.create("dbfs:/")
mapping to DBFSLocalStore("/dbfs/...")
. (#2376)
Changed Keras callbacks to require parameter `initial_lr` of `LearningRateScheduleCallback` and `LearningRateWarmupCallback`. (#2459)
Changed default cycle time from 5ms to 1ms and fusion threshold from 64MB to 128MB. (#2468)
Fixed support for TensorFlow v2.4.0. (#2381)
Fixed averaging using CUDA half2 implementation for one-element half buffers. (#2375)
Fixed `HOROVOD_THREAD_AFFINITY` when using oneCCL. (#2350)
Added timeout to SSH check in horovodrun to prevent hanging. (#2448)
Added `HOROVOD_GLOO_TIMEOUT_SECONDS` value to error messages. (#2436)
Fixed race condition in dynamic timeline API. (#2341)
Fixed `--log-hide-timestamp` to apply to driver logs with Gloo. (#2388)
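A minimal sketch of the `backward_passes_per_step` support added above, for `horovod.tensorflow.keras` (the model and learning rate are illustrative): gradients are aggregated locally for N batches before each allreduce, reducing communication frequency.

```python
# Minimal sketch: aggregate gradients locally over 4 batches per allreduce.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())

# Gradients are accumulated for 4 steps, then allreduced and applied once.
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)

model.compile(optimizer=opt, loss="mse")
```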
Published by tgaddair about 4 years ago
Added Databricks storage `DBFSLocalStore` and support for GPU-aware scheduling to horovod.spark Estimator. (#2234)
Added ElasticSampler and PyTorch Elastic ImageNet example. (#2297)
Added ability to dynamically start and stop timeline programmatically. (#2215)
Added support for Gloo on macOS. (#2254)
Exposed name argument to TensorFlow allreduce operation. (#2325)
Added option to strip outer name scope from Horovod ops in TensorFlow. (#2328)
Fixed usage of VERBOSE=1 when setting custom MAKEFLAGS. (#2239)
Fixed bugs in Keras Elastic Callback classes. (#2289)
Fixed RelWithDebInfo build and made it the default with -O3 optimizations. (#2305)
Fixed usage of tf.cond in TensorFlow alltoall gradient. (#2327)
Fixed allreduce averaging for TF IndexedSlices in ROCm path. (#2279)
Included stdexcept to handle certain compilers / frameworks that don't include it already. (#2238)
Fixed Debug builds by setting compiler options based on CMake build type. (#2263)
Skipped launching zero-sized send/recvs for NCCLAlltoall. (#2273)
Fixed missing run in tf keras elastic mode. (#2272)
Fixed loss function in TensorFlow2 elastic synthetic benchmark. (#2265)
Fixed usage of HOROVOD_MIXED_INSTALL env var in alltoall tests. (#2266)
Removed keras requirement from Ray example. (#2262)
Published by tgaddair about 4 years ago
Elastic training enables Horovod to scale up and down the number of workers dynamically at runtime, without requiring a restart or resuming from checkpoints saved to durable storage. With elastic training, workers can come and go from the Horovod job without interrupting the training process.
Support for auto-scaling can be added to any existing Horovod script with just a few modifications:
1. Decorate retryable train functions with `@hvd.elastic.run`.
2. Track state that needs to be kept in sync across workers in an `hvd.elastic.State` object.

Here's an example for PyTorch:
```python
import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd

hvd.init()

torch.cuda.set_device(hvd.local_rank())

model = ...
dataset = ...

@hvd.elastic.run
def train(state):
    for state.epoch in range(state.epoch, args.epochs + 1):
        dataset.set_epoch(state.epoch)
        dataset.set_batch_idx(state.batch_idx)
        for state.batch_idx, (data, target) in enumerate(dataset):
            state.optimizer.zero_grad()
            output = state.model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            state.optimizer.step()
            state.commit()

optimizer = optim.SGD(model.parameters(), lr * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

def on_state_reset():
    # adjust learning rate on reset
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr * hvd.size()

state = hvd.elastic.TorchState(model, optimizer, epoch=1, batch_idx=0)
state.register_reset_callbacks([on_state_reset])

train(state)
```
Run using `horovodrun` by specifying the minimum and maximum number of worker processes, as well as a "host discovery script" that will be used to find available workers to add at runtime:

```
$ horovodrun -np 8 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.sh python train.py
```
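The discovery script can be any executable that prints one available host per line in the form `<hostname>:<slots>`. A minimal sketch in Python, where `list_available_hosts()` is a hypothetical stand-in for a query to your cluster manager:

```python
#!/usr/bin/env python
# discover_hosts.py -- minimal sketch of a host discovery script.
# Horovod runs this periodically and expects one "<hostname>:<slots>"
# line per available host on stdout.

def list_available_hosts():
    # Hypothetical helper: replace with a query to your cluster manager
    # or cloud API returning (hostname, slot_count) pairs.
    return [("host-1", 4), ("host-2", 4)]

if __name__ == "__main__":
    for hostname, slots in list_available_hosts():
        print(f"{hostname}:{slots}")
```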
Elastic Horovod is supported natively with Spark auto-scaling using the `hvd.spark.run_elastic` API.
For more details, see Elastic Horovod.
Ray is a distributed execution framework that makes it easy to provision and scale distributed applications, and can now be used to execute Horovod jobs without needing to coordinate the workers by hand:
```python
import ray
import horovod.torch as hvd
from horovod.ray import RayExecutor

# Start the Ray cluster or attach to an existing Ray cluster
ray.init()

# Start num_hosts * num_slots actors on the cluster
setting = RayExecutor.create_settings(timeout_s=100)
executor = RayExecutor(
    setting, num_hosts=num_hosts, num_slots=num_slots, use_gpu=True)

# Launch the Ray actors on each machine
# This will launch `num_slots` actors on each machine
executor.start()

# Using the stateless `run` method, a function can take in any args or kwargs
def train_fn():
    hvd.init()
    # Train the model on each worker here
    ...

# Execute the function on all workers at once
results = executor.run(train_fn)

executor.shutdown()
```
Horovod now also integrates with Ray Tune to scale up your hyperparameter search jobs. Check out the example here.
For more details, see Horovod on Ray.
The all-to-all collective can be described as a combination of a scatter and gather, where each worker will scatter a tensor to each worker, while also gathering scattered data from other workers. This type of collective communication can arise in model-parallel training strategies.
The `hvd.alltoall` function takes the form `hvd.alltoall(tensor, splits=None)`, where `tensor` is a multi-dimensional tensor of data to be scattered and `splits` is an optional 1D tensor of integers with length equal to the number of workers, describing how to split and distribute `tensor`. `splits` is applied along the first dimension of `tensor`. If `splits` is not provided, an equal splitting is assumed, where the first dimension is divided by the number of workers.
The implementation supports TensorFlow, PyTorch, and MXNet using the MPI backend, the CUDA-aware MPI backend via `HOROVOD_GPU_ALLTOALL=MPI`, and the NCCL backend via `HOROVOD_GPU_ALLTOALL=NCCL` / `HOROVOD_GPU_OPERATIONS=NCCL`.
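A minimal sketch in PyTorch (assuming two workers; the equivalent call exists for TensorFlow and MXNet):

```python
# Minimal sketch: alltoall with the default equal splitting, 2 workers.
import torch
import horovod.torch as hvd

hvd.init()

# Each worker contributes 4 elements; with no `splits`, the first dimension
# is divided equally, so 2 elements go to rank 0 and 2 to rank 1.
tensor = torch.arange(4) + 10 * hvd.rank()  # rank 0: [0..3], rank 1: [10..13]
result = hvd.alltoall(tensor)
# rank 0: [0, 1, 10, 11]    rank 1: [2, 3, 12, 13]
```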
We've added a `gradient_predivide_factor` parameter in the `DistributedOptimizer`, the purpose of which is to enable splitting the averaging before and after the allreduce. This can be useful in managing the numerical range for mixed precision computations. The `gradient_predivide_factor` is applied as follows:
```
If op == Average, gradient_predivide_factor splits the averaging
before and after the sum. Gradients are scaled by
1.0 / gradient_predivide_factor before the sum and
gradient_predivide_factor / size after the sum.
```
To facilitate this, additional arguments (`prescale_factor` and `postscale_factor`) have been added to the basic `hvd.allreduce` functions, enabling the definition of multiplicative factors to scale the tensors before and after the allreduce, respectively. For efficiency, the pre- and post-scaling is implemented in the Horovod backend on the fused tensor buffer rather than through framework-level operations. For GPU, this required a CUDA kernel implementation to scale the GPU buffer, which in turn required adding compilation of CUDA code to the current build infrastructure.
As an additional general benefit of these changes, gradient averaging in the optimizer can now be carried out within the Horovod backend on the fused tensor buffer using the `postscale_factor` argument, rather than on a tensor-by-tensor basis at the framework level, decreasing the overhead of each allreduce call.
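A minimal sketch of both knobs in PyTorch (the model and factor values are illustrative, not from the release notes):

```python
# Minimal sketch: gradient predivide in the optimizer, and the equivalent
# pre/post scaling on a raw allreduce. With op == Average and
# gradient_predivide_factor = 2.0, gradients are scaled by 1/2.0 before
# the sum and by 2.0/size after it.
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
opt = hvd.DistributedOptimizer(
    opt,
    named_parameters=model.named_parameters(),
    gradient_predivide_factor=2.0,
)

# The same scaling expressed directly with hvd.allreduce:
t = torch.ones(4)
avg = hvd.allreduce(t, op=hvd.Sum,
                    prescale_factor=1.0 / 2.0,
                    postscale_factor=2.0 / hvd.size())
```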
CMake, previously used to compile the optional Gloo controller, is now required to install Horovod. This change introduces a number of exciting benefits for Horovod developers and users:
- Taking advantage of `find_package` modules provided by CMake for MPI, CUDA, etc. to better handle a range of environment configurations
- Simplified build logic (less custom code in `setup.py`)

Added bare-metal elastic mode implementation to enable auto-scaling and fault tolerance. (#1849)
Added Elastic Horovod support for Spark auto-scaling. (#1956)
Added All-to-All operation for TensorFlow, PyTorch, and MXNet. (#2143)
Added support for `gradient_predivide_factor` and averaging in Horovod backend. (#1949)
Added NCCL implementation of the allgather operation. (#1952)
Added `HOROVOD_GPU_OPERATIONS` installation variable to simplify enabling NCCL support for all GPU operations. (#1960)
Added TensorFlow implementation of `SyncBatchNormalization` layer. (#2075)
Added `hvd.is_initialized()` method. (#2020)
Added `hvd.allgather_object` function for TensorFlow, PyTorch, and MXNet. (#2166)
Added `hvd.broadcast_object` function for MXNet. (#2122)
Added `label_shapes` parameter to KerasEstimator and TorchEstimator. (#2140)
Added optional `modelCheckPoint` callback to KerasEstimator params. (#2124)
Added `ssh_identity_file` argument to `horovodrun`. (#2201)
Added support for `horovodrun` on `kubeflow/mpi-job`. (#2199)
Added Ray integration. (#2218)
Moved `horovod.run.runner.run` to `horovod.run`. (#2099)
`HOROVOD_THREAD_AFFINITY` accepts multiple values, one for every Horovod rank. (#2131)
Migrated build system for native libraries to CMake. (#2009)
Dropped support for Python 2. (#1954)
Dropped support for TensorFlow < 1.15. (#2169)
Dropped support for PyTorch < 1.2. (#2086)
Fixed MXNet allgather implementation to correctly handle resizing the output buffer. (#2092)
Fixed Keras Spark Estimator incompatibility with TensorFlow 1.15 due to `tf.autograph`. (#2069)
Fixed API compatibility with PyTorch 1.6. (#2051)
Fixed Keras API compatibility with TensorFlow 2.4.0. (#2178)
Fixed allgather gradient for TensorFlow 2 in cases where the tensor shape is not known during graph construction. (#2121)
Fixed running using Gloo with an imbalanced number of workers per host. (#2212)
Published by tgaddair over 4 years ago