Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
License: Other
Published by maxhgerlach over 1 year ago
- Added `get_local_and_global_gradients` to `PartialDistributedGradientTape` to retrieve local and non-local gradients separately. (#3859)
- Ensured that `tf.logical_and` within allreduce `tf.cond` runs on CPU. (#3885)
- The `CUDA_VISIBLE_DEVICES` environment variable is no longer passed to remote nodes. (#3865)

Published by maxhgerlach over 1 year ago
- Added the `PartialDistributedOptimizer` API. (#3738)
- Added the `HOROVOD_SPARK_USE_LOCAL_RANK_GPU_INDEX` environment variable to ignore GPU device indices assigned by Spark and always use the local rank GPU device in Spark estimators. (#3737)
- Added support for `prescale_factor` and `postscale_factor` and moved averaging into the Horovod backend. (#3815)
- `MPI_GPUAllgather`. (#3727)
- `SyncBatchNormalization._moments()`. (#3775)
- Fixed handling of `tf.IndexedSlices` types when scaling local gradients. (#3786)
- `MEMCPY_IN_FUSION_BUFFER` timeline event for reducescatter. (#3808)
- `tf.IndexedSlices`. (#3813)
- Fixed `DistributedOptimizer` with Keras 2.11+. (#3822)

Published by EnricoMi about 2 years ago
- Added `register_local_var` functionality to distributed optimizers and local gradient aggregators. (#3695)
- `BroadcastGlobalVariablesCallback`. (#3703)
- Enabled use of the `ncclAvg` op for NCCL allreduces. (#3646)
- Added support for additional reduction operations for `allreduce` (min, max, product). (#3660)
- Added 2D torus `allreduce` using NCCL. (#3608)
- Added `int8` and `uint8` support for `allreduce` and `grouped_allreduce` in TensorFlow. (#3649)
- Added batched memory copies in `GPUAllgather`. (#3590)
- Added batched memory copies in `GPUReducescatter`. (#3621)
- Added `hvd.grouped_allgather()` and `hvd.grouped_reducescatter()` operations. (#3594)
- Added `register_local_source` and `use_generic_names` functionality to `DistributedGradientTape`. (#3628)
- Added the `PartialDistributedGradientTape()` API for model parallel use cases. (#3643)
- Added `reader_worker_count` and `reader_pool_type`. (#3612)
- Added `transformation_edit_fields` and `transformation_removed_fields` param for `EstimatorParams`. (#3651)
- `hvd.grouped_allreduce()`. (#3594)
- `alltoall`. (#3654)
- Changed the default Petastorm reader pool from `process` to `thread` for lower memory usage. (#3665)
- Use `gather` rather than `allgather`. (#3633)
- Use `packaging.version` instead of `distutils` version classes. (#3700)
- Deprecated `shuffle_buffer_size` from `EstimatorParams`; use `shuffle` to enable or disable shuffling. (#3665)
- `backward_passes_per_step > 1`. (#3631)
- Fixed `FuseResponses()` on `BATCHED_D2D_PADDING` edge cases for Reducescatter and/or ROCm. (#3621)
- Raise `HorovodInternalError` rather than `RuntimeError`. (#3594)
- Fixed finding `nvcc` (if not in `$PATH`) with older versions of CMake. (#3682)
- Fixed `reducescatter()` and `grouped_reducescatter()` to raise clean exceptions for scalar inputs. (#3699)
- `torch/` directory to be hipified. (#3588)
- `FindPytorch.cmake`. (#3593)

Published by EnricoMi over 2 years ago
- Added the `hvd.reducescatter()` operation with implementations in NCCL, MPI, and Gloo. (#3299, #3574)
- `multiprocessing`. (#3580)
- `op` API. (#3299)
- Use `zeros_like` to avoid an infinite gradient not being correctly cleared up. (#3505)
- Made `HorovodVersionMismatchError` subclass `ImportError` instead of just a standard `Exception`. (#3549)
- Exit (`--check-build`) via `sys.exit` to flush stdout. (#3272)
- Use `env` to set environment vars in remote shell. (#3489)
- Deprecated the `average` argument of allreduce functions. (#3299)
- Passing a comma-separated string to `--network-interface` is deprecated; pass `--network-interface` multiple times or use `--network-interfaces` instead. (#3506)
- `network_interface` with a comma-separated string is deprecated; use `network_interfaces` with `Iterable[str]` instead. (#3506)
- Fixed `transform_spec` for the Petastorm datamodule. (#3543)
- `save_hyperparameters()`. (#3527)
- `HostDiscoveryScript`. (#3490)
- `tensorflow_accelerator_device_info`. (#3513)
- `AggregationHelperEager`. (#3496)
- Fixed `NotFoundError` in TF `AggregationHelper`. (#3499)

Published by EnricoMi over 2 years ago
Published by tgaddair over 2 years ago
Published by EnricoMi over 2 years ago
- Fixed the build when `nvcc` is not in `$PATH`. (#3444)

Published by tgaddair over 2 years ago
- Moved the released Docker images `horovod` and `horovod-cpu` to Ubuntu 20.04 and Python 3.8. (#3393)
- Removed the `h5py<3` constraint, as it is no longer needed for TensorFlow > 2.5.0. (#3301)
- `tensorflow.python.keras` with tensorflow 2.6.0+. (#3403)
- `pytorch_lightning_mnist.py` example. (#3245, #3290)
- Fixed `hvd.barrier()` tensor queue management. (#3300)
- `limit_train_batches` and `limit_val_batches`. (#3237)
- Fixed `hvd.elastic.state` losing some indices of processed samples when nodes dropped. (#3143)
- `num_workers=0` and `num_hosts=None`. (#3210)
- Fixed `dirpath` typo. (#3204)

Published by tgaddair about 3 years ago
Added process sets to concurrently run collective operations on subsets of Horovod processes in TensorFlow, PyTorch, and MXNet. (#2839, #3042, #3043, #3054, #3083, #3090)
Added XLA support for Allreduce via `tf.function(jit_compile=True)`. (#3053) (see the sketch after this list)
Added fused buffer scaling and unpack/pack kernels on GPU. (#2973)
Added support for NCCL on CUDA 11.4. (#3182)
Added fp16 compression for MXNet. (#2987)
Added terminate_on_nan flag to Spark Lightning estimator. (#3088)
Added barrier() API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. (#3139)
Added params for customizing Tensorboard callback. (#3153)
Added `hvd.cross_rank()` for keras. (#3008)
Implemented more asynchronous dependency handling on GPU. (#2963)
Ray: RayExecutor will now use the current placement group instead of always creating a new one. (#3134)
Lightning: turned off shuffling for validation dataset. (#2974)
Extended `hvd.join()` to return the last rank that joined. (#3097)
Fixed Horovod develop/editable install mode and incremental builds. (#3074)
Estimator/Lightning: use lightning datamodule. (#3084)
Fixed Horovod Spark StringType and numpy type mapping issue. (#3146)
Fixed error in Keras LearningRateScheduler. (#3135)
Fixed bug in Lightning Profiler on Ray. (#3122)
Fixed torch op lazy release to prevent OOM in elastic training. (#3110)
Lightning: Fixed usage of the checkpoint callback. (#3186)
Fixed MPICH support to use Intel MPI's implementation. (#3148)
Fixed race condition in PyTorch async dataloader. (#3120)
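Two of the additions above are easiest to see in code. Below is a minimal sketch (not the library's own example), assuming a 4-process GPU TensorFlow job; per the docs, the XLA path additionally requires setting `HOROVOD_ENABLE_XLA_OPS=1`:

```python
# Minimal sketch: process sets + XLA-compiled allreduce (assumes 4 ranks).
import os
os.environ["HOROVOD_ENABLE_XLA_OPS"] = "1"  # opt in to Horovod's XLA ops

import tensorflow as tf
import horovod.tensorflow as hvd

# Process sets let collectives run concurrently on subsets of ranks.
even_set = hvd.ProcessSet([0, 2])
odd_set = hvd.ProcessSet([1, 3])
hvd.init(process_sets=[even_set, odd_set])

@tf.function(jit_compile=True)  # XLA-compiled Allreduce (#3053)
def xla_mean(x):
    return hvd.allreduce(x, op=hvd.Average)

x = tf.ones([4]) * hvd.rank()
print(xla_mean(x))  # averaged across all ranks

# Allreduce only among the even ranks.
if hvd.rank() % 2 == 0:
    y = hvd.allreduce(x, process_set=even_set)
```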
Published by tgaddair over 3 years ago
Estimator: added support for loading data from S3, GCS, ADLS, and other remote filesystems. (#2927)
Estimator: added custom Spark data loader interface. (#2938)
LightningEstimator: added support to supply a logger and associated parameter to control the frequency of logging. (#2926)
Estimator: added check to ensure all ranks have the same device type. (#2942)
Changed behavior to use TensorBoardLogger only as a fallback when a logger is not supplied. (#2926)
Ray: disabled capturing child tasks in placement group. (#2920)
Published by tgaddair over 3 years ago
Added pytorch_lightning spark estimator which enables training pytorch_lightning models. (#2713)
Added NVTX tracing hooks for profiling with Nsight Systems. (#2723)
Added a generic `num_workers` API for `RayExecutor`. (#2870)
Supports Ray Client without code changes. (#2882)
Supports in-memory cache option for Keras Estimator. (#2896)
Added FP16 support for GPU tensors in MXNet. (#2915)
Added response caching for allgather operations. (#2872)
Estimator: added petastorm `reader_pool_type` to the constructor. (#2903)
Changed `alltoall` to return the received splits as a second return value if non-uniform splits are sent. (#2631) (see the sketch after this list)
Changed `RayExecutor` to use Ray Placement Groups for worker colocation. (#2824)
Changed in-memory dataloader usage for Torch Estimator with the petastorm v0.11.0 release. (#2896)
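For the non-uniform case in the `alltoall` change above, a minimal sketch in PyTorch (assuming a job with exactly two workers; values chosen for illustration):

```python
# Minimal sketch: hvd.alltoall returns the received splits as a second
# value when non-uniform splits are sent (#2631). Assumes 2 workers.
import torch
import horovod.torch as hvd

hvd.init()

tensor = torch.arange(4) + 10 * hvd.rank()  # rank 0: [0..3], rank 1: [10..13]
splits = torch.tensor([1, 3])               # 1 element to rank 0, 3 to rank 1

gathered, received_splits = hvd.alltoall(tensor, splits=splits)
# rank 0: gathered = [0, 10],               received_splits = [1, 1]
# rank 1: gathered = [1, 2, 3, 11, 12, 13], received_splits = [3, 3]
```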
Published by tgaddair almost 4 years ago
Added support for `backward_passes_per_step > 1` for TF Keras graph mode. (#2346) (see the sketch after this list)
Added support for backward_passes_per_step > 1 for TF Keras eager execution. (#2371)
Added support for backward_passes_per_step > 1 for TF LegacyOptimizer in graph mode. (#2401)
Added grouped allreduce to enable more efficient tensor fusion and deterministic training. (#2453)
Added support for specifying `op` and `compression` in `horovod.tensorflow.keras.allreduce()`. (#2423)
Added support for batched D2D memcopy kernel on GPU. (#2435)
Added schema inference in Spark Estimator without sampling. (#2373)
Added Store.create("dbfs:/")
mapping to DBFSLocalStore("/dbfs/...")
. (#2376)
Changed Keras callbacks to require parameter `initial_lr` of `LearningRateScheduleCallback` and `LearningRateWarmupCallback`. (#2459)
Changed default cycle time from 5ms to 1ms and fusion threshold from 64MB to 128MB. (#2468)
Fixed support for TensorFlow v2.4.0. (#2381)
Fixed averaging using CUDA half2 implementation for one-element half buffers. (#2375)
Fixed `HOROVOD_THREAD_AFFINITY` when using oneCCL. (#2350)
Added timeout to SSH check in horovodrun to prevent hanging. (#2448)
Added `HOROVOD_GLOO_TIMEOUT_SECONDS` value to error messages. (#2436)
Fixed race condition in dynamic timeline API. (#2341)
Fixed `--log-hide-timestamp` to apply to driver logs with Gloo. (#2388)
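A minimal sketch of the `backward_passes_per_step` support added above, for `horovod.tensorflow.keras` (the model and learning rate are illustrative): gradients are aggregated locally for N batches before each allreduce, reducing communication frequency.

```python
# Minimal sketch: aggregate gradients locally over 4 batches per allreduce.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())

# Gradients are accumulated for 4 steps, then allreduced and applied once.
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)

model.compile(optimizer=opt, loss="mse")
```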
Published by tgaddair about 4 years ago
Added Databricks storage `DBFSLocalStore` and support for GPU-aware scheduling to horovod.spark Estimator. (#2234)
Added ElasticSampler and PyTorch Elastic ImageNet example. (#2297)
Added ability to dynamically start and stop timeline programmatically. (#2215)
Added support for Gloo on macOS. (#2254)
Exposed name argument to TensorFlow allreduce operation. (#2325)
Added option to strip outer name scope from Horovod ops in TensorFlow. (#2328)
Fixed usage of VERBOSE=1 when setting custom MAKEFLAGS. (#2239)
Fixed bugs in Keras Elastic Callback classes. (#2289)
Fixed RelWithDebInfo build and made it the default with -O3 optimizations. (#2305)
Fixed usage of tf.cond in TensorFlow alltoall gradient. (#2327)
Fixed allreduce averaging for TF IndexedSlices in ROCm path. (#2279)
Included stdexcept to handle certain compilers / frameworks that don't include it already. (#2238)
Fixed Debug builds by setting compiler options based on CMake build type. (#2263)
Skipped launching zero-sized send/recvs for NCCLAlltoall. (#2273)
Fixed missing run in tf keras elastic mode. (#2272)
Fixed loss function in TensorFlow2 elastic synthetic benchmark. (#2265)
Fixed usage of HOROVOD_MIXED_INSTALL env var in alltoall tests. (#2266)
Removed keras requirement from Ray example. (#2262)
Published by tgaddair about 4 years ago
Elastic training enables Horovod to scale up and down the number of workers dynamically at runtime, without requiring a restart or resuming from checkpoints saved to durable storage. With elastic training, workers can come and go from the Horovod job without interrupting the training process.
Support for auto-scaling can be added to any existing Horovod script with just a few modifications:
1. Decorate retryable train functions with `@hvd.elastic.run`.
2. Track state that needs to be kept in sync across workers in an `hvd.elastic.State` object.

Here's an example for PyTorch:
```python
import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd

hvd.init()

torch.cuda.set_device(hvd.local_rank())

model = ...
dataset = ...

@hvd.elastic.run
def train(state):
    for state.epoch in range(state.epoch, args.epochs + 1):
        dataset.set_epoch(state.epoch)
        dataset.set_batch_idx(state.batch_idx)
        for state.batch_idx, (data, target) in enumerate(dataset):
            state.optimizer.zero_grad()
            output = state.model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            state.optimizer.step()
            state.commit()

optimizer = optim.SGD(model.parameters(), lr * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

def on_state_reset():
    # adjust learning rate on reset
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr * hvd.size()

state = hvd.elastic.TorchState(model, optimizer, epoch=1, batch_idx=0)
state.register_reset_callbacks([on_state_reset])

train(state)
```
Run using `horovodrun` by specifying the minimum and maximum number of worker processes, as well as a "host discovery script" that will be used to find available workers to add at runtime:

```
$ horovodrun -np 8 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.sh python train.py
```
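The discovery script can be any executable that prints one available host per line in the form `<hostname>:<slots>`. A minimal sketch in Python, where `list_available_hosts()` is a hypothetical stand-in for a query to your cluster manager:

```python
#!/usr/bin/env python
# discover_hosts.py -- minimal sketch of a host discovery script.
# Horovod runs this periodically and expects one "<hostname>:<slots>"
# line per available host on stdout.

def list_available_hosts():
    # Hypothetical helper: replace with a query to your cluster manager
    # or cloud API returning (hostname, slot_count) pairs.
    return [("host-1", 4), ("host-2", 4)]

if __name__ == "__main__":
    for hostname, slots in list_available_hosts():
        print(f"{hostname}:{slots}")
```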
Elastic Horovod is supported natively with Spark auto-scaling using the `hvd.spark.run_elastic` API.
For more details, see Elastic Horovod.
Ray is a distributed execution framework that makes it easy to provision and scale distributed applications, and can now be used to execute Horovod jobs without needing to coordinate the workers by hand:
```python
import ray
import horovod.torch as hvd
from horovod.ray import RayExecutor

# Start the Ray cluster or attach to an existing Ray cluster
ray.init()

# Start num_hosts * num_slots actors on the cluster
setting = RayExecutor.create_settings(timeout_s=100)
executor = RayExecutor(
    setting, num_hosts=num_hosts, num_slots=num_slots, use_gpu=True)

# Launch the Ray actors on each machine
# This will launch `num_slots` actors on each machine
executor.start()

# Using the stateless `run` method, a function can take in any args or kwargs
def train_fn():
    hvd.init()
    # Train the model on each worker here
    ...

# Execute the function on all workers at once
results = executor.run(train_fn)

executor.shutdown()
```
Horovod now also integrates with Ray Tune to scale up your hyperparameter search jobs. Check out the example here.
For more details, see Horovod on Ray.
The all-to-all collective can be described as a combination of a scatter and gather, where each worker will scatter a tensor to each worker, while also gathering scattered data from other workers. This type of collective communication can arise in model-parallel training strategies.
The `hvd.alltoall` function takes the form `hvd.alltoall(tensor, splits=None)`, where `tensor` is a multi-dimensional tensor of data to be scattered and `splits` is an optional 1D tensor of integers with length equal to the number of workers, describing how to split and distribute `tensor`. `splits` is applied along the first dimension of `tensor`. If `splits` is not provided, an equal splitting is assumed, where the first dimension is divided by the number of workers.
The implementation supports TensorFlow, PyTorch, and MXNet using the MPI backend, the CUDA-aware MPI backend via `HOROVOD_GPU_ALLTOALL=MPI`, and the NCCL backend via `HOROVOD_GPU_ALLTOALL=NCCL` / `HOROVOD_GPU_OPERATIONS=NCCL`.
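A minimal sketch in PyTorch (assuming two workers; the equivalent call exists for TensorFlow and MXNet):

```python
# Minimal sketch: alltoall with the default equal splitting, 2 workers.
import torch
import horovod.torch as hvd

hvd.init()

# Each worker contributes 4 elements; with no `splits`, the first dimension
# is divided equally, so 2 elements go to rank 0 and 2 to rank 1.
tensor = torch.arange(4) + 10 * hvd.rank()  # rank 0: [0..3], rank 1: [10..13]
result = hvd.alltoall(tensor)
# rank 0: [0, 1, 10, 11]    rank 1: [2, 3, 12, 13]
```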
We've added a `gradient_predivide_factor` parameter in the `DistributedOptimizer`, the purpose of which is to enable splitting the averaging before and after the allreduce. This can be useful in managing the numerical range for mixed precision computations. The `gradient_predivide_factor` is applied as follows:
```
If op == Average, gradient_predivide_factor splits the averaging
before and after the sum. Gradients are scaled by
1.0 / gradient_predivide_factor before the sum and
gradient_predivide_factor / size after the sum.
```
To facilitate this, additional arguments (`prescale_factor` and `postscale_factor`) have been added to the basic `hvd.allreduce` functions, enabling the definition of multiplicative factors to scale the tensors before and after the allreduce, respectively. For efficiency, the pre- and post-scaling is implemented in the Horovod backend on the fused tensor buffer rather than through framework-level operations. For GPU, this required a CUDA kernel implementation to scale the GPU buffer, which in turn required adding compilation of CUDA code to the current build infrastructure.
As an additional general benefit of these changes, gradient averaging in the optimizer can now be carried out within the Horovod backend on the fused tensor buffer using the `postscale_factor` argument, rather than on a tensor-by-tensor basis at the framework level, decreasing the overhead of each allreduce call.
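A minimal sketch of both knobs in PyTorch (the model and factor values are illustrative, not from the release notes):

```python
# Minimal sketch: gradient predivide in the optimizer, and the equivalent
# pre/post scaling on a raw allreduce. With op == Average and
# gradient_predivide_factor = 2.0, gradients are scaled by 1/2.0 before
# the sum and by 2.0/size after it.
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
opt = hvd.DistributedOptimizer(
    opt,
    named_parameters=model.named_parameters(),
    gradient_predivide_factor=2.0,
)

# The same scaling expressed directly with hvd.allreduce:
t = torch.ones(4)
avg = hvd.allreduce(t, op=hvd.Sum,
                    prescale_factor=1.0 / 2.0,
                    postscale_factor=2.0 / hvd.size())
```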
CMake, previously used to compile the optional Gloo controller, is now required to install Horovod. This change introduces a number of exciting benefits for Horovod developers and users:
- Taking advantage of `find_package` modules provided by CMake for MPI, CUDA, etc. to better handle a range of environment configurations
- Simplified build logic (less custom code in `setup.py`)

Added bare-metal elastic mode implementation to enable auto-scaling and fault tolerance. (#1849)
Added Elastic Horovod support for Spark auto-scaling. (#1956)
Added All-to-All operation for TensorFlow, PyTorch, and MXNet. (#2143)
Added support for `gradient_predivide_factor` and averaging in Horovod backend. (#1949)
Added NCCL implementation of the allgather operation. (#1952)
Added `HOROVOD_GPU_OPERATIONS` installation variable to simplify enabling NCCL support for all GPU operations. (#1960)
Added TensorFlow implementation of `SyncBatchNormalization` layer. (#2075)
Added `hvd.is_initialized()` method. (#2020)
Added `hvd.allgather_object` function for TensorFlow, PyTorch, and MXNet. (#2166)
Added `hvd.broadcast_object` function for MXNet. (#2122)
Added `label_shapes` parameter to KerasEstimator and TorchEstimator. (#2140)
Added optional `modelCheckPoint` callback to KerasEstimator params. (#2124)
Added `ssh_identity_file` argument to `horovodrun`. (#2201)
Added support for `horovodrun` on `kubeflow/mpi-job`. (#2199)
Added Ray integration. (#2218)
Moved `horovod.run.runner.run` to `horovod.run`. (#2099)
`HOROVOD_THREAD_AFFINITY` accepts multiple values, one for every Horovod rank. (#2131)
Migrated build system for native libraries to CMake. (#2009)
Dropped support for Python 2. (#1954)
Dropped support for TensorFlow < 1.15. (#2169)
Dropped support for PyTorch < 1.2. (#2086)
Fixed MXNet allgather implementation to correctly handle resizing the output buffer. (#2092)
Fixed Keras Spark Estimator incompatibility with TensorFlow 1.15 due to `tf.autograph`. (#2069)
Fixed API compatibility with PyTorch 1.6. (#2051)
Fixed Keras API compatibility with TensorFlow 2.4.0. (#2178)
Fixed allgather gradient for TensorFlow 2 in cases where the tensor shape is not known during graph construction. (#2121)
Fixed running using Gloo with an imbalanced number of workers per host. (#2212)
Published by tgaddair over 4 years ago