Supercharge Your Model Training
Adds support for dataloaders with rank-dependent lengths. The solution terminates iteration for dataloaders on all ranks when the first dataloader finishes.
Previously, the MosaicML Logger sporadically raised an error while the Python interpreter was shutting down, as it attempted to flush data on `Event.CLOSE` using futures, which cannot be scheduled at that time. Instead, we now only block on finishing existing data uploads on `Event.CLOSE`, avoiding scheduling new futures.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.4...v0.23.5
Published by mvpatel2000 4 months ago
1. Patch PyTorch 2.3.1 (https://github.com/mosaicml/composer/pull/3419)
Fixes missing import when monkeypatching device mesh functions in PyTorch 2.3.1. This is necessary for MoE training.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.3...v0.23.4
Published by karan6181 4 months ago
We've enhanced the MLflow logger's `log_image` function to use the new API with time-dimension support, enabling images to be viewed in MLflow.
We've added the `logging_buffer_seconds` argument to the MLflow logger, which specifies how many seconds to buffer before sending logs to the MLflow tracking server.
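As a minimal sketch (the 30-second buffer and experiment name are illustrative, and `model` is assumed to be defined elsewhere), the buffer can be set when constructing the logger:
from composer import Trainer
from composer.loggers import MLFlowLogger

# Buffer logs locally for 30 seconds before sending them to the tracking server.
mlflow_logger = MLFlowLogger(
    experiment_name='my-first-project',
    logging_buffer_seconds=30,
)

trainer = Trainer(
    model=model,
    ...
    loggers=mlflow_logger,
)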
Require `databricks-sdk` only when on the Databricks platform (#3389): previously, MLFlow always imported the `databricks-sdk`. Now, we only require the SDK when on the Databricks platform and using Databricks secrets to access managed MLFlow.
Previously, when loading a checkpoint with `train_dataloader`, the `dataset_state` would load first, and if `train_dataloader` was set again afterward, `load_state_dict` would be called with a `None` value. Now, we've added a check in the `train_dataloader` setter to skip this redundant load.
In CUDA 12.4, the out-of-memory error message has changed to `CUDA error: out of memory`. Previously, our logic hardcoded checks for `CUDA out of memory` when using `device_train_microbatch_size="auto"`. Now, we check for both `CUDA out of memory` and `CUDA error: out of memory`.
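Automatic microbatching, which relies on catching these error messages, is enabled the same way as before; a minimal sketch (assuming `model` and `train_dataloader` are defined):
from composer import Trainer

# Composer catches CUDA OOMs and automatically halves the microbatch size.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    device_train_microbatch_size='auto',
)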
Skip the `/Users/` prepend for paths with the `/Shared/` prefix (#3410): previously, for MLflow logging on the Databricks platform, we prepended `/Users/` to all user-provided logging paths (if not already specified), including paths starting with `/Shared/`. This was incorrect, since `/Shared/` indicates a shared workspace. Now, the `/Users/` prepend is skipped for paths starting with `/Shared/`.
`databricks-sdk` when inside the Databricks platform by @antoinebrl in https://github.com/mosaicml/composer/pull/3389
`flash-attn`'s CE loss for metrics by @snarayan21 in https://github.com/mosaicml/composer/pull/3394
`flash-attn`'s CE loss for metrics (#3394)" by @snarayan21 in https://github.com/mosaicml/composer/pull/3408
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.2...v0.23.3
Published by bigning 5 months ago
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.1...release/v0.23.2
Published by mvpatel2000 5 months ago
1. PyTorch 2.3.1 Upgrade
Composer now supports PyTorch 2.3.1.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.0...v0.23.1
Published by bigning 5 months ago
1. Parallelism V2 + Tensor Parallel (#3335)
Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, Composer accepted an `fsdp_config` attribute in the Trainer:
trainer = Trainer(model, fsdp_config = {'sharding_strategy': 'FULL_SHARD'})
As we generalize to more forms of parallelism, we've deprecated `fsdp_config` in favor of `parallelism_config`:
trainer = Trainer(
    model = model,
    ...
    parallelism_config = {
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,  # Size of TP dimension
            'layer_plan': ...  # describes how to TP layers
        },
    },
)
As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so this migrates to the new backend which avoids various checkpointing bugs.
See the docs for tensor parallel for more information. Note that tensor parallelism is still experimental and may be subject to breaking API changes; additionally, not all checkpointing features may work with this parallelism yet.
2. MLFlow API Simplification
Previously, MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:
mlflow_logger = MLFlowLogger(
    tracking_uri = 'databricks',
    experiment_name = '/Users/[email protected]/my-first-project/'
)

trainer = Trainer(
    model = model,
    ...
    loggers = mlflow_logger,
)
Now, if you are using Databricks secrets as an environment variable, Composer will autopopulate `tracking_uri` and the `experiment_name` prefix:
trainer = Trainer(
    model = model,
    ...
    loggers = MLFlowLogger(experiment_name='my-first-project'),
)
3. Wallclock Save Interval
Composer now supports setting a save interval in wallclock time:
trainer = Trainer(
    model = model,
    ...
    save_interval='30m',
)
Note that most durations, such as `max_duration`, do not accept wallclock time, and the initial version of this feature is limited to a subset of time features like `save_interval`.
`evaluator.dataloader.device_eval_batch_size` with `evaluator.device_eval_microbatch_size` by @ShashankMosaicML in https://github.com/mosaicml/composer/pull/3247
`load_fsdp_monolith_` with `load_monolith_` by @milocress in https://github.com/mosaicml/composer/pull/3288
`CheckpointSaver` instantiation inside the `Trainer` by @antoinebrl in https://github.com/mosaicml/composer/pull/3334
Full Changelog: https://github.com/mosaicml/composer/compare/v0.22.0...v0.23.0
Published by snarayan21 6 months ago
Composer now supports the recently-released PyTorch version 2.3.0! Please raise any issues with us so we can address them.
`rename_metrics` to the MLflow logger by @hanlint in https://github.com/mosaicml/composer/pull/3225
`run_group` by @chenmoneygithub in https://github.com/mosaicml/composer/pull/3208
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.3...v0.22.0
Published by mvpatel2000 6 months ago
1. Increased Robustness to Checkpoint Loading
We've patched several edge cases in loading sharded checkpoints, especially with DTensors, which should decrease memory usage when loading checkpoints. We've also hardened retry logic against cloud object store failures, ensuring higher robustness to transient network issues.
`NeptuneLogger` by @AleksanderWWW in https://github.com/mosaicml/composer/pull/3165
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.2...v0.21.3
Published by mvpatel2000 7 months ago
Composer currently monkeypatches PyTorch for nightly versions in order to fix upstream bugs. With the release of torch 2.2.2, these monkeypatches were mistakenly applied to the stable release due to incorrect gating on imports. This release fixes the gating, enabling torch 2.2.2.
Due to bugs in computing torchmetrics on Mac devices, we move metric computation onto the CPU. Previously, this had issues with data not being properly moved to the CPU.
Thank you to @hyenal for this contribution!
Composer now supports batch samplers, which previously resulted in an error if specified in the dataloader.
Thank you to @Ghelfi for this contribution!
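A minimal sketch of passing a dataloader built with a batch sampler to the Trainer (the dataset, sampler settings, and `model` are placeholders):
from torch.utils.data import BatchSampler, DataLoader, RandomSampler
from composer import Trainer

# Build a dataloader that uses batch_sampler instead of batch_size + sampler.
batch_sampler = BatchSampler(RandomSampler(train_dataset), batch_size=32, drop_last=False)
train_dataloader = DataLoader(train_dataset, batch_sampler=batch_sampler)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
)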
`set_epoch` on `Dataloader.batch_sampler` if defined by @Ghelfi in https://github.com/mosaicml/composer/pull/3124
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.1...v0.21.2
Published by mvpatel2000 7 months ago
The previous release broke checkpoint loading when using HSDP with multiple replicas. This patch release fixes checkpoint loading.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.0...v0.21.1
Published by mvpatel2000 7 months ago
The Memory Monitor callback now supports aggregating memory statistics across nodes. Getting summary stats for a run's memory usage across the cluster can dramatically help debug straggler nodes or non-homogeneous workloads. The memory monitor can now aggregate and log combined values at a user-specified frequency.
Example:
from composer import Trainer
from composer.callbacks import MemoryMonitor
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        MemoryMonitor(
            dist_aggregate_batch_interval=10,  # aggregate every 10 batches
        )
    ],
)
Large model checkpoints can be expensive to store and transfer. In this release, we've upgraded our compression support to accept several new formats, which offer better compression-time tradeoffs using CLI tools. To use compression, post-fix your checkpoint name with a compression extension (such as `.lz4` in the example below).
Example:
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    save_filename='ep{epoch}-ba{batch}-rank{rank}.pt.lz4',
)
Thank you to @mbway for adding this support!
`post_close` call by @chenmoneygithub in https://github.com/mosaicml/composer/pull/3093
`NeptuneLogger` by @AleksanderWWW in https://github.com/mosaicml/composer/pull/3085
`NeptuneLogger`" by @mvpatel2000 in https://github.com/mosaicml/composer/pull/3111
Full Changelog: https://github.com/mosaicml/composer/compare/v0.20.1...v0.21.0
Published by mvpatel2000 8 months ago
Composer now supports torch 2.2.1! We've raised the pin to allow the latest torch, and we've upstreamed all torch monkeypatches so Composer can run out of the box with the latest and greatest torch features.
Published by j316chuck 8 months ago
Composer now supports logging training data to neptune.ai using the `NeptuneLogger`. To get started:
from composer.loggers import NeptuneLogger

# Use your own Neptune project name and API token here.
neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)
We also have an example project demonstrating all the awesome things you can do with this integration!
Additional information on the `NeptuneLogger` can be found in the docs.
Composer now has an OOM observer callback. When a model runs out of memory, this callback helps produce a trace which identifies memory allocations, which can be critical to designing strategies to mitigate memory usage.
Example:
from composer import Trainer
from composer.callbacks import OOMObserver
# constructing trainer object with this callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)
OOM Visualization:
Composer has expanded its integration with the MosaicML platform. Now, we can view stdout/stderr from all GPU ranks with MCLI logs, enabling more comprehensive analysis of jobs.
Example commands:
mcli logs <run-name> --node x --gpu x
Note, this defaults to node rank 0 if `--node` is not provided.
Also, we can find the logs of any global GPU rank with the command:
mcli logs <run-name> --global-gpu-rank x
`update_metric` by @maxisawesome in https://github.com/mosaicml/composer/pull/2965
Full Changelog: https://github.com/mosaicml/composer/compare/v0.19.1...v0.20.0
Published by milocress 9 months ago
1. New Event: BEFORE_LOAD (#2974)
Composer now has the event `Event.BEFORE_LOAD`, which lets users modify state before a model is loaded. This is particularly useful for accessing certain attributes which may not exist at `Event.INIT`, such as the dataloader state.
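As a minimal sketch, a custom callback can hook the new event by defining a `before_load` method; the callback below is hypothetical and only prints the dataloader:
from composer.core import Callback, State
from composer.loggers import Logger

class DataloaderInspector(Callback):
    """Hypothetical callback that inspects state right before a checkpoint is loaded."""

    def before_load(self, state: State, logger: Logger) -> None:
        # Runs on Event.BEFORE_LOAD, after the dataloader has been set on state.
        print(f'Train dataloader before load: {state.train_dataloader}')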
2. Registering model in MLFlow with run id (#2967)
The MLFlow logger now has `register_model_with_run_id`, which allows users to register a model based on the run ID. This is a different way of registering the model which preserves the link to the MLflow runs.
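A sketch of how this might be used; the `model_uri` and `name` argument names are assumptions about the exact signature, so check the MLFlowLogger docs before relying on them:
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(experiment_name='my-first-project')

# After training, register the saved model against the logger's active run.
# NOTE: argument names below are assumed for illustration.
mlflow_logger.register_model_with_run_id(
    model_uri='dbfs:/path/to/saved/model',
    name='my-registered-model',
)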
Full Changelog: https://github.com/mosaicml/composer/compare/v0.19.0...v0.19.1
Published by j316chuck 9 months ago
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
Composer now has improved communication/computation overlap in our FSDP code which should improve MFU across several architectures.
Initial support for Python 3.11 + Torch 2.2 has been added to Composer.
PEFT LoRA is now supported in the HuggingFaceModel class (see the sketch below).
`in_context_learning_evaluation.py` has a new design with cleaner abstractions and easier interfaces to work with.
Composer now supports saving your model in Azure.
Composer now supports saving your model in MLFlow.
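Regarding the PEFT LoRA support mentioned above, here is a minimal sketch of wrapping a Hugging Face model with a LoRA config; the `peft_config` argument to `HuggingFaceModel` is an assumption about the interface, so consult the docs for the exact usage:
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer.models import HuggingFaceModel

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# NOTE: passing a LoRA config directly is assumed here; alternatively, wrap the model
# with peft.get_peft_model(...) before constructing HuggingFaceModel.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=['c_attn'], task_type='CAUSAL_LM')
composer_model = HuggingFaceModel(model, tokenizer=tokenizer, peft_config=lora_config)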
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.2...v0.19.0
Published by b-chu 9 months ago
Full Changelog: https://github.com/mosaicml/composer/compare/v0.18.1...v0.18.2
Published by b-chu 9 months ago
Full Changelog: https://github.com/mosaicml/composer/compare/v0.18.0...v0.18.1
Published by b-chu 9 months ago
This release has been yanked, please skip directly to Composer v0.18.1
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.2...v0.18.0
Published by irenedea 9 months ago
Enables elastic saving and loading of DTensors at various mesh sizes.
Artifacts, such as checkpoints, can now be logged to Databricks-managed MLFlow.
composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    ...
)
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.2...v0.18.0
Published by mvpatel2000 10 months ago
1. Torch 2.1.1 Support
Composer now supports torch 2.1.1! This new release primarily fixes several small bugs that we had previously monkeypatched in Composer.
2. Faster OCI Upload/Download
Composer now supports multi-part upload/download to OCI, which should speed up object store transfers.
3. Memory Profiling
We've expanded the torch profiler integration to support memory profiling. Now, when the profiler is enabled, you will get a trace showing how memory utilization is broken down by various components on your GPUs.
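A minimal sketch of enabling the profiler with a memory trace; the `torch_prof_memory_filename` argument name is an assumption based on this release's profiler additions, so verify the exact parameters in the profiler docs (`model` and `train_dataloader` are placeholders):
from composer import Trainer
from composer.profiler import JSONTraceHandler, Profiler, cyclic_schedule

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='2ba',
    profiler=Profiler(
        trace_handlers=[JSONTraceHandler(folder='composer_traces', overwrite=True)],
        schedule=cyclic_schedule(wait=0, warmup=1, active=4, repeat=1),
        torch_prof_folder='torch_traces',
        torch_prof_overwrite=True,
        # NOTE: parameter name assumed; writes a memory timeline alongside the torch trace.
        torch_prof_memory_filename='rank{rank}.memory.html',
    ),
)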
1. FSDP Initialization with Meta
Previously, our FSDP integration had a bug with initializing weights when using `device=meta`, which resulted in an additional scaling. This has now been fixed, so `device` and distributed strategies should not affect parallelization strategy.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.1...v0.17.2