Supercharge Your Model Training
Adds support for dataloaders with rank-dependent lengths. The solution terminates iteration for dataloaders on all ranks when the first dataloader finishes.
Previously, the MosaicML Logger sporadically raised an error while the Python interpreter was shutting down, as it attempted to flush data on `Event.CLOSE` using futures, which cannot be scheduled at that time. Instead, we now only block on finishing existing data uploads on `Event.CLOSE`, avoiding scheduling new futures.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.4...v0.23.5
Published by mvpatel2000 4 months ago
1. Patch PyTorch 2.3.1 (https://github.com/mosaicml/composer/pull/3419)
Fixes missing import when monkeypatching device mesh functions in PyTorch 2.3.1. This is necessary for MoE training.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.3...v0.23.4
Published by karan6181 4 months ago
We've enhanced the MLflow logger's `log_image` function to use the new API with time-dimension support, enabling images to be viewed in MLflow.
We've added the `logging_buffer_seconds` argument to the MLflow logger, which specifies how many seconds to buffer before sending logs to the MLflow tracking server.
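As a minimal sketch (the 30-second buffer and experiment name are illustrative, and `model` is assumed to be defined elsewhere), the buffer can be set when constructing the logger:
from composer import Trainer
from composer.loggers import MLFlowLogger

# Buffer logs locally for 30 seconds before sending them to the tracking server.
mlflow_logger = MLFlowLogger(
    experiment_name='my-first-project',
    logging_buffer_seconds=30,
)

trainer = Trainer(
    model=model,
    ...
    loggers=mlflow_logger,
)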
Require `databricks-sdk` only when on the Databricks platform (#3389): previously, MLFlow always imported the `databricks-sdk`. Now, we only require the SDK when on the Databricks platform and using Databricks secrets to access managed MLFlow.
Previously, when loading a checkpoint with `train_dataloader`, the `dataset_state` would load first, and if `train_dataloader` was set again afterward, `load_state_dict` would be called with a `None` value. Now, we've added a check in the `train_dataloader` setter to skip this redundant load.
In CUDA 12.4, the out-of-memory error message has changed to `CUDA error: out of memory`. Previously, our logic hardcoded checks for `CUDA out of memory` when using `device_train_microbatch_size="auto"`. Now, we check for both `CUDA out of memory` and `CUDA error: out of memory`.
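Automatic microbatching, which relies on catching these error messages, is enabled the same way as before; a minimal sketch (assuming `model` and `train_dataloader` are defined):
from composer import Trainer

# Composer catches CUDA OOMs and automatically halves the microbatch size.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    device_train_microbatch_size='auto',
)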
Skip the `/Users/` prepend for paths with the `/Shared/` prefix (#3410): previously, for MLflow logging on the Databricks platform, we prepended `/Users/` to all user-provided logging paths (if not already specified), including paths starting with `/Shared/`. This was incorrect, since `/Shared/` indicates a shared workspace. Now, the `/Users/` prepend is skipped for paths starting with `/Shared/`.
`databricks-sdk` when inside the Databricks platform by @antoinebrl in https://github.com/mosaicml/composer/pull/3389
`flash-attn`'s CE loss for metrics by @snarayan21 in https://github.com/mosaicml/composer/pull/3394
`flash-attn`'s CE loss for metrics (#3394)" by @snarayan21 in https://github.com/mosaicml/composer/pull/3408
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.2...v0.23.3
Published by bigning 5 months ago
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.1...release/v0.23.2
Published by mvpatel2000 5 months ago
1. PyTorch 2.3.1 Upgrade
Composer now supports PyTorch 2.3.1.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.23.0...v0.23.1
Published by bigning 5 months ago
1. Parallelism V2 + Tensor Parallel (#3335)
Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, Composer accepted an `fsdp_config` attribute in the Trainer:
trainer = Trainer(model, fsdp_config = {'sharding_strategy': 'FULL_SHARD'})
As we generalize to more forms of parallelism, we've deprecated `fsdp_config` in favor of `parallelism_config`:
trainer = Trainer(
    model = model,
    ...
    parallelism_config = {
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,  # Size of TP dimension
            'layer_plan': ...  # describes how to TP layers
        },
    },
)
As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so this migrates to the new backend which avoids various checkpointing bugs.
See the docs for tensor parallel for more information. Note that tensor parallelism is still experimental and may be subject to breaking API changes; additionally, not all checkpointing features may work with this parallelism yet.
2. MLFlow API Simplification
Previously, MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:
mlflow_logger = MLFlowLogger(
    tracking_uri = 'databricks',
    experiment_name = '/Users/[email protected]/my-first-project/'
)

trainer = Trainer(
    model = model,
    ...
    loggers = mlflow_logger,
)
Now, if you are using Databricks secrets as an environment variable, Composer will autopopulate `tracking_uri` and the `experiment_name` prefix:
trainer = Trainer(
    model = model,
    ...
    loggers = MLFlowLogger(experiment_name='my-first-project'),
)
3. Wallclock Save Interval
Composer now supports setting a save interval in wallclock time:
trainer = Trainer(
    model = model,
    ...
    save_interval='30m',
)
Note that most durations, such as `max_duration`, do not accept wallclock time, and the initial version of this feature is limited to a subset of time features like `save_interval`.
`evaluator.dataloader.device_eval_batch_size` with `evaluator.device_eval_microbatch_size` by @ShashankMosaicML in https://github.com/mosaicml/composer/pull/3247
`load_fsdp_monolith_` with `load_monolith_` by @milocress in https://github.com/mosaicml/composer/pull/3288
`CheckpointSaver` instantiation inside the `Trainer` by @antoinebrl in https://github.com/mosaicml/composer/pull/3334
Full Changelog: https://github.com/mosaicml/composer/compare/v0.22.0...v0.23.0
Published by snarayan21 6 months ago
Composer now supports the recently-released PyTorch version 2.3.0! Please raise any issues with us so we can address them.
`rename_metrics` to the MLflow logger by @hanlint in https://github.com/mosaicml/composer/pull/3225
`run_group` by @chenmoneygithub in https://github.com/mosaicml/composer/pull/3208
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.3...v0.22.0
Published by mvpatel2000 6 months ago
1. Increased Robustness to Checkpoint Loading
We've patched several edge cases in loading sharded checkpoints, especially with DTensors, which should decrease memory usage when loading checkpoints. We've also hardened retry logic against cloud object store failures, ensuring higher robustness to transient network issues.
`NeptuneLogger` by @AleksanderWWW in https://github.com/mosaicml/composer/pull/3165
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.2...v0.21.3
Published by mvpatel2000 7 months ago
Composer currently monkeypatches PyTorch for nightly versions in order to fix upstream bugs. With the release of torch 2.2.2, these monkeypatches were mistakenly applied to the stable release due to incorrect gating on imports. This release fixes the gating, enabling torch 2.2.2.
Due to bugs in computing torchmetrics on Mac devices, we move metric computation onto the CPU. Previously, this had issues with data not being properly moved to the CPU.
Thank you to @hyenal for this contribution!
Composer now supports batch samplers, which previously resulted in an error if specified in the dataloader.
Thank you to @Ghelfi for this contribution!
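A minimal sketch of passing a dataloader built with a batch sampler to the Trainer (the dataset, sampler settings, and `model` are placeholders):
from torch.utils.data import BatchSampler, DataLoader, RandomSampler
from composer import Trainer

# Build a dataloader that uses batch_sampler instead of batch_size + sampler.
batch_sampler = BatchSampler(RandomSampler(train_dataset), batch_size=32, drop_last=False)
train_dataloader = DataLoader(train_dataset, batch_sampler=batch_sampler)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
)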
`set_epoch` on `Dataloader.batch_sampler` if defined by @Ghelfi in https://github.com/mosaicml/composer/pull/3124
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.1...v0.21.2
Published by mvpatel2000 7 months ago
The previous release broke checkpoint loading when using HSDP with multiple replicas. This patch release fixes checkpoint loading.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.21.0...v0.21.1
Published by mvpatel2000 7 months ago
The Memory Monitor callback now supports aggregating memory statistics across nodes. Getting summary stats for a run's memory usage across the cluster can dramatically help debug straggler nodes or non-homogeneous workloads. The memory monitor can now aggregate and log combined values at a user-specified frequency.
Example:
from composer import Trainer
from composer.callbacks import MemoryMonitor
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        MemoryMonitor(
            dist_aggregate_batch_interval=10,  # aggregate every 10 batches
        )
    ],
)
Large model checkpoints can be expensive to store and transfer. In this release, we've upgraded our compression support to accept several new formats, which offer better compression-time tradeoffs using CLI tools. To use compression, post-fix your checkpoint name with a compression extension (such as `.lz4` in the example below).
Example:
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    save_filename='ep{epoch}-ba{batch}-rank{rank}.pt.lz4',
)
Thank you to @mbway for adding this support!
`post_close` call by @chenmoneygithub in https://github.com/mosaicml/composer/pull/3093
`NeptuneLogger` by @AleksanderWWW in https://github.com/mosaicml/composer/pull/3085
`NeptuneLogger`" by @mvpatel2000 in https://github.com/mosaicml/composer/pull/3111
Full Changelog: https://github.com/mosaicml/composer/compare/v0.20.1...v0.21.0
Published by mvpatel2000 8 months ago
Composer now supports torch 2.2.1! We've raised the pin to allow the latest torch, and we've upstreamed all torch monkeypatches so Composer can run out of the box with the latest and greatest torch features.
Published by j316chuck 8 months ago
Composer now supports logging training data to neptune.ai using the `NeptuneLogger`. To get started:
from composer.loggers import NeptuneLogger

# Use your own Neptune project name and API token here.
neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)
We also have an example project demonstrating all the awesome things you can do with this integration!
Additional information on the `NeptuneLogger` can be found in the docs.
Composer now has an OOM observer callback. When a model runs out of memory, this callback helps produce a trace which identifies memory allocations, which can be critical to designing strategies to mitigate memory usage.
Example:
from composer import Trainer
from composer.callbacks import OOMObserver
# constructing trainer object with this callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)
OOM Visualization:
Composer has expanded its integration with the MosaicML platform. Now, we can view stdout/stderr from all GPU ranks with MCLI logs, enabling more comprehensive analysis of jobs.
Example commands:
mcli logs <run-name> --node x --gpu x
Note, this defaults to node rank 0 if `--node` is not provided.
Also, we can find the logs of any global GPU rank with the command:
mcli logs <run-name> --global-gpu-rank x
`update_metric` by @maxisawesome in https://github.com/mosaicml/composer/pull/2965
Full Changelog: https://github.com/mosaicml/composer/compare/v0.19.1...v0.20.0
Published by milocress 9 months ago
1. New Event: BEFORE_LOAD (#2974)
Composer now has the event `Event.BEFORE_LOAD`, which lets users modify state before a model is loaded. This is particularly useful for accessing certain attributes which may not exist at `Event.INIT`, such as the dataloader state.
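As a minimal sketch, a custom callback can hook the new event by defining a `before_load` method; the callback below is hypothetical and only prints the dataloader:
from composer.core import Callback, State
from composer.loggers import Logger

class DataloaderInspector(Callback):
    """Hypothetical callback that inspects state right before a checkpoint is loaded."""

    def before_load(self, state: State, logger: Logger) -> None:
        # Runs on Event.BEFORE_LOAD, after the dataloader has been set on state.
        print(f'Train dataloader before load: {state.train_dataloader}')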
2. Registering model in MLFlow with run id (#2967)
The MLFlow logger now has `register_model_with_run_id`, which allows users to register a model based on the run ID. This is a different way of registering the model which preserves the link to the MLflow runs.
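A sketch of how this might be used; the `model_uri` and `name` argument names are assumptions about the exact signature, so check the MLFlowLogger docs before relying on them:
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(experiment_name='my-first-project')

# After training, register the saved model against the logger's active run.
# NOTE: argument names below are assumed for illustration.
mlflow_logger.register_model_with_run_id(
    model_uri='dbfs:/path/to/saved/model',
    name='my-registered-model',
)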
Full Changelog: https://github.com/mosaicml/composer/compare/v0.19.0...v0.19.1
Published by j316chuck 9 months ago
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
Composer now has improved communication/computation overlap in our FSDP code which should improve MFU across several architectures.
Initial support for Python 3.11 + Torch 2.2 has been added to Composer.
PEFT LoRA is now supported in the HuggingFaceModel class (see the sketch below).
`in_context_learning_evaluation.py` has a new design with cleaner abstractions and easier interfaces to work with.
Composer now supports saving your model in Azure.
Composer now supports saving your model in MLFlow.
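Regarding the PEFT LoRA support mentioned above, here is a minimal sketch of wrapping a Hugging Face model with a LoRA config; the `peft_config` argument to `HuggingFaceModel` is an assumption about the interface, so consult the docs for the exact usage:
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer.models import HuggingFaceModel

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# NOTE: passing a LoRA config directly is assumed here; alternatively, wrap the model
# with peft.get_peft_model(...) before constructing HuggingFaceModel.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=['c_attn'], task_type='CAUSAL_LM')
composer_model = HuggingFaceModel(model, tokenizer=tokenizer, peft_config=lora_config)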
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.2...v0.19.0
Published by b-chu 9 months ago
Full Changelog: https://github.com/mosaicml/composer/compare/v0.18.1...v0.18.2
Published by b-chu 9 months ago
Full Changelog: https://github.com/mosaicml/composer/compare/v0.18.0...v0.18.1
Published by b-chu 9 months ago
This release has been yanked, please skip directly to Composer v0.18.1
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.2...v0.18.0
Published by irenedea 9 months ago
Enables elastic saving and loading of DTensors at various mesh sizes.
Artifacts, such as checkpoints, can now be logged to Databricks-managed MLFlow.
composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    ...
)
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.2...v0.18.0
Published by mvpatel2000 10 months ago
1. Torch 2.1.1 Support
Composer now supports torch 2.1.1! This new release primarily fixes several small bugs that we had previously monkeypatched in Composer.
2. Faster OCI Upload/Download
Composer now supports multi-part upload/download to OCI, which should speed up object store transfers.
3. Memory Profiling
We've expanded the torch profiler integration to support memory profiling. Now, when the profiler is enabled, you will get a trace showing how memory utilization is broken down by various components on your GPUs.
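A minimal sketch of enabling the profiler with a memory trace; the `torch_prof_memory_filename` argument name is an assumption based on this release's profiler additions, so verify the exact parameters in the profiler docs (`model` and `train_dataloader` are placeholders):
from composer import Trainer
from composer.profiler import JSONTraceHandler, Profiler, cyclic_schedule

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='2ba',
    profiler=Profiler(
        trace_handlers=[JSONTraceHandler(folder='composer_traces', overwrite=True)],
        schedule=cyclic_schedule(wait=0, warmup=1, active=4, repeat=1),
        torch_prof_folder='torch_traces',
        torch_prof_overwrite=True,
        # NOTE: parameter name assumed; writes a memory timeline alongside the torch trace.
        torch_prof_memory_filename='rank{rank}.memory.html',
    ),
)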
1. FSDP Initialization with Meta
Previously, our FSDP integration had a bug with initializing weights when using `device=meta`, which resulted in an additional scaling. This has now been fixed, so `device` and distributed strategies should not affect parallelization strategy.
Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.1...v0.17.2