composer

Supercharge Your Model Training

composer - v0.17.1

Published by mvpatel2000 11 months ago

Bug Fixes

1. MosaicML Logger Robustness (https://github.com/mosaicml/composer/pull/2728)

We've improved the MosaicML logger to be more robust to faulty serialization.

What's Changed

Full Changelog: https://github.com/mosaicml/composer/compare/v0.17.0...v0.17.1

composer - v0.17.0

Published by mvpatel2000 11 months ago

What's New

1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)

Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this will shard a model within a node and replicate across nodes, but Composer will accept a tuple of process groups to specify custom shard/replicate sizes. This can be specified in the FSDP config.

  composer_model = MyComposerModel(n_layers=3)

  fsdp_config = {
      'sharding_strategy': 'HYBRID_SHARD',
  }

  trainer = Trainer(
      model=composer_model,
      max_duration='4ba',
      fsdp_config=fsdp_config,
      ...
  )

HYBRID_SHARD applies FULL_SHARD within each shard block, whereas _HYBRID_SHARD_ZERO2 applies SHARD_GRAD_OP within each shard block.
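
For example, a minimal sketch of the ZeRO-2-style variant (assuming the strategy string is passed the same way as HYBRID_SHARD):

  fsdp_config = {
      'sharding_strategy': '_HYBRID_SHARD_ZERO2',
  }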

2. Train Loss NaN Monitor (#2704)

Composer has a new callback that raises a ValueError if your loss goes NaN. This is very useful for avoiding wasted compute if your training run diverges or fails for numerical reasons.

  from composer.callbacks import NaNMonitor

  composer_model = MyComposerModel(n_layers=3)

  trainer = Trainer(
      model=composer_model,
      max_duration='4ba',
      callbacks=NaNMonitor(),
      ...
  )

Bug Fixes

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.16.4...v0.17.0

composer - v0.16.4

Published by mvpatel2000 about 1 year ago

What's New

1. Torch 2.1 Support

Composer officially supports PyTorch 2.1! We support several new features from 2.1, including CustomPolicy, which supports granular wrapping with FSDP.
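
CustomPolicy is a PyTorch 2.1 API from torch.distributed.fsdp.wrap. As a rough illustration of what it enables at the torch level (MyBlock is a hypothetical module class, and wiring the policy into Composer's FSDP config is not shown here):

import torch.nn as nn
from torch.distributed.fsdp.wrap import CustomPolicy

class MyBlock(nn.Module):  # hypothetical transformer block
    pass

# Wrap each MyBlock instance as its own FSDP unit
policy = CustomPolicy(lambda module: isinstance(module, MyBlock))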

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.16.3...v0.16.4

composer - v0.16.3

Published by mvpatel2000 about 1 year ago

What's New

1. Add pass@k for HumanEval

HumanEval now supports pass@k. We also support first-class integration with the MosaicML platform for secure code evaluation.
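
For context, pass@k is typically computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator (not Composer's exact implementation):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimate: n generations per problem, c of them correct
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)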

2. log_model with MLFlow

The MLFlow integration now supports log_model at the end of the run.
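
For reference, the underlying MLflow call looks roughly like the following; this is a sketch using the raw mlflow API directly rather than the Composer integration, with a stand-in model:

import mlflow
import mlflow.pytorch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for your trained model

with mlflow.start_run():
    mlflow.pytorch.log_model(model, artifact_path='model')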

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.16.2...v0.16.3

composer - v0.16.2

Published by mvpatel2000 about 1 year ago

What's New

1. PyTorch Nightly Support

Composer now supports PyTorch Nightly and CUDA 12! Along with new Docker images based on nightly PyTorch versions and release candidates, we've updated our PyTorch monkeypatches to support the latest version of PyTorch. These monkeypatches add additional functionality for finer-grained FSDP wrapping and patch bugs related to sharded checkpoints. We are in the process of upstreaming these changes into PyTorch.

Bug Fixes

1. MosaicML Logger Robustness

The MosaicML logger is now robust to platform timeouts and other errors. Additionally, it can be disabled when training on the MosaicML platform by setting the environment variable MOSAICML_PLATFORM to 'False'.
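
For example, a minimal sketch of opting out; the variable just needs to be set in the process environment before training starts:

import os

# Opt out of the MosaicML platform integration
os.environ['MOSAICML_PLATFORM'] = 'False'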

2. GCS Integration

GCS authentication is now supported with HMAC keys, patching a bug in the previous implementation.

3. Optimizer Monitor Norm Calculation (https://github.com/mosaicml/composer/pull/2531)

Previously, the optimizer monitor incorrectly reduced norms across GPUs. It now correctly computes norms in a distributed setting.

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.16.1...v0.16.2

composer - v0.16.1

Published by mvpatel2000 about 1 year ago

New Features

1. HPU (Habana Gaudi) Support (https://github.com/mosaicml/composer/pull/2444)

Composer now supports Habana Gaudi chips! To enable HPUs, device needs to be specified as 'hpu':

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    device='hpu',
    ...
)

2. Generate Callback (https://github.com/mosaicml/composer/pull/2449)

We've added a new callback which runs generate on a language model at a given frequency to visualize outputs:

from composer.callbacks import Generate

composer_model = MyComposerModel(n_layers=3)
generate_callback = Generate(prompts=['How good is my model?'], interval='5ba')

trainer = Trainer(
    model=composer_model,
    callbacks = generate_callback,
    ...
)

Bug Fixes

1. Checkpoint Fixes

Elastic sharded checkpointing now disables torchmetrics saving to avoid issues with torchmetrics tensors being sharded. Additionally, checkpointing now falls back on the old path, which does not convert torchmetrics tensors to numpy. Checkpointing also no longer materializes optimizer state when saving weights only.

2. MLFlow Performance Improvements

MLFlow integration has significant performance improvements in logging frequency and system metrics collected.

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.16.0...v0.16.1

composer - v0.16.0

Published by mvpatel2000 about 1 year ago

What's New

1. New Events (#2264)

Composer now has the events EVAL_BEFORE_ALL and EVAL_AFTER_ALL, which let users control logging of bespoke evaluation information across all evaluators.
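
As a sketch of how these events might be used (assuming that, like other Composer events, they are exposed as snake_case Callback hooks):

from composer.core import Callback, State
from composer.loggers import Logger

class EvalBoundaryLogger(Callback):
    # Hypothetical callback illustrating the new events
    def eval_before_all(self, state: State, logger: Logger) -> None:
        logger.log_metrics({'eval/all_evaluators_started': 1})

    def eval_after_all(self, state: State, logger: Logger) -> None:
        logger.log_metrics({'eval/all_evaluators_finished': 1})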

2. Elastic Sharded Checkpointing

Traditionally, checkpoints are stored as giant monoliths. For large model training, moving the entire model to one node may be infeasible, and writing one large file from one node may be slow. Composer now supports elastic sharded checkpoints with FSDP, where every rank writes a single shard of the checkpoint. This checkpointing strategy is elastic: even if you resume on a different number of GPUs, Composer will handle resumption. To enable sharded checkpointing, specify 'state_dict_type': 'sharded' in the FSDP config:

composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'sharded',
    'sharded_ckpt_prefix_dir': 'ba{batch}-shards' # will save each set of shards checkpoint to a unique folder based on batch
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_interval='2ba',
    ...
)

See the docs for more information on how to integrate this with your project.

Bug Fixes

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.15.0...v0.16.0

composer - v0.15.1

Published by dakinggg over 1 year ago

Bug Fixes

This is a patch release that mainly fixes a bug related to autoresume and changes the default to offload_to_cpu for sharded checkpoints with PyTorch versions >2.

What's Changed

Full Changelog: https://github.com/mosaicml/composer/compare/v0.15.0...v0.15.1

composer - v0.15.0

Published by mvpatel2000 over 1 year ago

🚀 Composer v0.15.0

What's New

  1. Exact Eval (https://github.com/mosaicml/composer/pull/2218)

    Composer now supports exact evaluation! Evaluation now gives exactly the same results regardless of the number of GPUs, by removing any duplicated samples from the dataloader.

  2. Monolithic Checkpoint Loading (https://github.com/mosaicml/composer/pull/2288)

    When training large models, loading the model and optimizer on every rank can use up all the system memory. With FSDP, Composer can now load the model and optimizer on only rank 0 and broadcast it to all other ranks. To enable:

    from composer import Trainer
    
    # Construct Trainer
    trainer = Trainer(
       ...,
       fsdp_config={
          'load_monolith_rank0_only': True
       },
    )
    
    # Train!
    trainer.fit()
    

    and ensure the model on rank 0 is on CPU/GPU (as opposed to meta).

  3. Spin Dataloaders

    By default, Composer spins dataloaders back to the current timestamp to ensure deterministic resumption. However, dataloader spinning can be very slow, so Trainer now has a new flag to disable spinning if determinism is not required. To enable:

    from composer import Trainer
    
    # Construct Trainer
    trainer = Trainer(
       ...,
       spin_dataloaders=False,
    )
    
    # Train!
    trainer.fit()
    

Deprecations

  • HealthChecker is now deprecated and will be removed in v0.17.0

Bug Fixes

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.14.1...v0.15.0

composer - v0.14.1

Published by mvpatel2000 over 1 year ago

Bug Fixes

Fixes a bug related to SentencePiece tokenizers and ICL eval.

What's Changed

Full Changelog: https://github.com/mosaicml/composer/compare/v0.14.0...v0.14.1

composer - v0.14.0

Published by bandish-shah over 1 year ago

🚀 Composer v0.14.0

Composer v0.14.0 is released! Install via pip:

pip install composer==0.14.0

The legacy package name still works via pip:

pip install mosaicml==0.14.0

New Features

  1. 🆕 PyTorch 2.0 Support (#2172)

    We're thrilled to announce official support for PyTorch 2.0! All initial unit tests pass, and we've run through our examples. We've also made some updates to start taking advantage of all the great new features.

    Initial support also includes:

    • Support for torch.compile

      Model         Dataset    Throughput without compile (samples/sec)    Throughput with compile (samples/sec)    Performance %
      ResNet50      ImageNet   5557                                         7424                                      33.60%
      DeepLab V3    ADE20K     81.60                                        98.82                                     21.10%
      HF BERT       C4         3360                                         4259                                      26.75%
      HF Causal LM  C4         50.61                                        103.29                                    100.05%

      To start using it, simply add the compile_config argument to the Trainer:

        # To use default `torch.compile` config
        trainer = Trainer(
           ...,
           compile_config={},
        )
      
        # To use custom `torch.compile` config, provide an argument as a dictionary, for example:
        trainer = Trainer(
           ...,
           compile_config={'mode': 'reduce-overhead'},
        )
        
      

      The Trainer also supports pre-compiled models passed via the model argument. If the model has been pre-compiled, the compile_config argument is ignored if provided.

      Note: We recommend baselining your model with and without torch.compile, as there are scenarios where enabling compile does not yield any throughput improvements and in some cases can even lead to a regression.

    • PyTorch 2.0 Docker Images

      We've added the following new official MosaicML Docker Images with PyTorch 2.0 support:

      Linux Distro   Flavor   PyTorch Version   CUDA Version          Python Version   Docker Tags
      Ubuntu 20.04   Base     2.0.0             11.7.1 (Infiniband)   3.10             mosaicml/pytorch:2.0.0_cu117-python3.10-ubuntu20.04
      Ubuntu 20.04   Base     2.0.0             11.7.1 (EFA)          3.10             mosaicml/pytorch:2.0.0_cu117-python3.10-ubuntu20.04-aws
      Ubuntu 20.04   Base     2.0.0             cpu                   3.10             mosaicml/pytorch:2.0.0_cpu-python3.10-ubuntu20.04
      Ubuntu 20.04   Vision   2.0.0             11.7.1 (Infiniband)   3.10             mosaicml/pytorch_vision:2.0.0_cu117-python3.10-ubuntu20.04
      Ubuntu 20.04   Vision   2.0.0             cpu                   3.10             mosaicml/pytorch_vision:2.0.0_cpu-python3.10-ubuntu20.04
  2. 🦾 New Callbacks

    • Activation monitor (#2066)

      Monitors activations in the network. Every interval batches, it attaches a forward hook and logs the max, average, L2 norm, and kurtosis of the input and output activations. To enable:

      from composer import Trainer
      from composer.callbacks import ActivationMonitor
      
      # Construct Trainer
      trainer = Trainer(
         ...,
         callbacks=[ActivationMonitor()],
      )
      
      # Train!
      trainer.fit()
      
    • Slack Logger (#2133)

      You can now send custom training metrics using Slack! To enable:

      from composer import Trainer
      from composer.loggers import SlackLogger
      
      trainer = Trainer(
         ...
         loggers=[
             SlackLogger(
                 log_interval="10ba", # or 1ep, 2ep 
                 include_keys=["algorithm_traces*", "loss*"],
                 formatter_func=(lambda data, **kwargs:
                    [
                        {
                            "type": "section", "text": {"type": "mrkdwn", "text": f"*{k}:* {v}"}
                        }
                        for k, v in data.items()
                    ])
             )
         ],
      )
      
      trainer.fit()
      

      Please see PR #2133 for additional details.

API changes

  • The grad_accum argument has been removed from Trainer; users are now required to use device_train_microbatch_size instead (#2040)

Deprecations

  • We no longer support PyTorch 1.11 and 1.12 due to security vulnerabilities. New features will not be tested against these versions.

Bug Fixes

  • Eval subset num batches bug fix (#2028)
  • Protect for missing slack_sdk import (#2031)
  • Adjust HuggingFaceModel token embedding resizing to only occur when necessary (#2027)
  • Update FSDP meta weight tying tests to include precision testing (#2050)
  • Backward Compat with Torchmetrics (#2046)
  • Busy wait for local rank 0 download to avoid timeout on large file download (#2054)
  • Fix OCIObjectStore save_overwrite=False bug (#2053)
  • Busy wait so that non local rank zeros don't timeout while local rank zero downloads a monolithic checkpoint (#2071)
  • Skip extra downloads when not using a format string (#2073)
  • fix name_or_path usage in HF save/load usage (#2075)
  • Fix EMA resumption issue with calling trainer.eval() before trainer.fit() (#2088)
  • Patch EMA with FSDP (#2091)
  • Updating gradient clipping to be torch 2.0 compatible (#2089)
  • Adding checks for weight tying s.t. we don't think None attributes are weight tied (#2103)
  • gate the extra forward call specifically for fsdp (#2102)
  • Allow user to set ONNX opset version when Exporting for Inference (#2101)
  • Runtime estimator (#2124)
  • Use state_dict Torchmetrics Serialization (#2116)
  • Fix filelock in checkpoint download (#2184)

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.13.5...v0.14.0

composer - v0.13.5

Published by mvpatel2000 over 1 year ago

Full Changelog: https://github.com/mosaicml/composer/compare/v0.13.4...v0.13.5

  • Add support for EMA + FSDP
composer - v0.13.4

Published by mvpatel2000 over 1 year ago

Full Changelog: https://github.com/mosaicml/composer/compare/v0.13.3...v0.13.4

Bumps streaming version pin to <1.0

composer - v0.13.3

Published by bandish-shah over 1 year ago

🚀 Composer v0.13.3

Introducing the composer PyPI package!

Composer v0.13.3 is released!

Composer can now also be installed using the new composer PyPI package via pip:

pip install composer==0.13.3

The legacy package name still works via pip:

pip install mosaicml==0.13.3

Bug Fixes

  • add sentencepiece support by @dakinggg in #2093

What's Changed

  • Bump version to 0.13.3 by @bandish-shah in #2115
  • add missing import by @dakinggg in #2113
  • add sentencepiece support by @dakinggg in #2093
  • Pin mcli version until API change is resolved by @dakinggg in #2111

Full Changelog: https://github.com/mosaicml/composer/compare/v0.13.2...v0.13.3

composer - v0.13.2

Published by bandish-shah over 1 year ago

🚀 Composer v0.13.2

Introducing the composer PyPI package!

Composer v0.13.2 is released!

Composer can now also be installed using the new composer PyPI package via pip:

pip install composer==0.13.2

The legacy package name still works via pip:

pip install mosaicml==0.13.2

Bug Fixes

  • test and fix composer package name usage in composer_collect_env (#2049)
  • Backward Compat with Torchmetrics by @mvpatel2000 (#2046)
  • Fix OCIObjectStore save_overwrite=False bug (#2053)
  • busy wait for the rank 0 download (#2071)
  • Skip extra downloads when not using a format string (#2073)

What's Changed

  • Pin transformers package to <4.27 by @dakinggg in #2076
  • Bump version to v0.13.2 (#2068) by @bandish-shah
  • Skip extra downloads when not using a format string by @dakinggg in #2073
  • add support for autoresume + FSDP + sharding by @dakinggg in #2072
  • busy wait for the rank 0 download by @dakinggg in #2071
  • Revert "Checkpoints Simplified (#2059)" by @dakinggg in #2070
  • Add device and dtype back to LPLayerNorm (#2067) by @abhi-mosaic
  • Checkpoints Simplified by @mvpatel2000 in #2059
  • Allow LPLayerNorm and LPGroupNorm to support self.bias or self.weight = None (#2044) by @abhi-mosaic
  • Add NO_REENTRANT activation checkpointing (#2042) by @bmosaicml
  • pin torchmetrics by @mvpatel2000 in #2065
  • Update docs with non-rank zero logs instructions by @hanlint in #2058
  • Fix OCIObjectStore save_overwrite=False bug by @eracah in #2053
  • Busy wait for local rank 0 download to avoid timeout on large file download by @dakinggg in #2054
  • Raise error if attempting to export FSDP model by @hanlint in #2051
  • Revert "Checkpoints Simplified (#2041)" by @dakinggg in #2056
  • Delete composer package GPU workflow by @dakinggg in #2055
  • Add composer PyPI package tests to daily workflow (#2052) by @bandish-shah
  • Checkpoints Simplified by @mvpatel2000 in #2041
  • update fsdp mixed precision by @vchiley in #2047
  • Backward Compat with Torchmetrics by @mvpatel2000 in #2046
  • Update FSDP meta weight tying tests to include precision testing by @bcui19 in #2050
  • Log nodename information in composer by @eracah in #2043
  • test and fix composer package name usage in composer_collect_env by @dakinggg in #2049
  • Adjust how HuggingFaceModel handles embedding resizing by @dakinggg in #2027
  • Adds a PR guidelines section to contributing.md by @dakinggg in #1993
  • Bump pypandoc from 1.10 to 1.11 (#2038) by @dependabot[bot]
  • Bump pytest from 7.2.1 to 7.2.2 (#2039) by @dependabot[bot]
  • Use follow in mcp script by @mvpatel2000 in #2022

Full Changelog: https://github.com/mosaicml/composer/compare/v0.13.1...v0.13.2

composer - v0.13.1

Published by bandish-shah over 1 year ago

🚀 Composer v0.13.1

Introducing the composer PyPI package!

Composer v0.13.1 is released!

Composer can now also be installed using the new composer PyPI package via pip:

pip install composer==0.13.1

The legacy package name still works via pip:

pip install mosaicml==0.13.1

Note: The mosaicml==0.13.0 PyPI package was yanked due to some minor packaging issues discovered after release. The package was re-released as Composer v0.13.1; thus, these release notes contain details for both v0.13.0 and v0.13.1.

New Features

  1. 🤙 New and Updated Callbacks

    • New HealthChecker Callback (#2002)

      The callback will log a warning if the GPUs on a given node appear to be in poor health (low utilization). The callback can also be configured to send a Slack message!

      from composer import Trainer
      from composer.callbacks import HealthChecker
      
      # Warn if GPU utilization difference drops below 10%
      health_checker = HealthChecker(
          threshold = 10
      )
      
      # Construct Trainer
      trainer = Trainer(
          ...,
          callbacks=health_checker,
      )
      
      # Train!
      trainer.fit()
      
    • Updated MemoryMonitor to use GigaBytes (GB) units (#1940)

    • New RuntimeEstimator Callback (#1991)

      Estimate the remaining runtime of your job! Approximates the time remaining by observing the throughput and comparing to the number of batches remaining.

      from composer import Trainer
      from composer.callbacks import RuntimeEstimator
      
      # Construct trainer with RuntimeEstimator callback
      trainer = Trainer(
          ...,
          callbacks=RuntimeEstimator(),
      )
      
      # Train!
      trainer.fit()
      
    • Updated SpeedMonitor throughput metrics (#1987)

      Expands throughput metrics to track relative to several different time units and per device:

      • throughput/batches_per_sec and throughput/device/batches_per_sec
      • throughput/tokens_per_sec and throughput/device/tokens_per_sec
      • throughput/flops_per_sec and throughput/device/flops_per_sec
      • throughput/device/samples_per_sec

      Also adds throughput/device/mfu metric to compute per device MFU. Simply enable the SpeedMonitor callback per usual to log these new metrics! Please see SpeedMonitor documentation for more information.
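
      A minimal sketch of enabling it (enabling the callback is unchanged; only the logged metrics are new):

      from composer import Trainer
      from composer.callbacks import SpeedMonitor

      # Construct Trainer with the SpeedMonitor callback to log the expanded throughput metrics
      trainer = Trainer(
          ...,
          callbacks=SpeedMonitor(),
      )

      # Train!
      trainer.fit()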

  2. ⣿ FSDP Sharded Checkpoints (#1902)

    Users can now specify the state_dict_type in the fsdp_config dictionary to enable sharded checkpoints. For example:

    from composer import Trainer
    
    fsdp_config = {
        'sharding_strategy': 'FULL_SHARD',
        'state_dict_type': 'local',
    }
    
    trainer = Trainer(
        ...,
        fsdp_config=fsdp_config,
        save_folder='checkpoints',
        save_filename='ba{batch}_rank{rank}.pt',
        save_interval='10ba',
    )
    

    Please see the PyTorch FSDP docs and Composer's Distributed Training notes for more information.

  3. 🤗 HuggingFace Improvements

    • Update HuggingFaceModel class to support encoder-decoder batches without decoder_input_ids (#1950)
    • Allow evaluation metrics to be passed to HuggingFaceModel directly (#1971)
    • Add a utility function to load a Composer checkpoint of a HuggingFaceModel and write out the expected config.json and pytorch_model.bin in the HuggingFace pretrained folder (#1974)
  4. 🛟 Nvidia H100 Alpha Support - Added amp_fp8 data type

    In preparation for H100's arrival, we've added the amp_fp8 precision type. Currently setting amp_fp8 specifies a new precision context using transformer_engine.pytorch.fp8_autocast. For more details, please see Nvidia's new Transformer Engine and the specific fp8 recipe we utilize.

    from composer import Trainer
    
    trainer = Trainer(
        ...,
        precision='amp_fp8',
    )
    

API changes

  • The torchmetrics package has been upgraded to 0.11.x.

    The torchmetrics.Accuracy metric now requires a task argument which can take on a value of binary, multiclass or multilabel. Please see Torchmetrics Accuracy docs for details.

    Additionally, since specifying task='multiclass' requires an additional num_classes field, we've updated ComposerClassifier to accept a num_classes argument. Please see PRs #2017 and #2025 for additional details.
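
    For example, constructing the metric under torchmetrics 0.11 now looks like the following (num_classes=10 is just an illustrative value):

    from torchmetrics import Accuracy

    # torchmetrics >= 0.11: the task argument is required; 'multiclass' also requires num_classes
    accuracy = Accuracy(task='multiclass', num_classes=10)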

  • Surgery algorithms used in functional form return a value of None (#1543)

Deprecations

  • Deprecate HFCrossEntropy and Perplexity (#1857)
  • Remove Jenkins CI (#1943, #1954)
  • Change Deprecation Warnings to Warnings for specifying ProgressBarLogger and ConsoleLogger to loggers (#1846)

Bug Fixes

  • Fixed an issue introduced in 0.12.1 where HuggingFaceModel crashes if config.return_dict = False (#1948)
  • Refactor EMA to improve memory efficiency (#1941)
  • Make wandb checkpoint logging compatible with wandb model registry (#1973)
  • Fix ICL race conditions (#1978)
  • Update epoch metric name to trainer/epoch (#1986)
  • reset scaler (#1999)
  • Bug/sync optimization logger across ranks (#1970)
  • Update Docker images to resolve vulnerability scan issues (#2007)
  • Fix eval duplicate logging issue (#2018)
  • extend test and patch bug (#2028)
  • Protect for missing slack_sdk import (#2031)

Known Issues

  • Docker Image Security Vulnerability
    • CVE-2022-45907: The mosaicml/pytorch:1.12.1*, mosaicml/pytorch:1.11.0*, mosaicml/pytorch_vision:1.12.1* and mosaicml/pytorch_vision:1.11.0* images are impacted and currently supported for legacy use cases. We recommend users upgrade to images with PyTorch >1.13. The affected images will be removed in the next Composer release.

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.12.1...v0.13.1

composer - v0.13.0

Published by bandish-shah over 1 year ago

This release has been yanked due to a minor packaging issue; please skip directly to Composer v0.13.1.

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.12.1...v0.13.0

composer - v0.12.1

Published by bandish-shah over 1 year ago

🚀 Composer v0.12.1

Composer v0.12.1 is released! Install via pip:

pip install --upgrade mosaicml==0.12.1

New Features

  1. 📚 In-Context Learning (#1876)

    With Composer and MosaicML Cloud you can now evaluate LLMs on in-context learning tasks (LAMBADA, HellaSwag, PIQA, and more) hundreds of times faster than other evaluation harnesses. Please see our "Blazingly Fast LLM Evaluation for In-Context Learning" blog post for more details!

  2. 💾 Added support for Coreweave Object Storage (#1915)

    Coreweave object store is compatible with boto3. Uploading objects to Coreweave object store is almost exactly like writing to S3, except an endpoint_url must be set via the S3_ENDPOINT_URL environment variable. For example:

    import os
    os.environ['S3_ENDPOINT_URL'] = 'https://object.las1.coreweave.com'
    
    from composer.trainer import Trainer
    
    # Save checkpoints every epoch to s3://my_bucket/checkpoints
    trainer = Trainer(
        model=model,
        train_dataloader=train_dataloader,
        max_duration='10ep',
        save_folder='s3://my_bucket/checkpoints',
        save_interval='1ep',
        save_overwrite=True,
        save_filename='ep{epoch}.pt',
        save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
    )

    trainer.fit()
    

    Please see our checkpointing documentation for more details.

  3. 🪵 Automatic logging of Trainer hparams (#1855)

    Hyperparameter arguments passed to the Trainer are now automatically logged. Simply set the Trainer argument auto_log_hparams=True.
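
    For example (only auto_log_hparams is new here; other Trainer arguments are elided):

    from composer import Trainer

    # Log all Trainer hyperparameters automatically
    trainer = Trainer(
        ...,
        auto_log_hparams=True,
    )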

Bug Fixes

  • Update Docker images to use ‘posix_prefix’ paths (#1854)
  • Disable new notebook in CI (#1875)
  • [Fix] Enable logging of metrics from Callbacks to ConsoleLogging (#1884)
  • Ensure loggers run init event before callbacks in Engine (#1890)
  • Raise an error in FSDP meta tensor initialization if there's no initialization functions, fix associated flaky FSDP test (#1905)
  • Add primitive list support (#1906)
  • Add logic for shifting labels before computing metrics (#1913)
  • Fixes mis-specified dependency (#1919)
  • pin setuptools in build requirements (#1926)
  • Pin pip<23 in Docker images (#1936)
  • Fix bug in trainer.eval and add test cases for test_console_logger (#1937)

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.12.0...v0.12.1

composer - v0.12.0

Published by bandish-shah almost 2 years ago

🚀 Composer v0.12.0

Composer v0.12.0 is released! Install via pip:

pip install mosaicml==0.12.0

New Features

  1. 🪵 Logging and ObjectStore Enhancements

    There are multiple improvements to our logging and object store support in this release.

    • Image visualization using our CometMLLogger (#1710)

      We've added support for using our ImageVisualizer callback with CometML to log images and segmentation masks to CometML.

      from composer.trainer import Trainer
      
      trainer = Trainer(...,
          callbacks=[ImageVisualizer()],
          loggers=[CometMLLogger()]
      )
      
    • Added direct support for Oracle Cloud Infrastructure (OCI) as an ObjectStore (#1774) and support for Google Cloud Storage (GCS) via URI (#1833)

      To use, you can simply set your save_folder or load_path to a URI beginning with oci:// or gs://, to save and load with OCI and GCS respectively.

      from composer.trainer import Trainer
      
      # Checkpoint saving to Google Cloud Storage.
      trainer = Trainer(
          model=model,
          save_folder="gs://my-bucket/{run_name}/checkpoints",
          run_name='my-run',
          save_interval="1ep",
          save_filename="ep{epoch}.pt",
          save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
          ...
      )
      
      trainer.fit()
      
    • Added basic support for logging with MLFlow (#1795)

      We've added basic support for using MLFlow to log experiment metrics.

      from composer.loggers import MLFlowLogger
      from composer.trainer import Trainer
      
      mlflow_logger = MLFlowLogger(experiment_name=mlflow_exp_name,
                                   run_name=mlflow_run_name,
                                   tracking_uri=mlflow_uri)
      trainer = Trainer(..., loggers=[mlflow_logger])
      
    • Simplified console and progress bar logging (#1694)

      To turn off the progress bar, set progress_bar=False. To turn on logging directly to the console, set log_to_console=True. To control the frequency of logging to console, set console_log_interval (e.g. to 1ep or 1ba).
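
      For example, to turn off the progress bar and log metrics to the console once per batch:

      from composer.trainer import Trainer

      # Log metrics directly to the console every batch instead of showing a progress bar
      trainer = Trainer(
          ...,
          progress_bar=False,
          log_to_console=True,
          console_log_interval='1ba',
      )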

    • get_file supports URIs (#1750)

      Our get_file utility now supports URIs directly (s3://, oci://, and gs://) for downloading files.

  2. 🏃‍♀️ Support for Mid-Epoch Resumption with the latest release of Streaming

    We've added support in Composer for the latest release of our Streaming library. This includes awesome new features like instant mid-epoch resumption and deterministic shuffling, regardless of the number of nodes. See the Streaming release notes for more!

  3. 🚨 New algorithm - GyroDropout!

    Thanks to @jelite for adding a new algorithm, GyroDropout to Composer! Please see the method card for more details.

  4. 🤗 HuggingFace + Composer improvements

    We've added a new utility to load a 🤗 HuggingFace model and tokenizer out of a Composer checkpoint (#1754), making the pretraining -> finetuning workflow even easier in Composer. Check out the docs for more details, and our example notebook for a full tutorial (#1775)!

  5. 🎓 GradMonitor -> OptimizerMonitor

    Renames our GradMonitor callback to OptimizerMonitor, and adds the ability to track optimizer specific metrics. Check out the docs for more details, and add to your code just like any other callback!

    from composer.callbacks import OptimizerMonitor
    from composer.trainer import Trainer
    
    trainer = Trainer(
        ..., 
        callbacks=[OptimizerMonitor(log_optimizer_metrics=log_optimizer_metrics)]
    )
    
  6. 🐳 New PyTorch and CUDA versions

    We've expanded our library of Docker images with support for PyTorch 1.13 + CUDA 11.7:

    • mosaicml/pytorch:1.13.0_cu117-python3.10-ubuntu20.04
    • mosaicml/pytorch:1.13.0_cpu-python3.10-ubuntu20.04

    The mosaicml/pytorch:latest, mosaicml/pytorch:cpu_latest and mosaicml/composer:0.12.0 tags are now built from PyTorch 1.13 based images. Please see our DockerHub repository for additional details.

API changes

  1. Replace grad_accum with device_train_microbatch_size (#1749, #1776)

    We're deprecating the grad_accum Trainer argument in favor of the more intuitive device_train_microbatch_size. Instead of thinking about how to divide your specified minibatch into microbatches, simply specify the size of your microbatch. For example, let's say you want to split your minibatch of 2048 into two microbatches of 1024:

    from composer import Trainer
    
    trainer = Trainer(
        ...,
        device_train_microbatch_size=1024,
    )
    

    If you want Composer to tune the microbatch for you automatically, enable automatic microbatching as follows:

    from composer import Trainer
    
    trainer = Trainer(
        ...,
        device_train_microbatch_size='auto',
    )
    

    The grad_accum argument is still supported but will be deprecated in the next Composer release.

  2. Renamed precisions (#1761)

    We've renamed precision attributes for clarity. The following values have been removed: ['amp', 'fp16', 'bf16'].

    We have added the following values, prefixed with 'amp' to clarify when an Automatic Mixed Precision type is being used: ['amp_fp16', 'amp_bf16'].

    The fp32 precision value remains unchanged.
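
    For example, a run that previously used precision='bf16' now specifies:

    from composer import Trainer

    # 'bf16' is now requested as 'amp_bf16'
    trainer = Trainer(
        ...,
        precision='amp_bf16',
    )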

Deprecations

  1. Removed support for YAHP (#1512)
  2. Removed COCO and SSD datasets (#1717)
  3. Fully removed Streaming v1 support; please see the mosaicml/streaming project for our next-gen streaming datasets (#1787)
  4. Deprecated FusedLayerNorm algorithm (#1789)
  5. Fully removed grad_clip_norm training argument, please use the GradientClipping algorithm instead (#1768)
  6. Removed data_fit, data_epoch, and data_batch from Logger (#1826)

Bug Fixes

  • Fix FSDP checkpoint strategy (#1734)
  • Fix gradient clipping with FSDP (#1740)
  • Adds more supported FSDP config flags (sync_module_states, forward_prefetch, limit_all_gathers) (#1794)
  • Allow FULL precision with FSDP (#1796)
  • Fix eval_microbatch modification on EVAL_BEFORE_FORWARD event (#1739)
  • Fix algorithm API backwards compatibility in checkpoints (#1741)
  • Fixes a bad None check preventing setting device_id to 0 (#1767)
  • Unregister engine to make cleaning up memory easier (#1769)
  • Fix issue if metric_names is not a list (#1798)
  • Match implementation for list and tensor batch splitting (#1804)
  • Fixes infinite eval issue (#1815)

What's Changed

New Contributors

Full Changelog: https://github.com/mosaicml/composer/compare/v0.11.1...v0.12.0

composer - v0.11.1

Published by bandish-shah almost 2 years ago

🚀 Composer v0.11.1

Composer v0.11.1 is released! Install via pip:

pip install --upgrade mosaicml==0.11.1

Bug Fixes

  • Fixes for Notebooks (#1659)
  • Documentation updates and fixes (#1685, #1696, #1702, #1709)
  • Addressed warnings and speed improvements for Torchmetrics (#1674)
  • Fixes to Gated Linear Units method (#1575, #1689)
  • Set NCCL_ASYNC_ERROR_HANDLING ENV variable in Composer launcher to enable distributed timeout (#1695)
  • Fix epoch count when eval is called before fit (#1697)
  • Constrain PyTorch package versions to avoid unintended upgrades (#1688)
  • Fix Optimizer state sharding issue with FSDP (#1732)
  • Raise ValueError if an evaluation dataloader of infinite length is specified

Full Changelog: https://github.com/mosaicml/composer/compare/v0.11.0...v0.11.1
