sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.

APACHE-2.0 License

Downloads: 153.1K
Stars: 490
Committers: 45

sagemaker-training-toolkit - v4.8.1 (Latest Release)

Published by sagemaker-bot about 1 month ago

Bug Fixes and Other Changes

  • Added p5 as a supported NCCL instance

sagemaker-training-toolkit - v4.8.0

Published by sagemaker-bot 2 months ago

Features

  • Add support for py39 and py310

Bug Fixes and Other Changes

  • fix typo in the command for running unit tests
  • run unit tests sequentially in the release process as well, to prevent coverage conflicts
  • chore: removing unnecessary logging information

sagemaker-training-toolkit - v4.7.4

Published by sagemaker-bot 12 months ago

Bug Fixes and Other Changes

  • update the boto deps to use latest boto

sagemaker-training-toolkit - v4.7.3

Published by sagemaker-bot 12 months ago

Bug Fixes and Other Changes

  • bypass DNS check for studio local exec

sagemaker-training-toolkit - v4.7.2

Published by sagemaker-bot almost 1 year ago

Bug Fixes and Other Changes

  • use smddprun only if it is installed
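
The behavior described in this fix can be illustrated with a minimal sketch. It assumes that "installed" means `smddprun` can be found on `PATH`; the toolkit's actual detection logic is not shown in these notes, and `build_launch_command` is a hypothetical helper, not part of the toolkit's API.

```python
import shutil
from typing import Callable, List, Optional


def build_launch_command(
    train_cmd: List[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Prefix the training command with smddprun only when it is installed.

    Hypothetical helper: detecting smddprun through a PATH lookup is an
    assumption made for this sketch.
    """
    smddprun_path = which("smddprun")
    if smddprun_path is None:
        # smddprun is not available, so run the training command unchanged.
        return list(train_cmd)
    return [smddprun_path, *train_cmd]
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses `shutil.which`.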

sagemaker-training-toolkit - v4.7.1

Published by sagemaker-bot about 1 year ago

Bug Fixes and Other Changes

  • Add NCCL_PROTO=simple environment variable to handle the out-of-order data delivery from EFA
  • fix toolkit build failure
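
As a rough illustration of the `NCCL_PROTO=simple` change, the sketch below adds the variable to a process environment when EFA is in use. The `efa_enabled` flag, the function name, and the choice to leave a user-supplied value untouched are all assumptions; the notes do not say how the toolkit decides when to set the variable.

```python
from typing import Dict, Mapping


def with_efa_nccl_settings(env: Mapping[str, str], efa_enabled: bool) -> Dict[str, str]:
    """Return a copy of env with NCCL_PROTO=simple set when EFA is in use.

    Hypothetical helper: the efa_enabled flag and the decision not to
    override an existing NCCL_PROTO value are assumptions of this sketch.
    """
    updated = dict(env)
    if efa_enabled:
        # setdefault keeps any NCCL_PROTO value that was already chosen.
        updated.setdefault("NCCL_PROTO", "simple")
    return updated
```

A caller could pass the result as the `env` argument of `subprocess.Popen` when launching the training processes.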

sagemaker-training-toolkit - v4.7.0

Published by sagemaker-bot about 1 year ago

Features

  • support codeartifact for installing requirements.txt packages
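
One way to picture this feature is installing `requirements.txt` through a CodeArtifact PyPI endpoint. The sketch below only builds the index URL and the `pip` command; the function names are hypothetical, the authorization token is assumed to be obtained separately, and how the toolkit itself is configured for CodeArtifact is not described in these notes.

```python
import sys
from typing import List


def codeartifact_index_url(
    domain: str, owner: str, region: str, repository: str, token: str
) -> str:
    """Build a pip index URL for a CodeArtifact PyPI repository.

    Hypothetical helper: all arguments are placeholders supplied by the caller.
    """
    return (
        f"https://aws:{token}@{domain}-{owner}.d.codeartifact."
        f"{region}.amazonaws.com/pypi/{repository}/simple/"
    )


def pip_install_command(requirements_path: str, index_url: str) -> List[str]:
    """Return a pip command that installs requirements from the given index."""
    return [
        sys.executable, "-m", "pip", "install",
        "-r", requirements_path,
        "--index-url", index_url,
    ]
```

The returned list can be handed to `subprocess.run(..., check=True)`; because the URL embeds the token, it should not be written to logs.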

sagemaker-training-toolkit - v4.6.1

Published by sagemaker-bot over 1 year ago

Bug Fixes and Other Changes

  • removed unused import statement
  • forgot to run black on torch_distributed.py after updating my comments from last commit
  • Modified my comment on lines 98-103 in torch_distributed.py to comply with formatting standard.
  • Revert "Ran black on entire sagemaker-training-toolkit directory"
  • Ran black on entire sagemaker-training-toolkit directory
  • Ran Black (python formatter) on the files with my code updates (torch_distributed.py and test_torch_distributed.py)
  • Added test for neuron_parallel_compile in test_torch_distributed.py
  • Updated comment syntax based on feedback in pull request as well as added full example of the neuron_parallel_compile command as it would appear in the command line
  • added unit test for neuron_parallel_compile code change
  • Updated torch_distributed.py

sagemaker-training-toolkit - v4.6.0

Published by sagemaker-bot over 1 year ago

Features

  • add smddp exception classes in mpi distribution

sagemaker-training-toolkit - v4.5.0

Published by sagemaker-bot over 1 year ago

Features

  • add NCCL_PROTO, NCCL_ALGO environments for modelparallel jobs

sagemaker-training-toolkit - v4.4.10

Published by sagemaker-bot over 1 year ago

Bug Fixes and Other Changes

  • unpin sagemaker version as the credential issue is fixed

sagemaker-training-toolkit - v4.4.9

Published by sagemaker-bot over 1 year ago

Bug Fixes and Other Changes

  • increase worker waiting time for ORTE proc

sagemaker-training-toolkit - v4.4.8

Published by sagemaker-bot over 1 year ago

Bug Fixes and Other Changes

  • upgrade protobuf version for tensorflow 2.12

sagemaker-training-toolkit - v4.4.7

Published by sagemaker-bot over 1 year ago

Bug Fixes and Other Changes

  • Revert SMDDP collectives feature from smdataparallel runner

sagemaker-training-toolkit - v4.4.6

Published by sagemaker-bot over 1 year ago

sagemaker-training-toolkit - v4.4.5

Published by sagemaker-bot over 1 year ago

sagemaker-training-toolkit - v4.4.4

Published by sagemaker-bot over 1 year ago

Bug Fixes and Other Changes

  • Update libraries for SMDDP collectives validation

sagemaker-training-toolkit - v4.4.3

Published by sagemaker-bot almost 2 years ago

Bug Fixes and Other Changes

  • Upgrade protobuf to prevent conflicts with smdebugger.

sagemaker-training-toolkit - v4.4.2

Published by sagemaker-bot almost 2 years ago

sagemaker-training-toolkit - v4.4.1

Published by sagemaker-bot almost 2 years ago

Bug Fixes and Other Changes

  • Add support for p4de instances; set the FI_EFA_USE_DEVICE_RDMA flag only on p4d and p4de instances.
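
The instance-type rule in this note can be sketched as a simple prefix check. The function name, the `ml.`-prefixed instance type strings, and the flag value `"1"` are assumptions made for illustration; the notes only say that the flag is limited to p4d and p4de instances.

```python
from typing import Dict

# Instance families for which the flag is set, per the release note.
RDMA_INSTANCE_PREFIXES = ("ml.p4d.", "ml.p4de.")


def efa_rdma_env(instance_type: str) -> Dict[str, str]:
    """Return the RDMA-related environment for a given instance type.

    Hypothetical helper: the value "1" and the "ml." naming are assumptions.
    """
    # str.startswith accepts a tuple and matches any of the prefixes.
    if instance_type.startswith(RDMA_INSTANCE_PREFIXES):
        return {"FI_EFA_USE_DEVICE_RDMA": "1"}
    # Other instance types do not get the flag.
    return {}
```

Matching on the trailing dot keeps `ml.p4d.` from also matching `ml.p4de.`, so each family is listed explicitly.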