sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.

APACHE-2.0 License

Downloads: 153.1K
Stars: 490
Committers: 45

sagemaker-training-toolkit - v4.4.0

Published by sagemaker-bot almost 2 years ago

Features

  • integrate SMDDP collectives into smdataparallel runner

sagemaker-training-toolkit - v4.4.0

Published by sagemaker-bot almost 2 years ago

Bug Fixes and Other Changes

  • add general exception to filter

sagemaker-training-toolkit - v4.3.1

Published by sagemaker-bot almost 2 years ago

Bug Fixes and Other Changes

  • integrate upcoming dataparallel change to modelparallel
  • add unit tests for torchrun launcher and collections package DeprecationWarning

sagemaker-training-toolkit - v4.3.0

Published by sagemaker-bot almost 2 years ago

Features

  • Add torch_distributed support for Trainium instances in SageMaker

sagemaker-training-toolkit - v4.2.10

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • feature: Add neuron cores support (#21)

sagemaker-training-toolkit - v4.2.9

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • Add SageMaker Debugger exceptions

sagemaker-training-toolkit - v4.2.8

Published by sagemaker-bot about 2 years ago

sagemaker-training-toolkit - v4.2.7

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • improve worker node wait logic and update EFA flags

sagemaker-training-toolkit - v4.2.6

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • Enable PT XLA distributed training on homogeneous clusters

sagemaker-training-toolkit - v4.2.5

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • relax exception type

sagemaker-training-toolkit - v4.2.4

Published by sagemaker-bot about 2 years ago

sagemaker-training-toolkit - v4.2.3

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • update num_processes_per_host for smdataparallel runner

sagemaker-training-toolkit - v4.2.2

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • Removed version hardcoding for sagemaker test dependency
  • update distribution_instance_group for pytorch ddp
  • specify flake8 config explicitly

sagemaker-training-toolkit - v4.2.1

Published by sagemaker-bot about 2 years ago

Bug Fixes and Other Changes

  • handle utf-8 decoding exceptions while processing stdout and stderr streams

sagemaker-training-toolkit - v4.2.0

Published by sagemaker-bot over 2 years ago

Features

  • Heterogeneous cluster changes

sagemaker-training-toolkit - v4.1.6

Published by sagemaker-bot over 2 years ago

Bug Fixes and Other Changes

  • update: protobuf version to overlap with TF requirements

sagemaker-training-toolkit - v4.1.5

Published by sagemaker-bot over 2 years ago

Bug Fixes and Other Changes

  • Fix None exception class issue for MPI

sagemaker-training-toolkit - v4.1.4

Published by sagemaker-bot over 2 years ago

Bug Fixes and Other Changes

  • Use framework-provided error class and stack trace as error message

sagemaker-training-toolkit - v4.1.3

Published by sagemaker-bot over 2 years ago

sagemaker-training-toolkit - v4.1.2

Published by sagemaker-bot over 2 years ago

Bug Fixes and Other Changes

  • fix flaky issue with incorrect rc being given