sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.

APACHE-2.0 License

Downloads: 153.1K
Stars: 490
Committers: 45

sagemaker-training-toolkit - v4.1.1

Published by sagemaker-bot over 2 years ago

Bug Fixes and Other Changes

  • missing args when shell script is used

sagemaker-training-toolkit - v4.1.0

Published by sagemaker-bot over 2 years ago

Features

  • add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22

sagemaker-training-toolkit - v4.0.1

Published by sagemaker-bot over 2 years ago

sagemaker-training-toolkit - v4.0.0

Published by sagemaker-bot about 3 years ago

Breaking Changes

  • Add py38 support; drop py36 and py2 support. Bump PyPI version to 4.0.0 (changes from PR #108)

sagemaker-training-toolkit - v3.9.3

Published by sagemaker-bot about 3 years ago

Bug Fixes and Other Changes

  • Fix logging issues

sagemaker-training-toolkit - v3.9.2

Published by sagemaker-bot over 3 years ago

Bug Fixes and Other Changes

  • Reverted -x FI_EFA_USE_DEVICE_RDMA=1 to fix a crash on PyTorch Dataloaders for Distributed training

sagemaker-training-toolkit - v3.9.1

Published by sagemaker-bot over 3 years ago

Bug Fixes and Other Changes

  • [smdataparallel] better messages to establish the SSH connection between workers

sagemaker-training-toolkit - v3.9.0

Published by sagemaker-bot over 3 years ago

Features

  • smdataparallel enable EFA RDMA flag

sagemaker-training-toolkit - v3.8.0

Published by sagemaker-bot over 3 years ago

Features

  • smdataparallel custom mpi options support

sagemaker-training-toolkit - v3.7.5

Published by sagemaker-bot over 3 years ago

sagemaker-training-toolkit - v3.7.4

Published by sagemaker-bot over 3 years ago

Bug Fixes and Other Changes

  • Update Dockerfile to accommodate Rust dependency.

sagemaker-training-toolkit - v3.7.3

Published by sagemaker-bot over 3 years ago

Bug Fixes and Other Changes

  • set btl_vader_single_copy_mechanism to none to avoid Read -1 Warning messages
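
For readers unfamiliar with Open MPI tuning, btl_vader_single_copy_mechanism is an MCA parameter, and MCA parameters can be passed to mpirun as "-mca <name> <value>". The following is a minimal sketch of how a launcher might assemble such a command line. The helper build_mpirun_command and the train.py entry point are hypothetical names chosen for illustration; this is not the toolkit's actual code.

```python
import shlex


def build_mpirun_command(num_processes, program_args, mca_params=None):
    """Return an mpirun argv list with the given MCA parameters applied.

    Hypothetical helper for illustration; not part of the toolkit's API.
    """
    command = ["mpirun", "-np", str(num_processes)]
    # Each MCA parameter is passed to Open MPI as: -mca <name> <value>
    for name, value in sorted((mca_params or {}).items()):
        command.extend(["-mca", name, str(value)])
    command.extend(program_args)
    return command


command = build_mpirun_command(
    2,
    ["python", "train.py"],
    mca_params={"btl_vader_single_copy_mechanism": "none"},
)
# Print the command as a shell-safe string without running it.
print(" ".join(shlex.quote(part) for part in command))
```

The sketch only builds and prints the command, so it can be run without an MPI installation.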

sagemaker-training-toolkit - v3.7.2

Published by sagemaker-bot almost 4 years ago

Bug Fixes and Other Changes

  • set btl_vader_single_copy_mechanism to none

sagemaker-training-toolkit - v3.7.1

Published by sagemaker-bot almost 4 years ago

Bug Fixes and Other Changes

  • decode binary stderr string before dumping it out
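
As a rough illustration of the kind of fix this entry describes: when a child process's stderr is captured through a pipe in Python 3, it arrives as bytes, and printing it undecoded yields a b'...' repr instead of readable text. The sketch below shows one way to decode it first. The function name run_and_capture_stderr, the UTF-8 encoding, and the errors="replace" policy are assumptions made for this example, not details taken from the toolkit.

```python
import subprocess
import sys


def run_and_capture_stderr(argv):
    """Run argv and return (return_code, stderr_text), with stderr decoded to str.

    Hypothetical helper for illustration; not part of the toolkit's API.
    """
    process = subprocess.Popen(argv, stderr=subprocess.PIPE)
    _, stderr_bytes = process.communicate()
    # communicate() returns bytes for a piped stream; decode before printing.
    # errors="replace" keeps undecodable bytes from raising UnicodeDecodeError.
    stderr_text = stderr_bytes.decode("utf-8", errors="replace")
    return process.returncode, stderr_text


return_code, stderr_text = run_and_capture_stderr(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom\\n'); sys.exit(3)"]
)
print(return_code)
print(stderr_text, end="")
```

The child process here is a one-line Python program that writes to stderr and exits with a non-zero status, standing in for a failing training script.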

sagemaker-training-toolkit - v3.7.0

Published by sagemaker-bot almost 4 years ago

Features

  • add data parallelism support (#3)

Bug Fixes and Other Changes

  • update tox to use sagemaker 2.18.0 for tests
  • use format in place of f-strings and use comment style type annotations
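
To make the last entry concrete, here is a small sketch of the two conventions it names: str.format in place of an f-string, and a PEP 484 comment-style type annotation in place of inline annotations. Both forms run on Python versions that predate f-strings (added in 3.6). The function describe_worker and its arguments are hypothetical, and the compatibility motive is an inference rather than something the release note states.

```python
def describe_worker(host, rank):
    # type: (str, int) -> str
    """Return a short label for a worker (hypothetical helper)."""
    # Equivalent f-string form would be: f"worker {host} has rank {rank}"
    return "worker {} has rank {}".format(host, rank)


print(describe_worker("algo-1", 0))
```

Type checkers such as mypy read the "# type:" comment the same way they read inline annotations, so the function stays checkable without the newer syntax.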

sagemaker-training-toolkit - v3.6.4

Published by sagemaker-bot almost 4 years ago

Bug Fixes and Other Changes

  • workaround to print stderr when capturing

Testing and Release Infrastructure

  • use ECR-hosted image for ubuntu:16.04

sagemaker-training-toolkit - v3.6.3.post0

Published by sagemaker-bot almost 4 years ago

Documentation Changes

  • fix typo in ENVIRONMENT_VARIABLES.md

sagemaker-training-toolkit - v3.6.3

Published by sagemaker-bot almost 4 years ago

Bug Fixes and Other Changes

  • propagate log level to aws services

sagemaker-training-toolkit - v3.6.2

Published by sagemaker-bot about 4 years ago

Bug Fixes and Other Changes

  • check for script entry point even if setup.py is present

sagemaker-training-toolkit - v3.6.1.post1

Published by sagemaker-bot about 4 years ago

Testing and Release Infrastructure

  • pin sagemaker<2 in test dependencies