dstack is an open-source orchestration engine for cost-effectively running AI workloads in the cloud as well as on-premises. Discord: https://discord.gg/u8SmfwPpMd
MPL-2.0 License
Full changelog: https://github.com/dstackai/dstack/compare/0.18.11...0.18.12
Published by un-def about 2 months ago
Full changelog: https://github.com/dstackai/dstack/compare/0.18.11...0.18.12rc1
Published by peterschmidt85 about 2 months ago
With the latest update, you can now specify an AMD GPU under `resources`. Below is an example.
type: service
name: amd-service-tgi
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
- TRUST_REMOTE_CODE=true
- ROCM_USE_FLASH_ATTN_V2_TRITON=true
commands:
- text-generation-launcher --port 8000
port: 8000
resources:
gpu: MI300X
disk: 150GB
spot_policy: auto
model:
type: chat
name: meta-llama/Meta-Llama-3.1-70B-Instruct
format: openai
[!NOTE]
AMD accelerators are currently supported only with the `runpod` backend. Support for on-prem fleets and more backends is coming soon.
The `gpu` property now accepts the `vendor` attribute, with supported values: `nvidia`, `tpu`, and `amd`.
Alternatively, you can also prefix the GPU name with the vendor name followed by a colon, for example: `tpu:v2-8` or `amd:192GB`, etc. This change ensures consistency in GPU requirements configuration across vendors.
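As a sketch, both forms below request the same AMD accelerator (the structured `vendor`/`memory` fields follow the attribute described above; treat the exact field names as illustrative):

```yaml
resources:
  # Structured form: vendor as an attribute of gpu
  gpu:
    vendor: amd
    memory: 192GB

# ...or, equivalently, the shorthand prefix form:
# resources:
#   gpu: amd:192GB
```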
`dstack` now supports encryption of sensitive data, such as backend credentials, user tokens, etc. Learn more on the reference page.
By default, the `dstack` server stores run logs in `~/.dstack/server/projects/<project name>/logs`. To store logs in AWS CloudWatch, set the `DSTACK_SERVER_CLOUDWATCH_LOG_GROUP` environment variable.
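For example, assuming a log group that already exists in CloudWatch (the name `/dstack/server-logs` below is only an illustration), the server could be started like this:

```shell
# Hypothetical log group name; create it in CloudWatch beforehand
export DSTACK_SERVER_CLOUDWATCH_LOG_GROUP=/dstack/server-logs
dstack server
```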
With this update, it's now possible to assign any user as a project manager. This role grants permission to manage project users but does not allow management of backends or resources.
By default, all users can create and manage their own projects. If you want only global admins to create projects, add the following to `~/.dstack/server/config.yml`:
default_permissions:
allow_non_admins_create_projects: false
- `vendor` property under `resources.gpu` @un-def in https://github.com/dstackai/dstack/pull/1558
- `manager` project role @olgenn in https://github.com/dstackai/dstack/pull/1566
- `gpu.vendor` property by @un-def in https://github.com/dstackai/dstack/pull/1570
- `logit_bias: invalid type` by @jvstme in https://github.com/dstackai/dstack/pull/1557
- `root` in Kubernetes runs by @jvstme in https://github.com/dstackai/dstack/pull/1555
- `manager` role by @r4victor in https://github.com/dstackai/dstack/pull/1572
- `tpu-` prefix; add `tpu` vendor alias by @un-def in https://github.com/dstackai/dstack/pull/1587
Full changelog: https://github.com/dstackai/dstack/compare/0.18.10...0.18.11
Published by peterschmidt85 about 2 months ago
With the latest update, you can now specify an AMD GPU under `resources`. Below is an example.
type: service
name: amd-service-tgi
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
- TRUST_REMOTE_CODE=true
- ROCM_USE_FLASH_ATTN_V2_TRITON=true
commands:
- text-generation-launcher --port 8000
port: 8000
resources:
gpu: MI300X
disk: 150GB
spot_policy: auto
model:
type: chat
name: meta-llama/Meta-Llama-3.1-70B-Instruct
format: openai
[!NOTE]
AMD accelerators are currently supported only with the `runpod` backend. Support for on-prem fleets and more backends is coming soon.
- `root` in Kubernetes runs by @jvstme in https://github.com/dstackai/dstack/pull/1555
- `logit_bias: invalid type` by @jvstme in https://github.com/dstackai/dstack/pull/1557
- `vendor` property under `gpu` @un-def in https://github.com/dstackai/dstack/pull/1558
- `manager` role by @r4victor in https://github.com/dstackai/dstack/pull/1572
- `pkg_resources` with `importlib.resources` by @r4victor in https://github.com/dstackai/dstack/pull/1582
- `manager` project role @olgenn in https://github.com/dstackai/dstack/pull/1566
- `tpu-` prefix; add `tpu` vendor alias by @un-def in https://github.com/dstackai/dstack/pull/1587
- `gpu.vendor` property by @un-def in https://github.com/dstackai/dstack/pull/1570
Full changelog: https://github.com/dstackai/dstack/compare/0.18.10...0.18.11rc1
Published by peterschmidt85 2 months ago
As a user, you most likely access `dstack` using its CLI. At the same time, the `dstack` server hosts a control plane that offers a wide range of functionality. It orchestrates cloud infrastructure, manages the state of resources, checks access, and much more.
Previously, managing projects and users was only possible via the API. The latest `dstack` update introduces a full-fledged web-based user interface, which you can now access on the same port where the server is hosted.
The user interface allows you to configure projects, users, their permissions, manage resources and workloads, and much more.
To learn more about how to manage projects, users, and their permissions, check out the Projects page.
Previously, it wasn't possible to use environment variables to configure credentials for a private Docker registry. With this update, you can now use the following interpolation syntax to avoid hardcoding credentials in the configuration.
type: dev-environment
name: train
env:
- DOCKER_USER
- DOCKER_USERPASSWORD
image: dstackai/base:py3.10-0.4-cuda-12.1
registry_auth:
username: ${{ env.DOCKER_USER }}
password: ${{ env.DOCKER_USERPASSWORD }}
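Since the configuration references `DOCKER_USER` and `DOCKER_USERPASSWORD` without values, they need to be available in the environment where the CLI runs. A sketch (the values and file name below are placeholders):

```shell
# Export the registry credentials instead of hardcoding them in the config
export DOCKER_USER=myuser
export DOCKER_USERPASSWORD='s3cr3t'   # placeholder value
dstack apply -f .dstack.yml
```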
When you run a dev environment or a task with `dstack apply`, it automatically forwards the remote ports to localhost. However, these ports are, by default, bound to `127.0.0.1`. If you'd like to make a port available on an arbitrary host, you can now specify the host using the `--host` option.
For example, this command will make the port available on all network interfaces:
dstack apply --host 0.0.0.0 -f my-task.dstack.yml
- `--host HOST` arg to `dstack apply` command by @un-def in https://github.com/dstackai/dstack/pull/1531
- `dstack` CLI exits with non-zero exit code on errors by @r4victor in https://github.com/dstackai/dstack/pull/1529
- `http` services running on 443 in the logs by @r4victor in https://github.com/dstackai/dstack/pull/1522
- `root` user in custom Docker images by @jvstme in https://github.com/dstackai/dstack/pull/1538
- `dstack` VM images by @jvstme in https://github.com/dstackai/dstack/pull/1536
- `nvcc` property by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1526
- `env` for on-prem fleets #1527 by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1530
- `dstack` image version to `0.5` by @jvstme in https://github.com/dstackai/dstack/pull/1541
All changes: https://github.com/dstackai/dstack/compare/0.18.9...0.18.10
Published by peterschmidt85 2 months ago
`nvcc`
If you don't specify a custom Docker image, `dstack` uses its own base image with essential CUDA drivers, `python`, `pip`, and `conda` (Miniforge). Previously, this image didn't include `nvcc`, which is needed for compiling custom CUDA kernels (e.g., Flash Attention).
With version 0.18.9, you can now include `nvcc`.
type: task
python: "3.10"
# This line ensures `nvcc` is included into the base Docker image
nvcc: true
commands:
- pip install -r requirements.txt
- python train.py
resources:
gpu: 24GB
When you create an on-prem fleet, it's now possible to pre-configure environment variables. These variables will be used when installing the `dstack-shim` service on hosts and running workloads.
For example, these environment variables can be used to configure `dstack` to use a proxy:
type: fleet
name: my-fleet
placement: cluster
env:
- HTTP_PROXY=http://proxy.example.com:80
- HTTPS_PROXY=http://proxy.example.com:80
- NO_PROXY=localhost,127.0.0.1
ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 3.255.177.51
- 3.255.177.52
New examples include:
- `shim.log` by @jvstme in https://github.com/dstackai/dstack/pull/1503
- `env` setting to fleet config for on-prem fleets by @un-def in https://github.com/dstackai/dstack/pull/1505
Full changelog: https://github.com/dstackai/dstack/compare/0.18.8...0.18.9
Published by r4victor 3 months ago
#1477 added support for `gcp` volumes:
type: volume
name: my-gcp-volume
backend: gcp
region: europe-west1
size: 100GB
Previously, volumes were only supported for `aws` and `runpod`.
#1486 fixed a major bug introduced in 0.18.7 that could lead to instances not being terminated in the cloud.
Full Changelog: https://github.com/dstackai/dstack/compare/0.18.7...0.18.8
Published by peterschmidt85 3 months ago
With fleets, you can now describe clusters declaratively and create them in both cloud and on-prem with a single command. Once a fleet is created, it can be used with dev environments, tasks, and services.
To provision a fleet in the cloud, specify the required resources, number of nodes, and other optional parameters.
type: fleet
name: my-fleet
placement: cluster
nodes: 2
resources:
gpu: 24GB
To create a fleet from on-prem servers, specify their hosts along with the user, port, and SSH key for connection via SSH.
type: fleet
name: my-fleet
placement: cluster
ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 3.255.177.51
- 3.255.177.52
To create or update the fleet, simply call the dstack apply command:
dstack apply -f examples/fleets/my-fleet.dstack.yml
Learn more about fleets in the documentation.
`dstack run`
Now that we support `dstack apply` for gateways, volumes, and fleets, we have extended this support to dev environments, tasks, and services. Instead of using `dstack run WORKING_DIR -f CONFIG_FILE`, you can now use `dstack apply -f CONFIG_FILE`.
Also, it's now possible to specify a `name` for dev environments, tasks, and services, just like for gateways, volumes, and fleets.
type: dev-environment
name: my-ide
python: "3.11"
ide: vscode
resources:
gpu: 80GB
This `name` is used as a run name and is more convenient than a random name. However, if you don't specify a `name`, `dstack` will assign a random name as before.
In other news, we've added support for volumes in the `runpod` backend. Previously, they were only supported in the `aws` backend.
type: volume
name: my-new-volume
backend: runpod
region: ca-mtl-3
size: 100GB
A great feature of `runpod` volumes is their ability to attach to multiple instances simultaneously. This allows for persisting cache across multiple service replicas or supporting distributed training tasks.
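As a sketch of that use case, a service with several replicas could mount the same volume as a shared cache (the `replicas` count, image, and path below are illustrative, not from the release notes):

```yaml
type: service
name: my-shared-cache-service
image: ghcr.io/huggingface/text-generation-inference:latest
commands:
  - text-generation-launcher --port 8000
port: 8000
replicas: 2   # both replicas attach the same runpod volume
volumes:
  - name: my-new-volume
    path: /data/cache   # shared cache directory
resources:
  gpu: 24GB
```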
[!IMPORTANT]
This update fixes the broken `kubernetes` backend, which has been non-functional for the past few updates.
- `--gpu` override YAML's `gpu` by @r4victor in https://github.com/dstackai/dstack/pull/1455
- `busy` offers from the top of offers list by @jvstme in https://github.com/dstackai/dstack/pull/1452
- `dstack volume delete` by @r4victor in https://github.com/dstackai/dstack/pull/1434
- `provisioning` for container backends by @r4victor
- [Docs] Fix typos by @jvstme in https://github.com/dstackai/dstack/pull/1426
- `DSTACK_SENTRY_PROFILES_SAMPLE_RATE` by @r4victor in https://github.com/dstackai/dstack/pull/1428
- `ruff` to `0.5.3` by @jvstme in https://github.com/dstackai/dstack/pull/1421
- `--gpu` override YAML's `gpu` by @r4victor in https://github.com/dstackai/dstack/pull/1455
- `regions` for `runpod` by @r4victor in https://github.com/dstackai/dstack/pull/1460
**Full changelog**: https://github.com/dstackai/dstack/compare/0.18.6...0.18.7
Published by peterschmidt85 3 months ago
This is a preview build of the upcoming `0.18.7` update, bringing a few major new features and many bug fixes.
[!IMPORTANT]
With fleets, you can now describe clusters declaratively and create them in both cloud and on-prem with a single command. Once a fleet is created, it can be used with dev environments, tasks, and services.
To provision a fleet in the cloud, specify the required resources, number of nodes, and other optional parameters.
type: fleet
name: my-fleet
placement: cluster
nodes: 2
resources:
gpu: 24GB
To create a fleet from on-prem servers, specify their hosts along with the user, port, and SSH key for connection via SSH.
type: fleet
name: my-fleet
placement: cluster
ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 3.255.177.51
- 3.255.177.52
To create or update the fleet, simply call the dstack apply command:
dstack apply -f examples/fleets/my-fleet.dstack.yml
Learn more about fleets in the documentation.
`dstack run`
[!IMPORTANT]
Now that we support `dstack apply` for gateways, volumes, and fleets, we have extended this support to dev environments, tasks, and services. Instead of using `dstack run WORKING_DIR -f CONFIG_FILE`, you can now use `dstack apply -f CONFIG_FILE`.
Also, it's now possible to specify a `name` for dev environments, tasks, and services, just like for gateways, volumes, and fleets.
type: dev-environment
name: my-ide
python: "3.11"
ide: vscode
resources:
gpu: 80GB
This `name` is used as a run name and is more convenient than a random name. However, if you don't specify a `name`, `dstack` will assign a random name as before.
[!IMPORTANT]
In other news, we've added support for volumes in the `runpod` backend. Previously, they were only supported in the `aws` backend.
type: volume
name: my-new-volume
backend: runpod
region: ca-mtl-3
size: 100GB
A great feature of `runpod` volumes is their ability to attach to multiple instances simultaneously. This allows for persisting cache across multiple service replicas or supporting distributed training tasks.
[!IMPORTANT]
This update fixes the broken `kubernetes` backend, which has been non-functional for the past few updates.
- `--gpu` override YAML's `gpu` by @r4victor in https://github.com/dstackai/dstack/pull/1455
- `busy` offers from the top of offers list by @jvstme in https://github.com/dstackai/dstack/pull/1452
- `dstack volume delete` by @r4victor in https://github.com/dstackai/dstack/pull/1434
- `provisioning` for container backends by @r4victor
- [Docs] Fix typos by @jvstme in https://github.com/dstackai/dstack/pull/1426
- `DSTACK_SENTRY_PROFILES_SAMPLE_RATE` by @r4victor in https://github.com/dstackai/dstack/pull/1428
- `ruff` to `0.5.3` by @jvstme in https://github.com/dstackai/dstack/pull/1421
**Full changelog**: https://github.com/dstackai/dstack/compare/0.18.6...0.18.7rc2
Published by peterschmidt85 3 months ago
- `H100` with the `gcp` backend by @jvstme in https://github.com/dstackai/dstack/pull/1405
[!WARNING]
If you have idle instances in your pool, it is recommended to re-create them after upgrading to version 0.18.6. Otherwise, there is a risk that these instances won't be able to execute jobs.
- `dstack-runner` repo tests by @jvstme in https://github.com/dstackai/dstack/pull/1418
Full changelog: https://github.com/dstackai/dstack/compare/0.18.5...0.18.6
Published by peterschmidt85 3 months ago
Read below about its new features and bug fixes.
When you run anything with `dstack`, it allows you to configure the disk size. However, once the run is finished, if you haven't stored your data in any external storage, all the data on disk will be erased. With `0.18.5`, we're adding support for network volumes that allow data to persist across runs.
Once you've created a volume (e.g., named `my-new-volume`), you can attach it to a dev environment, task, or service.
type: dev-environment
ide: vscode
volumes:
- name: my-new-volume
path: /volume_data
The data stored in the volume will persist across runs.
`dstack` allows you to create new volumes and register existing ones. To learn more about how volumes work, check out the docs.
[!IMPORTANT]
Volumes are currently experimental and only work with the `aws` backend. Support for other backends is coming soon.
By default, `dstack` stores its state in `~/.dstack/server/data` using SQLite. With this update, it's now possible to configure `dstack` to store its state in PostgreSQL. Just pass the `DSTACK_DATABASE_URL` environment variable.
DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" dstack server
[!IMPORTANT]
Despite PostgreSQL support, `dstack` still requires that you run only one instance of the `dstack` server. However, this requirement will be lifted in a future update.
Previously, `dstack` didn't allow the use of on-prem clusters (added via `dstack pool add-ssh`) if there were no backends configured. This update fixes that bug. Now, you don't have to configure any backends if you only plan to use on-prem clusters.
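For example, with no backends configured at all, a single on-prem host can be added straight to the pool (the key path and address below are placeholders):

```shell
# Add an on-prem server to the pool over SSH; no cloud backend required
dstack pool add-ssh -i ~/.ssh/id_rsa ubuntu@203.0.113.10
```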
Previously, `dstack` didn't support `L4` and `H100` GPUs with AWS. Now you can use them.
- `dstack` VM images by @jvstme in https://github.com/dstackai/dstack/pull/1389
- `dstack` Docker images by @jvstme in https://github.com/dstackai/dstack/pull/1391
- `pool add-ssh` by @jvstme in https://github.com/dstackai/dstack/pull/1396
See more: https://github.com/dstackai/dstack/compare/0.18.4...0.18.5
Published by peterschmidt85 3 months ago
This is a release candidate build of the upcoming `0.18.5` release. Read below to learn about its new features and bug fixes.
When you run anything with `dstack`, it allows you to configure the disk size. However, once the run is finished, if you haven't stored your data in any external storage, all the data on disk will be erased. With `0.18.5`, we're adding support for network volumes that allow data to persist across runs.
Once you've created a volume (e.g., named `my-new-volume`), you can attach it to a dev environment, task, or service.
type: dev-environment
ide: vscode
volumes:
- name: my-new-volume
path: /volume_data
The data stored in the volume will persist across runs.
`dstack` allows you to create new volumes and register existing ones. To learn more about how volumes work, check out the docs.
[!IMPORTANT]
Volumes are currently experimental and only work with the `aws` backend. Support for other backends is coming soon.
By default, `dstack` stores its state in `/root/.dstack/server/data` using SQLite. With this update, it's now possible to configure `dstack` to store its state in PostgreSQL. Just pass the `DSTACK_DATABASE_URL` environment variable.
DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" dstack server
[!IMPORTANT]
Despite PostgreSQL support, `dstack` still requires that you run only one instance of the `dstack` server. However, this requirement will be lifted in a future update.
Previously, `dstack` didn't allow the use of on-prem clusters (added via `dstack pool add-ssh`) if there were no backends configured. This update fixes that bug. Now, you don't have to configure any backends if you only plan to use on-prem clusters.
Previously, `dstack` didn't support `L4` and `H100` GPUs with AWS. Now you can use them.
- `dstack` VM images by @jvstme in https://github.com/dstackai/dstack/pull/1389
- `dstack` Docker images by @jvstme in https://github.com/dstackai/dstack/pull/1391
- `pool add-ssh` by @jvstme in https://github.com/dstackai/dstack/pull/1396
See more: https://github.com/dstackai/dstack/compare/0.18.4...0.18.5rc1
Published by peterschmidt85 4 months ago
This update introduces initial support for Google Cloud TPU.
To request a TPU, specify the TPU architecture prefixed by `tpu-` (in `gpu` under `resources`):
type: task
python: "3.11"
commands:
- pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
- git clone --recursive https://github.com/pytorch/xla.git
- python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
gpu: tpu-v2-8
[!IMPORTANT]
Currently, you can only request 8 TPU cores, meaning only single TPU device workloads are supported. Support for multiple TPU devices is coming soon.
Additionally, the update allows configuring the `gcp` backend to use only private subnets. To achieve this, set `public_ips` to `false`.
projects:
- name: main
backends:
- type: gcp
creds:
type: default
public_ips: false
Besides TPU, the update fixes a few important bugs.
- `cudo` backend stuck && Improve docs for `cudo` by @smokfyz in https://github.com/dstackai/dstack/pull/1347
- `nvidia-smi` not available on `lambda` by @r4victor in https://github.com/dstackai/dstack/pull/1357
- `registry_auth` for RunPod by @smokfyz in https://github.com/dstackai/dstack/pull/1333
- `oci` by @jvstme in https://github.com/dstackai/dstack/pull/1334
- `ssh` version by @loghijiaha in https://github.com/dstackai/dstack/pull/1313
- `oci` Bare Metal instances by @jvstme in https://github.com/dstackai/dstack/pull/1325
- `oci` `BM.Optimized3.36` instance by @jvstme in https://github.com/dstackai/dstack/pull/1328
- `dstack pool` docs by @jvstme in https://github.com/dstackai/dstack/pull/1329
- `gcp` by @Bihan in https://github.com/dstackai/dstack/pull/1323
- `runner-test` workflow by @r4victor in https://github.com/dstackai/dstack/pull/1336
- `22`, `80`, and `443` by @smokfyz in https://github.com/dstackai/dstack/pull/1335
- `serve.dstack.yml` - infinity by @michaelfeil in https://github.com/dstackai/dstack/pull/1340
- `/api/pools/list_instances` by @r4victor in https://github.com/dstackai/dstack/pull/1320
- `gcp` VPC config when provisioning TPUs by @r4victor in https://github.com/dstackai/dstack/pull/1332
Full changelog: https://github.com/dstackai/dstack/compare/0.18.3...0.18.4
Published by peterschmidt85 4 months ago
This is a preview build of the upcoming `0.18.4` release. See below for what's new.
One of the major new features in this update is the initial support for Google Cloud TPU.
To request a TPU, you simply need to specify the system architecture of the required TPU prefixed by `tpu-` in `gpu`:
type: task
python: "3.11"
commands:
- pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
- git clone --recursive https://github.com/pytorch/xla.git
- python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
gpu: tpu-v2-8
[!IMPORTANT]
You cannot request multiple nodes for tasks (for running in parallel on multiple TPU devices). This feature is coming soon.
You're very welcome to try the initial support and share your feedback.
Besides TPU, the update fixes a few important bugs.
- `cudo` backend stuck && Improve docs for `cudo` by @smokfyz in https://github.com/dstackai/dstack/pull/1347
- `nvidia-smi` not available on `lambda` by @r4victor in https://github.com/dstackai/dstack/pull/1357
- `registry_auth` for RunPod by @smokfyz in https://github.com/dstackai/dstack/pull/1333
- `oci` by @jvstme in https://github.com/dstackai/dstack/pull/1334
- `ssh` version by @loghijiaha in https://github.com/dstackai/dstack/pull/1313
- `oci` Bare Metal instances by @jvstme in https://github.com/dstackai/dstack/pull/1325
- `oci` `BM.Optimized3.36` instance by @jvstme in https://github.com/dstackai/dstack/pull/1328
- `dstack pool` docs by @jvstme in https://github.com/dstackai/dstack/pull/1329
- `gcp` by @Bihan in https://github.com/dstackai/dstack/pull/1323
- `runner-test` workflow by @r4victor in https://github.com/dstackai/dstack/pull/1336
- `22`, `80`, and `443` by @smokfyz in https://github.com/dstackai/dstack/pull/1335
- `serve.dstack.yml` - infinity by @michaelfeil in https://github.com/dstackai/dstack/pull/1340
- `/api/pools/list_instances` by @r4victor in https://github.com/dstackai/dstack/pull/1320
- `gcp` VPC config when provisioning TPUs by @r4victor in https://github.com/dstackai/dstack/pull/1332
Full changelog: https://github.com/dstackai/dstack/compare/0.18.3...0.18.4rc3
Published by peterschmidt85 5 months ago
With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called `oci` and can be configured as follows:
projects:
- name: main
backends:
- type: oci
creds:
type: default
The supported credential types include `default` and `client`. If `default` is used, `dstack` automatically picks up the default OCI credentials from `~/.oci/config`.
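With `type: default`, `dstack` relies on the standard OCI CLI configuration file. A minimal `~/.oci/config` looks roughly like this (all values below are placeholders):

```ini
[DEFAULT]
user=ocid1.user.oc1..exampleuniqueID
fingerprint=20:3b:97:13:55:1c:5b:0d:d3:37:d8:50:4e:c5:3a:34
tenancy=ocid1.tenancy.oc1..exampleuniqueID
region=eu-frankfurt-1
key_file=~/.oci/oci_api_key.pem
```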
[!WARNING]
OCI support does not yet include spot instances, multi-node tasks, and gateways. These features are coming soon.
We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:
type: task
commands:
- python train.py
retry:
on_events: [no-capacity]
duration: 2h
Now, if you run such a task, `dstack` will keep trying to find capacity within 2 hours. Once capacity is found, `dstack` will run the task.
The `on_events` property also supports `error` (in case the run fails with an error) and `interruption` (if the run is using a spot instance and it was interrupted).
Previously, `dstack` only allowed retries when spot instances were interrupted.
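Combining the supported events, a task could, for instance, retry both on lack of capacity and on spot interruption (a sketch based on the properties described above):

```yaml
type: task
commands:
  - python train.py
retry:
  # Retry when there's no capacity or when a spot instance is interrupted
  on_events: [no-capacity, interruption]
  duration: 2h
```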
Previously, the `runpod` backend only allowed the use of Docker images with `/bin/bash` or `/bin/sh` as the entrypoint. Thanks to the fix on RunPod's side, `dstack` now allows the use of any Docker images.
Additionally, the `runpod` backend now also supports spot instances.
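For instance, a task can now combine an arbitrary image with a spot instance on `runpod` (the image and GPU values below are illustrative):

```yaml
type: task
# Any image works now, regardless of its entrypoint
image: nvcr.io/nvidia/pytorch:24.05-py3
backends: [runpod]
spot_policy: spot
commands:
  - python train.py
resources:
  gpu: 24GB
```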
The `gcp` backend now also allows configuring VPCs:
projects:
- name: main
backends:
- type: gcp
project_id: my-awesome-project
creds:
type: default
vpc_name: my-custom-vpc
The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify `vpc_project_id`.
Last but not least, for the `aws` backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:
projects:
- name: main
backends:
- type: aws
creds:
type: default
vpc_ids:
us-east-1: vpc-0a2b3c4d5e6f7g8h
default_vpcs: true
You just need to set `default_vpcs` to `true`.
- `ssh` backend by @r4victor in https://github.com/dstackai/dstack/pull/1278
- `dstack run` attached mode by @r4victor in https://github.com/dstackai/dstack/pull/1285
- `unreachable` instances by @r4victor in https://github.com/dstackai/dstack/pull/1286
- `dstack ps -v` by @r4victor in https://github.com/dstackai/dstack/pull/1301
- `dstack destroy` to `dstack delete` by @r4victor in https://github.com/dstackai/dstack/pull/1275
Published by peterschmidt85 5 months ago
With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called `oci` and can be configured as follows:
projects:
- name: main
backends:
- type: oci
creds:
type: default
The supported credential types include `default` and `client`. If `default` is used, `dstack` automatically picks up the default OCI credentials from `~/.oci/config`.
[!WARNING]
OCI support does not yet include spot instances, multi-node tasks, and gateways. These features will be added in upcoming updates.
We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:
type: task
commands:
- python train.py
retry:
on_events: [no-capacity]
duration: 2h
Now, if you run such a task, `dstack` will keep trying to find capacity within 2 hours. Once capacity is found, `dstack` will run the task.
The `on_events` property also supports `error` (in case the run fails with an error) and `interruption` (if the run is using a spot instance and it was interrupted).
Previously, `dstack` only allowed retries when spot instances were interrupted.
The `gcp` backend now also allows configuring VPCs:
projects:
- name: main
backends:
- type: gcp
project_id: my-awesome-project
creds:
type: default
vpc_name: my-custom-vpc
The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify `vpc_project_id`.
Last but not least, for the `aws` backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:
projects:
- name: main
backends:
- type: aws
creds:
type: default
vpc_ids:
us-east-1: vpc-0a2b3c4d5e6f7g8h
default_vpcs: true
You just need to set `default_vpcs` to `true`.
- `ssh` backend by @r4victor in https://github.com/dstackai/dstack/pull/1278
- `dstack run` attached mode by @r4victor in https://github.com/dstackai/dstack/pull/1285
- `unreachable` instances by @r4victor in https://github.com/dstackai/dstack/pull/1286
- `dstack ps -v` by @r4victor in https://github.com/dstackai/dstack/pull/1301
- `dstack destroy` to `dstack delete` by @r4victor in https://github.com/dstackai/dstack/pull/1275
Full changelog: https://github.com/dstackai/dstack/compare/0.18.2...0.18.3rc1
[!WARNING]
This is an RC build. Please report any bugs to the issue tracker. The final release is planned for later this week, and the official documentation and examples will be updated then.
Published by peterschmidt85 5 months ago
The `dstack pool add-ssh` command now supports the `--network` argument. Use this argument if you want to use multiple instances that share the same private network as a cluster to run multi-node tasks.
The `--network` argument accepts the IP address range (CIDR) of the private network of the instance.
Example:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected] --network 10.0.0.0/24
Once you've added multiple instances with the same network value, you'll be able to use them as a cluster to run multi-node tasks.
By default, `dstack` uses public IPs for SSH access to running instances, requiring public subnets in the VPC. The new update allows AWS instances to use private subnets instead.
To create instances only in private subnets, set `public_ips` to `false` in the AWS backend settings:
type: aws
creds:
type: default
vpc_ids:
...
public_ips: false
[!NOTE]
- Both `dstack server` and the `dstack` CLI should have access to the private subnet to access instances.
- If you want running instances to access the Internet, the private subnets need to have a NAT gateway.
`dstack apply`
Previously, to create or update gateways, one had to use the `dstack gateway create` or `dstack gateway update` commands.
Now, it's possible to define a gateway configuration via YAML and create or update it using the `dstack apply` command.
Example:
type: gateway
name: example-gateway
backend: gcp
region: europe-west1
domain: example.com
dstack apply -f examples/deployment/gateway.dstack.yml
For now, the `dstack apply` command only supports the `gateway` configuration type. Soon, it will also support `dev-environment`, `task`, and `service`, replacing the `dstack run` command.
The `dstack destroy` command can be used to delete resources.
By default, gateways are deployed using public subnets. Since `0.18.2`, it is now possible to deploy gateways using private subnets. To do this, you need to set `public_ip` to `false` and specify the ARN of a certificate from AWS Certificate Manager.
type: gateway
name: example-gateway
backend: aws
region: eu-west-1
domain: "example.com"
public_ip: false
certificate:
type: acm
arn: "arn:aws:acm:eu-west-1:3515152512515:certificate/3251511125--1241-1224-121251515125"
In this case, `dstack` will deploy the gateway in a private subnet behind a load balancer using the specified certificate.
[!NOTE]
Private gateways are currently supported only for AWS.
- `dstack pool add-ssh` instances by @TheBits in https://github.com/dstackai/dstack/pull/1189
- `runpod` by @Bihan in https://github.com/dstackai/dstack/pull/1119
- `ProjectModel` loading by @r4victor in https://github.com/dstackai/dstack/pull/1199
- `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1202
- `https` by @r4victor in https://github.com/dstackai/dstack/pull/1217
- `dstack apply` for gateways by @r4victor in https://github.com/dstackai/dstack/pull/1223
- `--network` with `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1225
- `network` passed to `dstack pool add-ssh` is not correct by @TheBits in https://github.com/dstackai/dstack/pull/1233
- `dstack-shim.service` with `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1253
- `dstack pool remove` to `rm` by @muddi900 in https://github.com/dstackai/dstack/pull/1258
- `--network` by @TheBits in https://github.com/dstackai/dstack/pull/1263
- `axolotl` example by @deep-diver in https://github.com/dstackai/dstack/pull/1187
Full Changelog: https://github.com/dstackai/dstack/compare/0.18.1...0.18.2
Published by peterschmidt85 6 months ago
Now you can add your own servers as pool instances:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected]
[!NOTE]
The server should be pre-installed with CUDA 12.1 and NVIDIA Docker.
All `.dstack/profiles.yml` properties can now be specified via run configurations:
type: dev-environment
ide: vscode
spot_policy: auto
backends: ["aws"]
regions: ["eu-west-1", "eu-west-2"]
instance_types: ["p3.8xlarge", "p3.16xlarge"]
max_price: 2.0
max_duration: 1d
Thanks to the contribution from @deep-diver, we got two new examples:

- `vpc_ids` in `server/config.yml`
- `~/.dstack/profiles.yml`
- `DSTACK_RUN_NAME`, `DSTACK_GPUS_NUM`, `DSTACK_NODES_NUM`, `DSTACK_NODE_RANK`, and `DSTACK_MASTER_NODE_IP`
- `A10` GPU on Azure

- `.dstack/profiles.yml` by @r4victor in https://github.com/dstackai/dstack/pull/1134
- `No such profile: None` when missing `.dstack/profiles.yml` by @r4victor in https://github.com/dstackai/dstack/pull/1135
- `run_job/create_instance` by @r4victor in https://github.com/dstackai/dstack/pull/1149
- `packer` -> `scripts/packer` by @jvstme in https://github.com/dstackai/dstack/pull/1153
- `executor_error` check being falsely positive by @TheBits in https://github.com/dstackai/dstack/pull/1160
- `vpc_ids` by @r4victor in https://github.com/dstackai/dstack/pull/1170
- `KeyError: 'IpPermissions'` when using AWS by @jvstme in https://github.com/dstackai/dstack/pull/1169
- `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1173
- `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1178
- `HUGGING_FACE_HUB_TOKEN` by @r4victor in https://github.com/dstackai/dstack/pull/1184
- `DSTACK_*` by @r4victor in https://github.com/dstackai/dstack/pull/1185
Full Changelog: https://github.com/dstackai/dstack/compare/0.18.0...0.18.1rc2
Published by peterschmidt85 6 months ago
This is a preview of the upcoming `0.18.0` update. Read below to see what improvements it brings.

The update adds the long-awaited integration with RunPod, a distributed GPU cloud that offers GPUs at affordable prices.

To use RunPod, specify your RunPod API key in `~/.dstack/server/config.yml`:
```yaml
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9
```
Once the server is restarted, go ahead and run workloads:
Another major change in this update is the ability to run multi-node tasks over an interconnected cluster of instances. Simply set the `nodes` property of your task to the number of required nodes and run it.
```yaml
type: task
nodes: 2
commands:
  - git clone https://github.com/r4victor/pytorch-distributed-resnet.git
  - cd pytorch-distributed-resnet
  - mkdir -p data
  - cd data
  - wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  - tar -xvzf cifar-10-python.tar.gz
  - cd ..
  - pip3 install -r requirements.txt torch
  - mkdir -p saved_models
  - torchrun --nproc_per_node=$DSTACK_GPUS_PER_NODE
      --node_rank=$DSTACK_NODE_RANK
      --nnodes=$DSTACK_NODES_NUM
      --master_addr=$DSTACK_MASTER_NODE_IP
      --master_port=8008 resnet_ddp.py
      --num_epochs 20
resources:
  gpu: 1
```
Currently supported providers for this feature include AWS, GCP, and Azure. For other providers or on-premises servers, file the corresponding feature requests or ping on Discord.
One more small improvement: the `commands` property is no longer required for tasks and services if you use an image that has a default entrypoint configured.
```yaml
type: task
image: r8.im/bytedance/sdxl-lightning-4step
ports:
  - 5000
resources:
  gpu: 24GB
```
The update also improves the output of the `dstack server` command.

Last but not least, we've made the permissions required for using `dstack` with GCP more granular:
```
compute.disks.create
compute.firewalls.create
compute.images.useReadOnly
compute.instances.create
compute.instances.delete
compute.instances.get
compute.instances.setLabels
compute.instances.setMetadata
compute.instances.setTags
compute.networks.updatePolicy
compute.regions.list
compute.subnetworks.use
compute.subnetworks.useExternalIp
compute.zoneOperations.get
```
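If you want to grant exactly this set, the permissions above can be packaged into a GCP custom role. Below is a sketch of a role definition file; the role ID, title, and file name are illustrative, not part of the release:

```yaml
# Hypothetical custom role for the dstack GCP backend (permissions as of 0.18.0).
# Create it with:
#   gcloud iam roles create dstackCompute --project=<your-project> --file=dstack-role.yaml
title: dstack Compute
description: Minimal permissions for the dstack GCP backend
stage: GA
includedPermissions:
  - compute.disks.create
  - compute.firewalls.create
  - compute.images.useReadOnly
  - compute.instances.create
  - compute.instances.delete
  - compute.instances.get
  - compute.instances.setLabels
  - compute.instances.setMetadata
  - compute.instances.setTags
  - compute.networks.updatePolicy
  - compute.regions.list
  - compute.subnetworks.use
  - compute.subnetworks.useExternalIp
  - compute.zoneOperations.get
```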
- `username` filter to `/api/runs/list` by @r4victor in https://github.com/dstackai/dstack/pull/1068
- `replicas` by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1055
- `server/config.yml` reference documentation by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1077
- `runpod` backend by @Bihan in https://github.com/dstackai/dstack/pull/1063
- `terminate_idle_instance` by @TheBits in https://github.com/dstackai/dstack/pull/1081
- `dstack init` doesn't work with a remote Git repo by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1090
- `dstack server` output by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1088
- `dstack-shim` by @TheBits in https://github.com/dstackai/dstack/pull/1061
- `RetryPolicy.limit` to `RetryPolicy.duration` by @TheBits in https://github.com/dstackai/dstack/pull/1074
- `dstack version` configurable when deploying docs by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1095
- `dstack init` doesn't work with a local Git repo by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1096
- `create_instance()` on the `cudo` provider by @r4victor in https://github.com/dstackai/dstack/pull/1082
- `latest` Docker image and YAML scheme for pre-release builds by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1099
- `commands` optional in run configurations by @jvstme in https://github.com/dstackai/dstack/pull/1104
- `cudo` backend use non-gpu instances by @Bihan in https://github.com/dstackai/dstack/pull/1092
Full changelog: https://github.com/dstackai/dstack/compare/0.17.0...0.18.0rc3
This is not the final release yet. If you encounter any bugs, report them directly via issues, or on our Discord.
Published by peterschmidt85 7 months ago
The latest update previews service replicas and auto-scaling, and brings many other improvements.

Previously, `dstack` always served services as single replicas. While this is suitable for development, in production a service must automatically scale based on the load. That's why in `0.17.0`, we extended `dstack` with the capability to configure `replicas` (the number of replicas) as well as `scaling` (the auto-scaling policy).
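As a sketch, the new properties can be combined in a service configuration like this. The image, name, and scaling target below are illustrative; check the dstack documentation for the exact set of supported scaling metrics:

```yaml
type: service
name: llm-service                # illustrative name
image: ghcr.io/huggingface/text-generation-inference:latest
commands:
  - text-generation-launcher --port 8000
port: 8000
resources:
  gpu: 24GB
# Keep between 1 and 4 replicas, scaling on request load
replicas: 1..4
scaling:
  metric: rps                    # requests per second per replica (illustrative target)
  target: 10
```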
The update brings support for specifying regions and instance types (in `dstack run` and `.dstack/profiles.yml`).
Firstly, it's now possible to configure an environment variable in the configuration without hardcoding its value. Secondly, `dstack run` now inherits environment variables from the current process.
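A short sketch of both behaviors in a run configuration (the variable names are illustrative):

```yaml
type: task
env:
  # Value set directly in the configuration
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  # No value given: dstack run passes HUGGING_FACE_HUB_TOKEN from the current shell
  - HUGGING_FACE_HUB_TOKEN
commands:
  - echo $MODEL_ID
```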
For more details on these new features, check the changelog.
- `instance_type` via CLI and profiles by @r4victor in https://github.com/dstackai/dstack/pull/1023
- `shm_size` property in resources doesn't take effect by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1007
- `vastai` doesn't show any offers since `0.16.0` by @iRohith in https://github.com/dstackai/dstack/pull/959
- `main` by @peterschmidt85 in https://github.com/dstackai/dstack/pull/992
Full changelog: https://github.com/dstackai/dstack/compare/0.16.5...0.17.0