dstack is an open-source orchestration engine for cost-effectively running AI workloads in the cloud as well as on-premises. Discord: https://discord.gg/u8SmfwPpMd
MPL-2.0 License
Full changelog: https://github.com/dstackai/dstack/compare/0.18.11...0.18.12
Published by un-def about 2 months ago
Full changelog: https://github.com/dstackai/dstack/compare/0.18.11...0.18.12rc1
Published by peterschmidt85 about 2 months ago
With the latest update, you can now specify an AMD GPU under `resources`. Below is an example.
type: service
name: amd-service-tgi
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
- TRUST_REMOTE_CODE=true
- ROCM_USE_FLASH_ATTN_V2_TRITON=true
commands:
- text-generation-launcher --port 8000
port: 8000
resources:
gpu: MI300X
disk: 150GB
spot_policy: auto
model:
type: chat
name: meta-llama/Meta-Llama-3.1-70B-Instruct
format: openai
[!NOTE]
AMD accelerators are currently supported only with the `runpod` backend. Support for on-prem fleets and more backends is coming soon.
The `gpu` property now accepts the `vendor` attribute, with supported values: `nvidia`, `tpu`, and `amd`.
Alternatively, you can also prefix the GPU name with the vendor name followed by a colon, for example: `tpu:v2-8` or `amd:192GB`, etc. This change ensures consistency in GPU requirements configuration across vendors.
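As a sketch, both forms below request the same AMD accelerator (the structured `vendor`/`memory` fields follow the attribute described above; treat the exact field names as illustrative):

```yaml
resources:
  # Structured form: vendor as an attribute of gpu
  gpu:
    vendor: amd
    memory: 192GB

# ...or, equivalently, the shorthand prefix form:
# resources:
#   gpu: amd:192GB
```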
`dstack` now supports encryption of sensitive data, such as backend credentials, user tokens, etc. Learn more on the reference page.
By default, the `dstack` server stores run logs in `~/.dstack/server/projects/<project name>/logs`. To store logs in AWS CloudWatch, set the `DSTACK_SERVER_CLOUDWATCH_LOG_GROUP` environment variable.
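For example, assuming a log group that already exists in CloudWatch (the name `/dstack/server-logs` below is only an illustration), the server could be started like this:

```shell
# Hypothetical log group name; create it in CloudWatch beforehand
export DSTACK_SERVER_CLOUDWATCH_LOG_GROUP=/dstack/server-logs
dstack server
```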
With this update, it's now possible to assign any user as a project manager. This role grants permission to manage project users but does not allow management of backends or resources.
By default, all users can create and manage their own projects. If you want only global admins to create projects, add the following to `~/.dstack/server/config.yml`:
default_permissions:
allow_non_admins_create_projects: false
- `vendor` property under `resources.gpu` @un-def in https://github.com/dstackai/dstack/pull/1558
- `manager` project role @olgenn in https://github.com/dstackai/dstack/pull/1566
- `gpu.vendor` property by @un-def in https://github.com/dstackai/dstack/pull/1570
- `logit_bias: invalid type` by @jvstme in https://github.com/dstackai/dstack/pull/1557
- `root` in Kubernetes runs by @jvstme in https://github.com/dstackai/dstack/pull/1555
- `manager` role by @r4victor in https://github.com/dstackai/dstack/pull/1572
- `tpu-` prefix; add `tpu` vendor alias by @un-def in https://github.com/dstackai/dstack/pull/1587
Full changelog: https://github.com/dstackai/dstack/compare/0.18.10...0.18.11
Published by peterschmidt85 about 2 months ago
With the latest update, you can now specify an AMD GPU under `resources`. Below is an example.
type: service
name: amd-service-tgi
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
- TRUST_REMOTE_CODE=true
- ROCM_USE_FLASH_ATTN_V2_TRITON=true
commands:
- text-generation-launcher --port 8000
port: 8000
resources:
gpu: MI300X
disk: 150GB
spot_policy: auto
model:
type: chat
name: meta-llama/Meta-Llama-3.1-70B-Instruct
format: openai
[!NOTE]
AMD accelerators are currently supported only with the `runpod` backend. Support for on-prem fleets and more backends is coming soon.
- `root` in Kubernetes runs by @jvstme in https://github.com/dstackai/dstack/pull/1555
- `logit_bias: invalid type` by @jvstme in https://github.com/dstackai/dstack/pull/1557
- `vendor` property under `gpu` @un-def in https://github.com/dstackai/dstack/pull/1558
- `manager` role by @r4victor in https://github.com/dstackai/dstack/pull/1572
- `pkg_resources` with `importlib.resources` by @r4victor in https://github.com/dstackai/dstack/pull/1582
- `manager` project role @olgenn in https://github.com/dstackai/dstack/pull/1566
- `tpu-` prefix; add `tpu` vendor alias by @un-def in https://github.com/dstackai/dstack/pull/1587
- `gpu.vendor` property by @un-def in https://github.com/dstackai/dstack/pull/1570
Full changelog: https://github.com/dstackai/dstack/compare/0.18.10...0.18.11rc1
Published by peterschmidt85 2 months ago
As a user, you most likely access `dstack` using its CLI. At the same time, the `dstack` server hosts a control plane that offers a wide range of functionality. It orchestrates cloud infrastructure, manages the state of resources, checks access, and much more.
Previously, managing projects and users was only possible via the API. The latest `dstack` update introduces a full-fledged web-based user interface, which you can now access on the same port where the server is hosted.
The user interface allows you to configure projects, users, their permissions, manage resources and workloads, and much more.
To learn more about how to manage projects, users, and their permissions, check out the Projects page.
Previously, it wasn't possible to use environment variables to configure credentials for a private Docker registry. With this update, you can now use the following interpolation syntax to avoid hardcoding credentials in the configuration.
type: dev-environment
name: train
env:
- DOCKER_USER
- DOCKER_USERPASSWORD
image: dstackai/base:py3.10-0.4-cuda-12.1
registry_auth:
username: ${{ env.DOCKER_USER }}
password: ${{ env.DOCKER_USERPASSWORD }}
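Since the configuration references `DOCKER_USER` and `DOCKER_USERPASSWORD` without values, they need to be available in the environment where the CLI runs. A sketch (the values and file name below are placeholders):

```shell
# Export the registry credentials instead of hardcoding them in the config
export DOCKER_USER=myuser
export DOCKER_USERPASSWORD='s3cr3t'   # placeholder value
dstack apply -f .dstack.yml
```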
When you run a dev environment or a task with `dstack apply`, it automatically forwards the remote ports to localhost. However, these ports are, by default, bound to `127.0.0.1`. If you'd like to make a port available on an arbitrary host, you can now specify the host using the `--host` option.
For example, this command will make the port available on all network interfaces:
dstack apply --host 0.0.0.0 -f my-task.dstack.yml
- `--host HOST` arg to `dstack apply` command by @un-def in https://github.com/dstackai/dstack/pull/1531
- `dstack` CLI exits with non-zero exit code on errors by @r4victor in https://github.com/dstackai/dstack/pull/1529
- `http` services running on 443 in the logs by @r4victor in https://github.com/dstackai/dstack/pull/1522
- `root` user in custom Docker images by @jvstme in https://github.com/dstackai/dstack/pull/1538
- `dstack` VM images by @jvstme in https://github.com/dstackai/dstack/pull/1536
- `nvcc` property by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1526
- `env` for on-prem fleets #1527 by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1530
- `dstack` image version to `0.5` by @jvstme in https://github.com/dstackai/dstack/pull/1541
All changes: https://github.com/dstackai/dstack/compare/0.18.9...0.18.10
Published by peterschmidt85 2 months ago
`nvcc`
If you don't specify a custom Docker image, `dstack` uses its own base image with essential CUDA drivers, `python`, `pip`, and `conda` (Miniforge). Previously, this image didn't include `nvcc`, which is needed for compiling custom CUDA kernels (e.g., Flash Attention).
With version 0.18.9, you can now include `nvcc`.
type: task
python: "3.10"
# This line ensures `nvcc` is included into the base Docker image
nvcc: true
commands:
- pip install -r requirements.txt
- python train.py
resources:
gpu: 24GB
When you create an on-prem fleet, it's now possible to pre-configure environment variables. These variables will be used when installing the `dstack-shim` service on hosts and running workloads.
For example, these environment variables can be used to configure `dstack` to use a proxy:
type: fleet
name: my-fleet
placement: cluster
env:
- HTTP_PROXY=http://proxy.example.com:80
- HTTPS_PROXY=http://proxy.example.com:80
- NO_PROXY=localhost,127.0.0.1
ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 3.255.177.51
- 3.255.177.52
New examples include:
- `shim.log` by @jvstme in https://github.com/dstackai/dstack/pull/1503
- `env` setting to fleet config for on-prem fleets by @un-def in https://github.com/dstackai/dstack/pull/1505
Full changelog: https://github.com/dstackai/dstack/compare/0.18.8...0.18.9
Published by r4victor 3 months ago
#1477 added support for `gcp` volumes:
type: volume
name: my-gcp-volume
backend: gcp
region: europe-west1
size: 100GB
Previously, volumes were only supported for `aws` and `runpod`.
#1486 fixed a major bug introduced in 0.18.7 that could lead to instances not being terminated in the cloud.
Full Changelog: https://github.com/dstackai/dstack/compare/0.18.7...0.18.8
Published by peterschmidt85 3 months ago
With fleets, you can now describe clusters declaratively and create them in both cloud and on-prem with a single command. Once a fleet is created, it can be used with dev environments, tasks, and services.
To provision a fleet in the cloud, specify the required resources, number of nodes, and other optional parameters.
type: fleet
name: my-fleet
placement: cluster
nodes: 2
resources:
gpu: 24GB
To create a fleet from on-prem servers, specify their hosts along with the user, port, and SSH key for connection via SSH.
type: fleet
name: my-fleet
placement: cluster
ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 3.255.177.51
- 3.255.177.52
To create or update the fleet, simply call the dstack apply command:
dstack apply -f examples/fleets/my-fleet.dstack.yml
Learn more about fleets in the documentation.
`dstack run`
Now that we support `dstack apply` for gateways, volumes, and fleets, we have extended this support to dev environments, tasks, and services. Instead of using `dstack run WORKING_DIR -f CONFIG_FILE`, you can now use `dstack apply -f CONFIG_FILE`.
Also, it's now possible to specify a `name` for dev environments, tasks, and services, just like for gateways, volumes, and fleets.
type: dev-environment
name: my-ide
python: "3.11"
ide: vscode
resources:
gpu: 80GB
This `name` is used as a run name and is more convenient than a random name. However, if you don't specify a `name`, `dstack` will assign a random name as before.
In other news, we've added support for volumes in the `runpod` backend. Previously, they were only supported in the `aws` backend.
type: volume
name: my-new-volume
backend: runpod
region: ca-mtl-3
size: 100GB
A great feature of `runpod` volumes is their ability to attach to multiple instances simultaneously. This allows for persisting cache across multiple service replicas or supporting distributed training tasks.
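As a sketch of that use case, a service with several replicas could mount the same volume as a shared cache (the `replicas` count, image, and path below are illustrative, not from the release notes):

```yaml
type: service
name: my-shared-cache-service
image: ghcr.io/huggingface/text-generation-inference:latest
commands:
  - text-generation-launcher --port 8000
port: 8000
replicas: 2   # both replicas attach the same runpod volume
volumes:
  - name: my-new-volume
    path: /data/cache   # shared cache directory
resources:
  gpu: 24GB
```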
[!IMPORTANT]
This update fixes the broken `kubernetes` backend, which has been non-functional for the past few updates.
- `--gpu` override YAML's `gpu` by @r4victor in https://github.com/dstackai/dstack/pull/1455
- `busy` offers from the top of offers list by @jvstme in https://github.com/dstackai/dstack/pull/1452
- `dstack volume delete` by @r4victor in https://github.com/dstackai/dstack/pull/1434
- `provisioning` for container backends by @r4victor
- [Docs] Fix typos by @jvstme in https://github.com/dstackai/dstack/pull/1426
- `DSTACK_SENTRY_PROFILES_SAMPLE_RATE` by @r4victor in https://github.com/dstackai/dstack/pull/1428
- `ruff` to `0.5.3` by @jvstme in https://github.com/dstackai/dstack/pull/1421
- `--gpu` override YAML's `gpu` by @r4victor in https://github.com/dstackai/dstack/pull/1455
- `regions` for `runpod` by @r4victor in https://github.com/dstackai/dstack/pull/1460
**Full changelog**: https://github.com/dstackai/dstack/compare/0.18.6...0.18.7
Published by peterschmidt85 3 months ago
This is a preview build of the upcoming `0.18.7` update, bringing a few major new features and many bug fixes.
[!IMPORTANT]
With fleets, you can now describe clusters declaratively and create them in both cloud and on-prem with a single command. Once a fleet is created, it can be used with dev environments, tasks, and services.
To provision a fleet in the cloud, specify the required resources, number of nodes, and other optional parameters.
type: fleet
name: my-fleet
placement: cluster
nodes: 2
resources:
gpu: 24GB
To create a fleet from on-prem servers, specify their hosts along with the user, port, and SSH key for connection via SSH.
type: fleet
name: my-fleet
placement: cluster
ssh_config:
user: ubuntu
identity_file: ~/.ssh/id_rsa
hosts:
- 3.255.177.51
- 3.255.177.52
To create or update the fleet, simply call the dstack apply command:
dstack apply -f examples/fleets/my-fleet.dstack.yml
Learn more about fleets in the documentation.
`dstack run`
[!IMPORTANT]
Now that we support `dstack apply` for gateways, volumes, and fleets, we have extended this support to dev environments, tasks, and services. Instead of using `dstack run WORKING_DIR -f CONFIG_FILE`, you can now use `dstack apply -f CONFIG_FILE`.
Also, it's now possible to specify a `name` for dev environments, tasks, and services, just like for gateways, volumes, and fleets.
type: dev-environment
name: my-ide
python: "3.11"
ide: vscode
resources:
gpu: 80GB
This `name` is used as a run name and is more convenient than a random name. However, if you don't specify a `name`, `dstack` will assign a random name as before.
[!IMPORTANT]
In other news, we've added support for volumes in the `runpod` backend. Previously, they were only supported in the `aws` backend.
type: volume
name: my-new-volume
backend: runpod
region: ca-mtl-3
size: 100GB
A great feature of `runpod` volumes is their ability to attach to multiple instances simultaneously. This allows for persisting cache across multiple service replicas or supporting distributed training tasks.
[!IMPORTANT]
This update fixes the broken `kubernetes` backend, which has been non-functional for the past few updates.
- `--gpu` override YAML's `gpu` by @r4victor in https://github.com/dstackai/dstack/pull/1455
- `busy` offers from the top of offers list by @jvstme in https://github.com/dstackai/dstack/pull/1452
- `dstack volume delete` by @r4victor in https://github.com/dstackai/dstack/pull/1434
- `provisioning` for container backends by @r4victor
- [Docs] Fix typos by @jvstme in https://github.com/dstackai/dstack/pull/1426
- `DSTACK_SENTRY_PROFILES_SAMPLE_RATE` by @r4victor in https://github.com/dstackai/dstack/pull/1428
- `ruff` to `0.5.3` by @jvstme in https://github.com/dstackai/dstack/pull/1421
**Full changelog**: https://github.com/dstackai/dstack/compare/0.18.6...0.18.7rc2
Published by peterschmidt85 3 months ago
- `H100` with the `gcp` backend by @jvstme in https://github.com/dstackai/dstack/pull/1405
[!WARNING]
If you have idle instances in your pool, it is recommended to re-create them after upgrading to version 0.18.6. Otherwise, there is a risk that these instances won't be able to execute jobs.
- `dstack-runner` repo tests by @jvstme in https://github.com/dstackai/dstack/pull/1418
Full changelog: https://github.com/dstackai/dstack/compare/0.18.5...0.18.6
Published by peterschmidt85 3 months ago
Read below about its new features and bug fixes.
When you run anything with `dstack`, it allows you to configure the disk size. However, once the run is finished, if you haven't stored your data in any external storage, all the data on disk will be erased. With `0.18.5`, we're adding support for network volumes that allow data to persist across runs.
Once you've created a volume (e.g., named `my-new-volume`), you can attach it to a dev environment, task, or service.
type: dev-environment
ide: vscode
volumes:
- name: my-new-volume
path: /volume_data
The data stored in the volume will persist across runs.
`dstack` allows you to create new volumes and register existing ones. To learn more about how volumes work, check out the docs.
[!IMPORTANT]
Volumes are currently experimental and only work with the `aws` backend. Support for other backends is coming soon.
By default, `dstack` stores its state in `~/.dstack/server/data` using SQLite. With this update, it's now possible to configure `dstack` to store its state in PostgreSQL. Just pass the `DSTACK_DATABASE_URL` environment variable.
DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" dstack server
[!IMPORTANT]
Despite PostgreSQL support, `dstack` still requires that you run only one instance of the `dstack` server. However, this requirement will be lifted in a future update.
Previously, `dstack` didn't allow the use of on-prem clusters (added via `dstack pool add-ssh`) if there were no backends configured. This update fixes that bug. Now, you don't have to configure any backends if you only plan to use on-prem clusters.
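For example, with no backends configured at all, a single on-prem host can be added straight to the pool (the key path and address below are placeholders):

```shell
# Add an on-prem server to the pool over SSH; no cloud backend required
dstack pool add-ssh -i ~/.ssh/id_rsa ubuntu@203.0.113.10
```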
Previously, `dstack` didn't support `L4` and `H100` GPUs with AWS. Now you can use them.
- `dstack` VM images by @jvstme in https://github.com/dstackai/dstack/pull/1389
- `dstack` Docker images by @jvstme in https://github.com/dstackai/dstack/pull/1391
- `pool add-ssh` by @jvstme in https://github.com/dstackai/dstack/pull/1396
See more: https://github.com/dstackai/dstack/compare/0.18.4...0.18.5
Published by peterschmidt85 3 months ago
This is a release candidate build of the upcoming `0.18.5` release. Read below to learn about its new features and bug fixes.
When you run anything with `dstack`, it allows you to configure the disk size. However, once the run is finished, if you haven't stored your data in any external storage, all the data on disk will be erased. With `0.18.5`, we're adding support for network volumes that allow data to persist across runs.
Once you've created a volume (e.g., named `my-new-volume`), you can attach it to a dev environment, task, or service.
type: dev-environment
ide: vscode
volumes:
- name: my-new-volume
path: /volume_data
The data stored in the volume will persist across runs.
`dstack` allows you to create new volumes and register existing ones. To learn more about how volumes work, check out the docs.
[!IMPORTANT]
Volumes are currently experimental and only work with the `aws` backend. Support for other backends is coming soon.
By default, `dstack` stores its state in `/root/.dstack/server/data` using SQLite. With this update, it's now possible to configure `dstack` to store its state in PostgreSQL. Just pass the `DSTACK_DATABASE_URL` environment variable.
DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" dstack server
[!IMPORTANT]
Despite PostgreSQL support, `dstack` still requires that you run only one instance of the `dstack` server. However, this requirement will be lifted in a future update.
Previously, `dstack` didn't allow the use of on-prem clusters (added via `dstack pool add-ssh`) if there were no backends configured. This update fixes that bug. Now, you don't have to configure any backends if you only plan to use on-prem clusters.
Previously, `dstack` didn't support `L4` and `H100` GPUs with AWS. Now you can use them.
- `dstack` VM images by @jvstme in https://github.com/dstackai/dstack/pull/1389
- `dstack` Docker images by @jvstme in https://github.com/dstackai/dstack/pull/1391
- `pool add-ssh` by @jvstme in https://github.com/dstackai/dstack/pull/1396
See more: https://github.com/dstackai/dstack/compare/0.18.4...0.18.5rc1
Published by peterschmidt85 4 months ago
This update introduces initial support for Google Cloud TPU.
To request a TPU, specify the TPU architecture prefixed by `tpu-` (in `gpu` under `resources`):
type: task
python: "3.11"
commands:
- pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
- git clone --recursive https://github.com/pytorch/xla.git
- python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
gpu: tpu-v2-8
[!IMPORTANT]
Currently, you can only request 8 TPU cores, meaning only single TPU device workloads are supported. Support for multiple TPU devices is coming soon.
Additionally, the update allows configuring the `gcp` backend to use only private subnets. To achieve this, set `public_ips` to `false`.
projects:
- name: main
backends:
- type: gcp
creds:
type: default
public_ips: false
Besides TPU, the update fixes a few important bugs.
- `cudo` backend stuck && Improve docs for `cudo` by @smokfyz in https://github.com/dstackai/dstack/pull/1347
- `nvidia-smi` not available on `lambda` by @r4victor in https://github.com/dstackai/dstack/pull/1357
- `registry_auth` for RunPod by @smokfyz in https://github.com/dstackai/dstack/pull/1333
- `oci` by @jvstme in https://github.com/dstackai/dstack/pull/1334
- `ssh` version by @loghijiaha in https://github.com/dstackai/dstack/pull/1313
- `oci` Bare Metal instances by @jvstme in https://github.com/dstackai/dstack/pull/1325
- `oci` `BM.Optimized3.36` instance by @jvstme in https://github.com/dstackai/dstack/pull/1328
- `dstack pool` docs by @jvstme in https://github.com/dstackai/dstack/pull/1329
- `gcp` by @Bihan in https://github.com/dstackai/dstack/pull/1323
- `runner-test` workflow by @r4victor in https://github.com/dstackai/dstack/pull/1336
- `22`, `80`, and `443` by @smokfyz in https://github.com/dstackai/dstack/pull/1335
- `serve.dstack.yml` - infinity by @michaelfeil in https://github.com/dstackai/dstack/pull/1340
- `/api/pools/list_instances` by @r4victor in https://github.com/dstackai/dstack/pull/1320
- `gcp` VPC config when provisioning TPUs by @r4victor in https://github.com/dstackai/dstack/pull/1332
Full changelog: https://github.com/dstackai/dstack/compare/0.18.3...0.18.4
Published by peterschmidt85 4 months ago
This is a preview build of the upcoming `0.18.4` release. See below for what's new.
One of the major new features in this update is the initial support for Google Cloud TPU.
To request a TPU, you simply need to specify the system architecture of the required TPU prefixed by `tpu-` in `gpu`:
type: task
python: "3.11"
commands:
- pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
- git clone --recursive https://github.com/pytorch/xla.git
- python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
gpu: tpu-v2-8
[!IMPORTANT]
You cannot request multiple nodes for tasks (for running in parallel on multiple TPU devices). This feature is coming soon.
You're very welcome to try the initial support and share your feedback.
Besides TPU, the update fixes a few important bugs.
- `cudo` backend stuck && Improve docs for `cudo` by @smokfyz in https://github.com/dstackai/dstack/pull/1347
- `nvidia-smi` not available on `lambda` by @r4victor in https://github.com/dstackai/dstack/pull/1357
- `registry_auth` for RunPod by @smokfyz in https://github.com/dstackai/dstack/pull/1333
- `oci` by @jvstme in https://github.com/dstackai/dstack/pull/1334
- `ssh` version by @loghijiaha in https://github.com/dstackai/dstack/pull/1313
- `oci` Bare Metal instances by @jvstme in https://github.com/dstackai/dstack/pull/1325
- `oci` `BM.Optimized3.36` instance by @jvstme in https://github.com/dstackai/dstack/pull/1328
- `dstack pool` docs by @jvstme in https://github.com/dstackai/dstack/pull/1329
- `gcp` by @Bihan in https://github.com/dstackai/dstack/pull/1323
- `runner-test` workflow by @r4victor in https://github.com/dstackai/dstack/pull/1336
- `22`, `80`, and `443` by @smokfyz in https://github.com/dstackai/dstack/pull/1335
- `serve.dstack.yml` - infinity by @michaelfeil in https://github.com/dstackai/dstack/pull/1340
- `/api/pools/list_instances` by @r4victor in https://github.com/dstackai/dstack/pull/1320
- `gcp` VPC config when provisioning TPUs by @r4victor in https://github.com/dstackai/dstack/pull/1332
Full changelog: https://github.com/dstackai/dstack/compare/0.18.3...0.18.4rc3
Published by peterschmidt85 5 months ago
With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called `oci` and can be configured as follows:
projects:
- name: main
backends:
- type: oci
creds:
type: default
The supported credential types include `default` and `client`. If `default` is used, `dstack` automatically picks up the default OCI credentials from `~/.oci/config`.
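With `type: default`, `dstack` relies on the standard OCI CLI configuration file. A minimal `~/.oci/config` looks roughly like this (all values below are placeholders):

```ini
[DEFAULT]
user=ocid1.user.oc1..exampleuniqueID
fingerprint=20:3b:97:13:55:1c:5b:0d:d3:37:d8:50:4e:c5:3a:34
tenancy=ocid1.tenancy.oc1..exampleuniqueID
region=eu-frankfurt-1
key_file=~/.oci/oci_api_key.pem
```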
[!WARNING]
OCI support does not yet include spot instances, multi-node tasks, and gateways. These features are coming soon.
We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:
type: task
commands:
- python train.py
retry:
on_events: [no-capacity]
duration: 2h
Now, if you run such a task, `dstack` will keep trying to find capacity within 2 hours. Once capacity is found, `dstack` will run the task.
The `on_events` property also supports `error` (in case the run fails with an error) and `interruption` (if the run is using a spot instance and it was interrupted).
Previously, `dstack` only allowed retries when spot instances were interrupted.
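Combining the supported events, a task could, for instance, retry both on lack of capacity and on spot interruption (a sketch based on the properties described above):

```yaml
type: task
commands:
  - python train.py
retry:
  # Retry when there's no capacity or when a spot instance is interrupted
  on_events: [no-capacity, interruption]
  duration: 2h
```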
Previously, the `runpod` backend only allowed the use of Docker images with `/bin/bash` or `/bin/sh` as the entrypoint. Thanks to the fix on RunPod's side, `dstack` now allows the use of any Docker images.
Additionally, the `runpod` backend now also supports spot instances.
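For instance, a task can now combine an arbitrary image with a spot instance on `runpod` (the image and GPU values below are illustrative):

```yaml
type: task
# Any image works now, regardless of its entrypoint
image: nvcr.io/nvidia/pytorch:24.05-py3
backends: [runpod]
spot_policy: spot
commands:
  - python train.py
resources:
  gpu: 24GB
```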
The `gcp` backend now also allows configuring VPCs:
projects:
- name: main
backends:
- type: gcp
project_id: my-awesome-project
creds:
type: default
vpc_name: my-custom-vpc
The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify `vpc_project_id`.
Last but not least, for the `aws` backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:
projects:
- name: main
backends:
- type: aws
creds:
type: default
vpc_ids:
us-east-1: vpc-0a2b3c4d5e6f7g8h
default_vpcs: true
You just need to set `default_vpcs` to `true`.
- `ssh` backend by @r4victor in https://github.com/dstackai/dstack/pull/1278
- `dstack run` attached mode by @r4victor in https://github.com/dstackai/dstack/pull/1285
- `unreachable` instances by @r4victor in https://github.com/dstackai/dstack/pull/1286
- `dstack ps -v` by @r4victor in https://github.com/dstackai/dstack/pull/1301
- `dstack destroy` to `dstack delete` by @r4victor in https://github.com/dstackai/dstack/pull/1275
Published by peterschmidt85 5 months ago
With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called `oci` and can be configured as follows:
projects:
- name: main
backends:
- type: oci
creds:
type: default
The supported credential types include `default` and `client`. If `default` is used, `dstack` automatically picks up the default OCI credentials from `~/.oci/config`.
[!WARNING]
OCI support does not yet include spot instances, multi-node tasks, and gateways. These features will be added in upcoming updates.
We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:
type: task
commands:
- python train.py
retry:
on_events: [no-capacity]
duration: 2h
Now, if you run such a task, `dstack` will keep trying to find capacity within 2 hours. Once capacity is found, `dstack` will run the task.
The `on_events` property also supports `error` (in case the run fails with an error) and `interruption` (if the run is using a spot instance and it was interrupted).
Previously, `dstack` only allowed retries when spot instances were interrupted.
The `gcp` backend now also allows configuring VPCs:
projects:
- name: main
backends:
- type: gcp
project_id: my-awesome-project
creds:
type: default
vpc_name: my-custom-vpc
The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify `vpc_project_id`.
Last but not least, for the `aws` backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:
projects:
- name: main
backends:
- type: aws
creds:
type: default
vpc_ids:
us-east-1: vpc-0a2b3c4d5e6f7g8h
default_vpcs: true
You just need to set `default_vpcs` to `true`.
- `ssh` backend by @r4victor in https://github.com/dstackai/dstack/pull/1278
- `dstack run` attached mode by @r4victor in https://github.com/dstackai/dstack/pull/1285
- `unreachable` instances by @r4victor in https://github.com/dstackai/dstack/pull/1286
- `dstack ps -v` by @r4victor in https://github.com/dstackai/dstack/pull/1301
- `dstack destroy` to `dstack delete` by @r4victor in https://github.com/dstackai/dstack/pull/1275
Full changelog: https://github.com/dstackai/dstack/compare/0.18.2...0.18.3rc1
[!WARNING]
This is an RC build. Please report any bugs to the issue tracker. The final release is planned for later this week, and the official documentation and examples will be updated then.
Published by peterschmidt85 5 months ago
The `dstack pool add-ssh` command now supports the `--network` argument. Use this argument if you want to use multiple instances that share the same private network as a cluster to run multi-node tasks.
The `--network` argument accepts the IP address range (CIDR) of the private network of the instance.
Example:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected] --network 10.0.0.0/24
Once you've added multiple instances with the same network value, you'll be able to use them as a cluster to run multi-node tasks.
By default, `dstack` uses public IPs for SSH access to running instances, requiring public subnets in the VPC. The new update allows AWS instances to use private subnets instead.
To create instances only in private subnets, set `public_ips` to `false` in the AWS backend settings:
type: aws
creds:
type: default
vpc_ids:
...
public_ips: false
[!NOTE]
- Both `dstack server` and the `dstack` CLI should have access to the private subnet to access instances.
- If you want running instances to access the Internet, the private subnets need to have a NAT gateway.
`dstack apply`
Previously, to create or update gateways, one had to use the `dstack gateway create` or `dstack gateway update` commands.
Now, it's possible to define a gateway configuration via YAML and create or update it using the `dstack apply` command.
Example:
type: gateway
name: example-gateway
backend: gcp
region: europe-west1
domain: example.com
dstack apply -f examples/deployment/gateway.dstack.yml
For now, the `dstack apply` command only supports the `gateway` configuration type. Soon, it will also support `dev-environment`, `task`, and `service`, replacing the `dstack run` command.
The `dstack destroy` command can be used to delete resources.
By default, gateways are deployed using public subnets. Since `0.18.2`, it is now possible to deploy gateways using private subnets. To do this, you need to set `public_ip` to `false` and specify the ARN of a certificate from AWS Certificate Manager.
type: gateway
name: example-gateway
backend: aws
region: eu-west-1
domain: "example.com"
public_ip: false
certificate:
type: acm
arn: "arn:aws:acm:eu-west-1:3515152512515:certificate/3251511125--1241-1224-121251515125"
In this case, `dstack` will deploy the gateway in a private subnet behind a load balancer using the specified certificate.
[!NOTE]
Private gateways are currently supported only for AWS.
- `dstack pool add-ssh` instances by @TheBits in https://github.com/dstackai/dstack/pull/1189
- `runpod` by @Bihan in https://github.com/dstackai/dstack/pull/1119
- `ProjectModel` loading by @r4victor in https://github.com/dstackai/dstack/pull/1199
- `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1202
- `https` by @r4victor in https://github.com/dstackai/dstack/pull/1217
- `dstack apply` for gateways by @r4victor in https://github.com/dstackai/dstack/pull/1223
- `--network` with `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1225
- `network` passed to `dstack pool add-ssh` is not correct by @TheBits in https://github.com/dstackai/dstack/pull/1233
- `dstack-shim.service` with `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1253
- `dstack pool remove` to `rm` by @muddi900 in https://github.com/dstackai/dstack/pull/1258
- `--network` by @TheBits in https://github.com/dstackai/dstack/pull/1263
- `axolotl` example by @deep-diver in https://github.com/dstackai/dstack/pull/1187
Full Changelog: https://github.com/dstackai/dstack/compare/0.18.1...0.18.2
Published by peterschmidt85 6 months ago
Now you can add your own servers as pool instances:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected]
[!NOTE]
The server should be pre-installed with CUDA 12.1 and NVIDIA Docker.
All `.dstack/profiles.yml` properties can now be specified via run configurations:
type: dev-environment
ide: vscode
spot_policy: auto
backends: ["aws"]
regions: ["eu-west-1", "eu-west-2"]
instance_types: ["p3.8xlarge", "p3.16xlarge"]
max_price: 2.0
max_duration: 1d
Thanks to the contribution from @deep-diver, we got two new examples:

- `vpc_ids` in `server/config.yml`
- `~/.dstack/profiles.yml`
- `DSTACK_RUN_NAME`, `DSTACK_GPUS_NUM`, `DSTACK_NODES_NUM`, `DSTACK_NODE_RANK`, and `DSTACK_MASTER_NODE_IP`
- `A10` GPU on Azure

- `.dstack/profiles.yml` by @r4victor in https://github.com/dstackai/dstack/pull/1134
- `No such profile: None` when missing `.dstack/profiles.yml` by @r4victor in https://github.com/dstackai/dstack/pull/1135
- `run_job/create_instance` by @r4victor in https://github.com/dstackai/dstack/pull/1149
- `packer` -> `scripts/packer` by @jvstme in https://github.com/dstackai/dstack/pull/1153
- `executor_error` check being falsely positive by @TheBits in https://github.com/dstackai/dstack/pull/1160
- `vpc_ids` by @r4victor in https://github.com/dstackai/dstack/pull/1170
- `KeyError: 'IpPermissions'` when using AWS by @jvstme in https://github.com/dstackai/dstack/pull/1169
- `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1173
- `dstack pool add-ssh` by @TheBits in https://github.com/dstackai/dstack/pull/1178
- `HUGGING_FACE_HUB_TOKEN` by @r4victor in https://github.com/dstackai/dstack/pull/1184
- `DSTACK_*` by @r4victor in https://github.com/dstackai/dstack/pull/1185
Full Changelog: https://github.com/dstackai/dstack/compare/0.18.0...0.18.1rc2
Published by peterschmidt85 6 months ago
This is a preview of the upcoming `0.18.0` update. Read below to see what improvements it brings.

The update adds the long-awaited integration with RunPod, a distributed GPU cloud that offers GPUs at affordable prices.

To use RunPod, specify your RunPod API key in `~/.dstack/server/config.yml`:
```yaml
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9
```
Once the server is restarted, go ahead and run workloads:
Another major change in this update is the ability to run multi-node tasks over an interconnected cluster of instances. Simply set the `nodes` property of your task to the number of required nodes and run it.
```yaml
type: task
nodes: 2
commands:
  - git clone https://github.com/r4victor/pytorch-distributed-resnet.git
  - cd pytorch-distributed-resnet
  - mkdir -p data
  - cd data
  - wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  - tar -xvzf cifar-10-python.tar.gz
  - cd ..
  - pip3 install -r requirements.txt torch
  - mkdir -p saved_models
  - torchrun --nproc_per_node=$DSTACK_GPUS_PER_NODE
      --node_rank=$DSTACK_NODE_RANK
      --nnodes=$DSTACK_NODES_NUM
      --master_addr=$DSTACK_MASTER_NODE_IP
      --master_port=8008 resnet_ddp.py
      --num_epochs 20
resources:
  gpu: 1
```
Currently supported providers for this feature include AWS, GCP, and Azure. For other providers or on-premises servers, file the corresponding feature requests or ping on Discord.
One more small improvement: the `commands` property is no longer required for tasks and services if you use an image that has a default entrypoint configured.
```yaml
type: task
image: r8.im/bytedance/sdxl-lightning-4step
ports:
  - 5000
resources:
  gpu: 24GB
```
The update also improves the output of the `dstack server` command.

Last but not least, we've made the permissions required for using `dstack` with GCP more granular:
```
compute.disks.create
compute.firewalls.create
compute.images.useReadOnly
compute.instances.create
compute.instances.delete
compute.instances.get
compute.instances.setLabels
compute.instances.setMetadata
compute.instances.setTags
compute.networks.updatePolicy
compute.regions.list
compute.subnetworks.use
compute.subnetworks.useExternalIp
compute.zoneOperations.get
```
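If you want to grant exactly this set, the permissions above can be packaged into a GCP custom role. Below is a sketch of a role definition file; the role ID, title, and file name are illustrative, not part of the release:

```yaml
# Hypothetical custom role for the dstack GCP backend (permissions as of 0.18.0).
# Create it with:
#   gcloud iam roles create dstackCompute --project=<your-project> --file=dstack-role.yaml
title: dstack Compute
description: Minimal permissions for the dstack GCP backend
stage: GA
includedPermissions:
  - compute.disks.create
  - compute.firewalls.create
  - compute.images.useReadOnly
  - compute.instances.create
  - compute.instances.delete
  - compute.instances.get
  - compute.instances.setLabels
  - compute.instances.setMetadata
  - compute.instances.setTags
  - compute.networks.updatePolicy
  - compute.regions.list
  - compute.subnetworks.use
  - compute.subnetworks.useExternalIp
  - compute.zoneOperations.get
```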
- `username` filter to `/api/runs/list` by @r4victor in https://github.com/dstackai/dstack/pull/1068
- `replicas` by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1055
- `server/config.yml` reference documentation by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1077
- `runpod` backend by @Bihan in https://github.com/dstackai/dstack/pull/1063
- `terminate_idle_instance` by @TheBits in https://github.com/dstackai/dstack/pull/1081
- `dstack init` doesn't work with a remote Git repo by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1090
- `dstack server` output by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1088
- `dstack-shim` by @TheBits in https://github.com/dstackai/dstack/pull/1061
- `RetryPolicy.limit` to `RetryPolicy.duration` by @TheBits in https://github.com/dstackai/dstack/pull/1074
- `dstack version` configurable when deploying docs by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1095
- `dstack init` doesn't work with a local Git repo by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1096
- `create_instance()` on the `cudo` provider by @r4victor in https://github.com/dstackai/dstack/pull/1082
- `latest` Docker image and YAML scheme for pre-release builds by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1099
- `commands` optional in run configurations by @jvstme in https://github.com/dstackai/dstack/pull/1104
- `cudo` backend use non-gpu instances by @Bihan in https://github.com/dstackai/dstack/pull/1092
Full changelog: https://github.com/dstackai/dstack/compare/0.17.0...0.18.0rc3
This is not the final release yet. If you encounter any bugs, report them directly via issues, or on our Discord.
Published by peterschmidt85 7 months ago
The latest update previews service replicas and auto-scaling, and brings many other improvements.

Previously, `dstack` always served services as single replicas. While this is suitable for development, in production a service must automatically scale based on the load. That's why in `0.17.0`, we extended `dstack` with the capability to configure `replicas` (the number of replicas) as well as `scaling` (the auto-scaling policy).
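As a sketch, the new properties can be combined in a service configuration like this. The image, name, and scaling target below are illustrative; check the dstack documentation for the exact set of supported scaling metrics:

```yaml
type: service
name: llm-service                # illustrative name
image: ghcr.io/huggingface/text-generation-inference:latest
commands:
  - text-generation-launcher --port 8000
port: 8000
resources:
  gpu: 24GB
# Keep between 1 and 4 replicas, scaling on request load
replicas: 1..4
scaling:
  metric: rps                    # requests per second per replica (illustrative target)
  target: 10
```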
The update brings support for specifying regions and instance types (in `dstack run` and `.dstack/profiles.yml`).
Firstly, it's now possible to configure an environment variable in the configuration without hardcoding its value. Secondly, `dstack run` now inherits environment variables from the current process.
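A short sketch of both behaviors in a run configuration (the variable names are illustrative):

```yaml
type: task
env:
  # Value set directly in the configuration
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  # No value given: dstack run passes HUGGING_FACE_HUB_TOKEN from the current shell
  - HUGGING_FACE_HUB_TOKEN
commands:
  - echo $MODEL_ID
```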
For more details on these new features, check the changelog.
- `instance_type` via CLI and profiles by @r4victor in https://github.com/dstackai/dstack/pull/1023
- `shm_size` property in resources doesn't take effect by @peterschmidt85 in https://github.com/dstackai/dstack/pull/1007
- `vastai` doesn't show any offers since `0.16.0` by @iRohith in https://github.com/dstackai/dstack/pull/959
- `main` by @peterschmidt85 in https://github.com/dstackai/dstack/pull/992
Full changelog: https://github.com/dstackai/dstack/compare/0.16.5...0.17.0