ScaleLLM | Cuda Ecosystem Directory

Bot releases are visible (Hide)

ScaleLLM - v0.1.3 Latest Release

Published by github-actions[bot] 5 months ago

Major changes

Model arg hotfix for llama3
Added more help functions

What's Changed

fix: load vocab_size first then use it to decide model type for model sharing between llama3, llama2 and Yi. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/230
feat: added with statement support to release memory and exposed help function for tokenizer by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/231

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.1.2...v0.1.3

ScaleLLM - v0.1.2

Published by github-actions[bot] 5 months ago

Major changes

set up github pages for docs https://docs.vectorch.com/
set up whl repository to host published whls: https://whl.vectorch.com/
support pip install with different versions: for example: pip install scalellm -i https://whl.vectorch.com/cu121/torch2.3/
added latency and system metrics
added initial monitoring dashboard.
bug fix for decoder, rejection sampler, and default value for llama2

What's Changed

ci: added workflow to publish docs to GitHub Pages by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/206
docs: added docs skeleton by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/207
docs: fixed source directory and added announcement by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/208
feat: added monitoring docker compose for prometheus and grafana by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/209
feat: Added prometheus metrics by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/210
feat: added token related latency metrics by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/211
fix: fix weight load issue for fused qkv and added more unittests for weight loading by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/213
fix: use a consistent version for whl by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/214
refactor: move setup.py to top level by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/217
feat: carry over prompt to output for feature parity by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/218
added missing changes for carrying over prompt by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/219
fix: set correct default value of rope_theta for llama2 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/223
feat: convert pickle to safetensors for fast loading by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/224
docs: add livehtml for docs development by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/225
fix: use error instead of CHECK when prompt input is empty by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/226
fix: avoid tensor convertion for converted ones. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/228
feat: added time_to_first_token and inter_token metrics for both stream and non-stream requests by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/227
fix: decode ending tokens one by one to handle unfinished tokens by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/229

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.1.1...v0.1.2

ScaleLLM - v0.1.1

Published by github-actions[bot] 5 months ago

What's Changed

[feat] added cuda 11.8 devel image to build cpp release image by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/194
[fix] fix workflow format by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/195
[CI] fix docker run options by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/196
fix: make build pass with gcc-9 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/197
ci: bump version and build with new manylinux image (gcc-9) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/198
[python] added more examples and fix requirments version by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/199
feat: moved scheduler wait logic from python into scheduler run_until_complete function by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/200
feat: added multiple threads support for LLMHandler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/201
fix: use a proper epsilon to avoid division by zero error for rejection sampler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/202
feat: added batch support for llm handler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/204
ci: publish wheels to whl index repo by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/205

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.1.0...v0.1.1

ScaleLLM - v0.1.0

Published by github-actions[bot] 5 months ago

Major changes:

Added python wrapper and published scalellm package to PyPI.
Supported openai-compatible rest api server. 'python3 -m scalellm.serve.api_server'
Install scalellm with pip: 'pip install scalellm'
Added examples for offline inference and async stream.

What's Changed

[fix] use the pybind11 from libtorch and fix model download issue. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/167
[misc] upgrade torch to 2.3 and use gcc-12 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/168
[feat] added python rest api server skeleton by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/169
[refactor] combine sequence and request outputs by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/170
[feat] added python LLMEngine skeleton by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/171
[refactor] move proto definitions into proto namespace by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/173
[feat] implement async llm engine for python wrapper by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/172
[refactor] consolidate handlers to share llm_handler between python rest api server and grpc server by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/174
[python] move request handling logic into seperate file from api server by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/175
[python] added model check for rest api by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/176
[feat] added status handling for grpc server by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/177
[misc] some changes to cmake file by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/180
[kernle] change head_dim list to reduce binary size by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/181
[CI] added base docker image for python wheel build by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/182
[ci] build python wheels by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/183
[CI] fix docker image issues and build wheel for different python, pytorch versions by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/184
[fix] added manylinux support by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/185
[fix] added cuda 11.8 support for manylinux by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/186
[feat] added version suffix to include cuda and torch version by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/187
[CI] Upload wheels to release as asserts by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/188
[fix] fix extension typo for wheel publish workflow by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/189
[python] added LLM for offline inference and stream examples for chat and complete by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/190
[python] added requirements into package by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/191
[Release] prepare 0.1.0 release by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/192
[Release] added workflow to publish whls to PyPI by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/193

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.9...v0.1.0

ScaleLLM - v0.0.9

Published by guocuimi 6 months ago

Major Changes

Enabled speculative decoding and updated README

What's Changed

[refactor] add implicit conversion between slice and vector by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/134
[refactor] change tokenizer special tokens from token to token + id. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/135
[feat] support tensor parallelism for MQA/GQA models when num_kv_heads < world_size by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/137
[refactor] refactoring for sequence by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/140
[unittest] added more unittests for speculative decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/141
[unittest] added more unittests for pos_embedding, sampler and rejection_sampler. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/142
[feat] added support for kv_cache with different strides. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/143
[feat] enable speculative decoding and update readme by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/145

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.8...v0.0.9

ScaleLLM - v0.0.8

Published by guocuimi 6 months ago

Major changes

Added Meta Llama3 and Google Gemma support
Added cuda graph support for decoding

What's Changed

[model] added support for google Gemma-2b model by @936187425 in https://github.com/vectorch-ai/ScaleLLM/pull/103
[feat] added rms norm residual kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/125
[fix] fix data accuracy issue for gemma by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/126
[refactor] added options for LLMEngine, SpeculativeEngine and Scheduler. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/127
[feat] enable cuda graph for decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/129
[bugfix] fix cuda graph capture issue for tensor parallelism by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/130
[feat] optimize batch size for cuda graph by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/132

New Contributors

@936187425 made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/103

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.7...v0.0.8

ScaleLLM - v0.0.7

Published by guocuimi 7 months ago

Major changes

Dynamic prefix cache
Dynamic split-fuse scheduler
Speculative decoding

What's Changed

[feat] add support for cudagraph and its unit test. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/79
[feat] add block id lifecycle management for block sharing scenarios. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/85
[feat] added prefix cache to share kv cache across sequences. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/86
[feat] enable prefix cache in block manager by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/87
[feat] added LRU policy into prefix cache. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/89
[refactor] move batch related logic into a class by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/90
[fix] replace submodules git path with https path to avoid permission issue. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/92
[feat] add max tokens to process to support dynamic split-fuse by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/93
[feat] return prompt string directly in echo mode to avoid decode cost and avoid showing appended prefix tokens. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/94
[fix] added small page size support for flash attention. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/95
[fix] adjust kv_cache_pos to give at least one token to generate logits by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/96
added layernorm benchmark by @dongxianzhe in https://github.com/vectorch-ai/ScaleLLM/pull/97
[feat] added dynamic split-fuse support in continuous scheduler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/98
[refactor] move model output process logic into batch by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/99
[feat] added engine type to allow LLM and SSM share sequence. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/100
[feat] added speculative engine class without implementation. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/101
[refactor] moved top_k and top_p from sampler to logits process. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/102
[workflow] added clang-format workflow by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/105
[fix] only run git-clang-format agains c/c++ files by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/106
[feat] added prompt blocks sharing across n sequences by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/107
[feat] Added selected tokens to return logits from model execution. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/109
[feat] added rejection sampler for speculative decoding. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/112
[feat] enable speculative decoding for simple server by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/113
[feat] mask out rejected tokens with -1 in Rejection Sampler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/114
[feat] added sampling support for multiple query decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/115
[feat] added stream support for n > 1 scenarios by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/116
[feat] enable speculative decoding for scalellm. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/117
[feat] cancel request if rpc is not ok by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/118
[fix] put finish reason into a separate response by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/119
[feat] added skip_special_tokens support for tokenizers by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/120

New Contributors

@dongxianzhe made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/97

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.6...v0.0.7

ScaleLLM - v0.0.6

Published by guocuimi 7 months ago

Major changes:

Introduced new kernels aimed at enhancing efficiency.
Implemented an initial Python wrapper, simplifying integration and extending accessibility.
Incorporated new models such as Baichuan2 and ChatGLM.
Added support for Jinja chat templates, enhancing customization and user interaction.
Added usage statistics into responses, ensuring compatibility with OpenAI APIs.
Enabled ccache to accelerate build speed, facilitating quicker development cycles.

What's Changed

add timestamp into ccache cache key by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/42
use ${GITHUB_SHA} in cache key by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/43
replace GITHUB_SHA with ${{ github.sha }} by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/44
encapsulate class of time for performance tracking. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/46
upgrade paged_atten kernel to v0.2.7 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/47
[feat] add speculative decoding. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/50
added a new attention kernel for speculative decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/52
added support for small page size. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/53
enable flash decoding for both prefill and decode phase. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/54
enable split-k for flash decoding and fix bugs. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/59
[ut] add unit tests for speculative scheduler. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/57
added a custom command to generate instantiation for flashinfer by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/61
add custom command to generate instantiation for flash-attn by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/62
added gpu memory profiling to decided kv cache size precisely. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/63
moved attention related files into attention subfolder by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/65
add pybind11 to support python user interface. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/64
added support to build python wrapper with installed pytorch ( pre-cxx11 abi) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/66
merge huggingface tokenizers and safetensors rust projects into one. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/67
more changes to support python wrapper by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/68
[feat] added attention handler for different implementations by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/71
[perf] enabled speed up for gpa and mqa decoding. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/72
[perf] use a seperate cuda stream for kv cache by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/73
[models] added baichuan/baichuan2 model support. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/70
[minor] cleanup redundant code for models. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/74
[feat] moved rope logic into attention handler to support apply positional embeding on the fly by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/76
[refactor] replace dtype and device with options since they are used together usually by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/77
[refactor] move cutlass and flashinfer into third_party folder by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/78
[refactor] split model forward function into two: 1> get hidden states 2> get logits from hidden states by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/80
[models] support both baichuan and baichuan2 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/81
[models] fix chatglm model issue. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/82

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.5...v0.0.6

ScaleLLM - v0.0.5

Published by guocuimi 10 months ago

Major changes

Added Qwen, ChatGLM and Phi2 support.
Added tiktoken tokenizer support.
Enabled more custom kernels for sampling.

What's Changed

[docs] add speculative decoding design docs. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/33
[docs] add devel image in CONTRIBUTING.md. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/35
[refactor] rename Executor to ThreadPool. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/36

New Contributors

@liutongxuan made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/33

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.4...v0.0.5

ScaleLLM - v0.0.4

Published by guocuimi 11 months ago

Major change:

Added docker image build for cuda 11.8.
Added exception handling logic in http server.

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.3-fix...v0.0.4

ScaleLLM - v0.0.3

Published by guocuimi 11 months ago

Added support for Yi Chat Model.
Added args overrider support.
Replaced libevhtp with boost asio for http server to fix epoll_wait not implemented error on old linux kernels.

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.2...v0.0.3-fix

ScaleLLM - v0.0.2

Published by guocuimi 12 months ago