ScaleLLM | Cuda Ecosystem Directory

ScaleLLM - v0.0.9

Published by guocuimi 6 months ago

Major Changes

Enabled speculative decoding and updated README

What's Changed

[refactor] add implicit conversion between slice and vector by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/134
[refactor] change tokenizer special tokens from token to token + id. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/135
[feat] support tensor parallelism for MQA/GQA models when num_kv_heads < world_size by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/137
[refactor] refactoring for sequence by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/140
[unittest] added more unittests for speculative decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/141
[unittest] added more unittests for pos_embedding, sampler and rejection_sampler. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/142
[feat] added support for kv_cache with different strides. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/143
[feat] enable speculative decoding and update readme by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/145
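
The tensor parallelism entry above (pull/137) covers MQA/GQA models that have fewer KV heads than ranks. The notes do not say how ScaleLLM implements this. One common approach is to replicate each KV head across a group of ranks, and the sketch below illustrates only that mapping. The function name kv_head_for_rank and the divisibility requirement are assumptions made for this example, not ScaleLLM code.

```python
def kv_head_for_rank(rank: int, world_size: int, num_kv_heads: int) -> int:
    """Return the index of the KV head a rank would load (hypothetical helper).

    Assumes num_kv_heads < world_size and that world_size is a multiple of
    num_kv_heads, so every KV head is replicated on the same number of ranks.
    """
    if world_size % num_kv_heads != 0:
        raise ValueError("world_size must be a multiple of num_kv_heads")
    replicas = world_size // num_kv_heads  # number of ranks sharing one KV head
    return rank // replicas


# Example: 2 KV heads spread over 8 ranks.
assignment = [kv_head_for_rank(r, world_size=8, num_kv_heads=2) for r in range(8)]
print(assignment)
```

With this mapping, query heads are still sharded across all ranks while each KV head is duplicated, which trades some extra KV cache memory for a simple layout.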

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.8...v0.0.9

ScaleLLM - v0.0.8

Published by guocuimi 6 months ago

Major changes

Added Meta Llama3 and Google Gemma support
Added cuda graph support for decoding

What's Changed

[model] added support for google Gemma-2b model by @936187425 in https://github.com/vectorch-ai/ScaleLLM/pull/103
[feat] added rms norm residual kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/125
[fix] fix data accuracy issue for gemma by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/126
[refactor] added options for LLMEngine, SpeculativeEngine and Scheduler. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/127
[feat] enable cuda graph for decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/129
[bugfix] fix cuda graph capture issue for tensor parallelism by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/130
[feat] optimize batch size for cuda graph by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/132

New Contributors

@936187425 made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/103

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.7...v0.0.8

ScaleLLM - v0.0.7

Published by guocuimi 7 months ago

Major changes

Dynamic prefix cache
Dynamic split-fuse scheduler
Speculative decoding
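
As a rough illustration of the prefix cache idea listed above, the sketch below shares KV cache blocks between sequences that start with the same tokens and evicts the least recently used entry when full. It is a simplified sketch, not ScaleLLM's implementation: the class PrefixCache, its methods, and the use of whole token prefixes as keys are all assumptions made for this example.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy prefix cache: maps full-block token prefixes to block ids (hypothetical)."""

    def __init__(self, block_size: int, capacity: int):
        self.block_size = block_size
        self.capacity = capacity          # maximum number of cached blocks
        self._blocks = OrderedDict()      # prefix tuple -> block id, in LRU order
        self._next_id = 0

    def insert(self, tokens):
        """Cache every full block of `tokens`, evicting the LRU entry when full."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self._blocks:
                self._blocks.move_to_end(key)     # mark as recently used
                continue
            if len(self._blocks) >= self.capacity:
                self._blocks.popitem(last=False)  # drop least recently used
            self._blocks[key] = self._next_id
            self._next_id += 1

    def match(self, tokens):
        """Return block ids for the longest cached prefix made of full blocks."""
        ids = []
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            if key not in self._blocks:
                break
            self._blocks.move_to_end(key)
            ids.append(self._blocks[key])
        return ids


cache = PrefixCache(block_size=2, capacity=8)
cache.insert([1, 2, 3, 4, 5])            # caches blocks for (1, 2) and (1, 2, 3, 4)
shared = cache.match([1, 2, 3, 4, 9, 9])
print(shared)
```

A real block manager would also track reference counts so that blocks still used by running sequences are never evicted. That bookkeeping is omitted here.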

What's Changed

[feat] add support for cudagraph and its unit test. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/79
[feat] add block id lifecycle management for block sharing scenarios. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/85
[feat] added prefix cache to share kv cache across sequences. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/86
[feat] enable prefix cache in block manager by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/87
[feat] added LRU policy into prefix cache. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/89
[refactor] move batch related logic into a class by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/90
[fix] replace submodules git path with https path to avoid permission issue. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/92
[feat] add max tokens to process to support dynamic split-fuse by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/93
[feat] return prompt string directly in echo mode to avoid decode cost and avoid showing appended prefix tokens. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/94
[fix] added small page size support for flash attention. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/95
[fix] adjust kv_cache_pos to give at least one token to generate logits by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/96
added layernorm benchmark by @dongxianzhe in https://github.com/vectorch-ai/ScaleLLM/pull/97
[feat] added dynamic split-fuse support in continuous scheduler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/98
[refactor] move model output process logic into batch by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/99
[feat] added engine type to allow LLM and SSM share sequence. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/100
[feat] added speculative engine class without implementation. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/101
[refactor] moved top_k and top_p from sampler to logits process. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/102
[workflow] added clang-format workflow by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/105
[fix] only run git-clang-format against c/c++ files by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/106
[feat] added prompt blocks sharing across n sequences by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/107
[feat] Added selected tokens to return logits from model execution. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/109
[feat] added rejection sampler for speculative decoding. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/112
[feat] enable speculative decoding for simple server by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/113
[feat] mask out rejected tokens with -1 in Rejection Sampler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/114
[feat] added sampling support for multiple query decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/115
[feat] added stream support for n > 1 scenarios by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/116
[feat] enable speculative decoding for scalellm. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/117
[feat] cancel request if rpc is not ok by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/118
[fix] put finish reason into a separate response by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/119
[feat] added skip_special_tokens support for tokenizers by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/120
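
The rejection sampler entries above (pull/112 and pull/114) mention masking rejected tokens with -1. The sketch below shows the acceptance test commonly used in speculative sampling, where a draft token is accepted with probability min(1, p_target / p_draft), combined with that masking convention. It is a simplified sketch under stated assumptions: the function name and its arguments are hypothetical, the uniform random numbers are passed in to keep the example deterministic, and the step that resamples a replacement token after a rejection is omitted.

```python
def rejection_sample(draft_tokens, draft_probs, target_probs, uniforms):
    """Accept or reject draft tokens in order (hypothetical helper).

    draft_probs[i] / target_probs[i]: probability that the draft / target model
    assigns to draft_tokens[i]. uniforms[i]: a sample from U[0, 1).
    Rejected positions, and every position after the first rejection, are -1.
    """
    out = []
    rejected = False
    for tok, q, p, u in zip(draft_tokens, draft_probs, target_probs, uniforms):
        # q > 0 is assumed, since the token was sampled from the draft model.
        if not rejected and u < min(1.0, p / q):
            out.append(tok)
        else:
            rejected = True   # tokens after a rejection cannot be kept
            out.append(-1)
    return out


result = rejection_sample(
    draft_tokens=[5, 7, 9],
    draft_probs=[0.5, 0.8, 0.5],
    target_probs=[0.6, 0.2, 0.5],
    uniforms=[0.9, 0.5, 0.1],
)
print(result)
```

In this example the first token is always accepted because the target model rates it at least as likely as the draft model does, and the second token is rejected, which also masks the third.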

New Contributors

@dongxianzhe made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/97

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.6...v0.0.7

ScaleLLM - v0.0.6

Published by guocuimi 7 months ago

Major changes

Introduced new kernels aimed at enhancing efficiency.
Implemented an initial Python wrapper, simplifying integration and extending accessibility.
Incorporated new models such as Baichuan2 and ChatGLM.
Added support for Jinja chat templates, enhancing customization and user interaction.
Added usage statistics into responses, ensuring compatibility with OpenAI APIs.
Enabled ccache to accelerate build speed, facilitating quicker development cycles.
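
For reference, the usage statistics in OpenAI-style responses are three token counts: prompt_tokens, completion_tokens and total_tokens. The sketch below builds such an object from token id lists. The helper name build_usage and the sample token ids are made up for this example and are not taken from ScaleLLM.

```python
def build_usage(prompt_token_ids, completion_token_ids):
    """Build an OpenAI-style `usage` object from token id lists (hypothetical helper)."""
    prompt_tokens = len(prompt_token_ids)
    completion_tokens = len(completion_token_ids)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }


# Illustrative token ids only.
usage = build_usage([1, 15043, 3186], [6324, 29991])
print(usage)
```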

What's Changed

add timestamp into ccache cache key by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/42
use ${GITHUB_SHA} in cache key by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/43
replace GITHUB_SHA with ${{ github.sha }} by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/44
encapsulate class of time for performance tracking. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/46
upgrade paged_atten kernel to v0.2.7 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/47
[feat] add speculative decoding. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/50
added a new attention kernel for speculative decoding by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/52
added support for small page size. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/53
enable flash decoding for both prefill and decode phase. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/54
enable split-k for flash decoding and fix bugs. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/59
[ut] add unit tests for speculative scheduler. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/57
added a custom command to generate instantiation for flashinfer by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/61
add custom command to generate instantiation for flash-attn by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/62
added gpu memory profiling to decide kv cache size precisely. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/63
moved attention related files into attention subfolder by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/65
add pybind11 to support python user interface. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/64
added support to build python wrapper with installed pytorch (pre-cxx11 abi) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/66
merge huggingface tokenizers and safetensors rust projects into one. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/67
more changes to support python wrapper by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/68
[feat] added attention handler for different implementations by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/71
[perf] enabled speed up for gqa and mqa decoding. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/72
[perf] use a separate cuda stream for kv cache by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/73
[models] added baichuan/baichuan2 model support. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/70
[minor] cleanup redundant code for models. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/74
[feat] moved rope logic into attention handler to support applying positional embedding on the fly by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/76
[refactor] replace dtype and device with options since they are used together usually by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/77
[refactor] move cutlass and flashinfer into third_party folder by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/78
[refactor] split model forward function into two: 1> get hidden states 2> get logits from hidden states by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/80
[models] support both baichuan and baichuan2 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/81
[models] fix chatglm model issue. by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/82

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.5...v0.0.6

ScaleLLM - v0.0.5

Published by guocuimi 10 months ago

Major changes

Added Qwen, ChatGLM and Phi2 support.
Added tiktoken tokenizer support.
Enabled more custom kernels for sampling.

What's Changed

[docs] add speculative decoding design docs. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/33
[docs] add devel image in CONTRIBUTING.md. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/35
[refactor] rename Executor to ThreadPool. by @liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/36

New Contributors

@liutongxuan made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/33

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.4...v0.0.5

ScaleLLM - v0.0.4

Published by guocuimi 11 months ago

Major changes

Added docker image build for cuda 11.8.
Added exception handling logic in http server.

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.3-fix...v0.0.4

ScaleLLM - v0.0.3

Published by guocuimi 11 months ago

Added support for Yi Chat Model.
Added args overrider support.
Replaced libevhtp with boost asio for http server to fix epoll_wait not implemented error on old linux kernels.

Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.2...v0.0.3-fix

ScaleLLM - v0.0.2

Published by guocuimi 12 months ago