TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Published by Shixiaowei02 about 2 months ago
Hi,
We are very pleased to announce the 0.12.0 version of TensorRT-LLM. This update includes:
- The ModelWeightsLoader is enabled for LLaMA family models (experimental), see docs/source/architecture/model-weights-loader.md.
- Additional model support in the LLM class.
- Speculative decoding updates, see docs/source/speculative_decoding.md.
- Supported the gelu_pytorch_tanh activation function, thanks to the contribution from @ttim in #1897.
- Added the chunk_length parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added the concurrency argument for gptManagerBenchmark.
- Supported sending requests with different beam widths through the executor API, see docs/source/executor.md#sending-requests-with-different-beam-widths.
- Added the --fast_build flag to the trtllm-build command (experimental).

- max_output_len is removed from the trtllm-build command; if you want to limit sequence length at engine build stage, specify max_seq_len instead.
- The use_custom_all_reduce argument is removed from trtllm-build.
- The multi_block_mode argument is moved from the build stage (trtllm-build and builder API) to the runtime.
- context_fmha_fp32_acc is moved to the runtime for decoder models.
- tp_size, pp_size and cp_size are removed from the trtllm-build command.
- The GptManager API is deprecated in favor of the executor API, and it will be removed in a future release of TensorRT-LLM.
- The cpp/include/tensorrt_llm/executor/version.h file is going to be generated.

- Supported the EXAONE model, see examples/exaone/README.md.
- ChatGLM family model updates, see examples/chatglm/README.md.
- Multimodal model updates, see examples/multimodal/README.md.

- Fixed cluster_infos defined in tensorrt_llm/auto_parallel/cluster_info.py, thanks to the contribution from @saeyoonoh in #1987.
- Fixed docs/source/reference/troubleshooting.md, thanks for the contribution from @hattizai in #1937.
- Propagated exclude_modules to weight-only quantization, thanks to the contribution from @fjosw in #2056.
- Fixed the engine build failure when the deduced max_seq_len is not an integer. (#2018)

- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.07-py3.
- The base Docker image for the TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.07-py3.

- Known issue: on Windows, importing the library in Python may fail with OSError: exception: access violation reading 0x0000000000000000. See Installing on Windows for workarounds.

Currently, there are two key branches in the project:
We are updating the main
branch regularly with new features, bug fixes and performance optimizations. The rel
branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
Published by kaiyux 3 months ago
Hi,
We are very pleased to announce the 0.11.0 version of TensorRT-LLM. This update includes:
- LLaMA enhancements (see examples/llama/README.md).
- Qwen enhancements, see examples/qwen/README.md.
- Phi enhancements, see examples/phi/README.md.
- GPT enhancements, see examples/gpt/README.md.
- Supported distil-whisper/distil-large-v3, thanks to the contribution from @IbrahimAmin1 in #1337.
- Added numQueuedRequests to the iteration stats log of the executor API.
- Added iterLatencyMilliSec to the iteration stats log of the executor API.

trtllm-build command:
- Migrated Whisper to the unified workflow (trtllm-build command), see documents: examples/whisper/README.md.
- max_batch_size in the trtllm-build command is switched to 256 by default.
- max_num_tokens in the trtllm-build command is switched to 8192 by default.
- Deprecated max_output_len and added max_seq_len.
- Removed the --weight_only_precision argument from the trtllm-build command.
- Removed the attention_qk_half_accumulation argument from the trtllm-build command.
- Removed the use_context_fmha_for_generation argument from the trtllm-build command.
- Removed the strongly_typed argument from the trtllm-build command.
- The default max_seq_len reads from the HuggingFace model config now.

- Renamed free_gpu_memory_fraction in ModelRunnerCpp to kv_cache_free_gpu_memory_fraction.
- Refactored the GptManager API: moved maxBeamWidth into TrtGptModelOptionalParams, and moved schedulerConfig into TrtGptModelOptionalParams.
- Added more options to ModelRunnerCpp, including max_tokens_in_paged_kv_cache, kv_cache_enable_block_reuse and enable_chunked_context (a usage sketch follows this list).
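For illustration, here is a minimal sketch of how the renamed and newly exposed ModelRunnerCpp options might be passed from Python. The engine directory and token IDs are placeholders, and any call details beyond the option names listed above are assumptions that may differ between releases.

```python
# Hypothetical usage sketch of the ModelRunnerCpp options named above.
# Engine path and token IDs are placeholders; exact signatures may vary by version.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="./engine_dir",                # placeholder engine directory
    kv_cache_free_gpu_memory_fraction=0.9,    # renamed from free_gpu_memory_fraction
    max_tokens_in_paged_kv_cache=4096,        # newly exposed option
    kv_cache_enable_block_reuse=True,         # newly exposed option
    enable_chunked_context=True,              # newly exposed option
)

# Generate from pre-tokenized input IDs (placeholder values).
output_ids = runner.generate(
    batch_input_ids=[torch.tensor([1, 2, 3, 4], dtype=torch.int32)],
    max_new_tokens=32,
    end_id=2,
    pad_id=2,
)
print(output_ids.shape)
```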
- Removed the ModelConfig class; all the options are moved to the LLM class.
- Refactored the LLM class, please refer to examples/high-level-api/README.md:
  - Exposed model to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
  - Added a build cache, enabled by setting the environment variable TLLM_HLAPI_BUILD_CACHE=1 or by passing enable_build_cache=True to the LLM class.
  - Exposed low-level options including BuildConfig, SchedulerConfig and so on in the kwargs; ideally you should be able to configure details about the build and runtime phases.
- Refactored the LLM.generate() and LLM.generate_async() APIs (see the sketch after this list):
  - Removed SamplingConfig.
  - Added SamplingParams with more extensive parameters, see tensorrt_llm/hlapi/utils.py. The new SamplingParams contains and manages fields from the Python bindings of SamplingConfig, OutputConfig, and so on.
  - Refactored the LLM.generate() output as RequestOutput, see tensorrt_llm/hlapi/llm.py.
- Updated the apps examples, especially by rewriting both chat.py and fastapi_server.py using the LLM APIs, please refer to examples/apps/README.md for details:
  - Updated chat.py to support multi-turn conversation, allowing users to chat with a model in the terminal.
  - Fixed fastapi_server.py and eliminated the need for mpirun in multi-GPU scenarios.

- Introduced SpeculativeDecodingMode.h to choose between different speculative decoding techniques.
- Introduced the SpeculativeDecodingModule.h base class for speculative decoding techniques.
- Removed decodingMode.h.

gptManagerBenchmark:
- api in the gptManagerBenchmark command is executor by default now.
- Added a runtime max_batch_size.
- Added a runtime max_num_tokens.

- Added a bias argument to the LayerNorm module and supported non-bias layer normalization.
- Removed the GptSession Python bindings.
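To make the refactored high-level API above concrete, here is a minimal sketch. The model name is a placeholder, and the import path (following the tensorrt_llm/hlapi files referenced above) and parameter names are assumptions that may differ in other releases.

```python
# Minimal sketch of the refactored LLM / SamplingParams flow described above.
# Model name is a placeholder; import path and field names are assumptions.
from tensorrt_llm.hlapi import LLM, SamplingParams

# `model` accepts a HuggingFace model name, a local HuggingFace model,
# a TensorRT-LLM checkpoint, or a prebuilt TensorRT-LLM engine.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# SamplingParams replaces the removed SamplingConfig.
sampling_params = SamplingParams(max_new_tokens=64, temperature=0.8, top_p=0.95)

# generate() returns RequestOutput objects; the prompt is not echoed back.
for output in llm.generate(["Deep learning is"], sampling_params):
    print(output.outputs[0].text)
```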
- Supported Jais, see examples/jais/README.md.
- Supported DiT, see examples/dit/README.md.
- Supported Video NeVA, see the Video NeVA section in examples/multimodal/README.md.
- Supported Grok-1, see examples/grok/README.md.
- Additional Phi model support, see examples/phi/README.md.

- Fixed the top_k type in executor.py, thanks to the contribution from @vonjackustc in #1329.
- Fixed the qkv_bias shape issue for Qwen1.5-32B (#1589), thanks to the contribution from @Tlntin in #1637.
- Fixed an fpA_intB issue, thanks to the contribution from @JamesTheZ in #1583.
- Updated examples/qwenvl/requirements.txt, thanks to the contribution from @ngoanpv in #1248.
- Fixed an issue in lora_manager, thanks to the contribution from @TheCodeWrangler in #1669.
- Fixed a convert_hf_mpt_legacy call failure when the function is called in other than global scope, thanks to the contribution from @bloodeagle40234 in #1534.
- Fixed use_fp8_context_fmha broken outputs (#1539).
- Fixed the failure when quantize.py exports data to config.json, thanks to the contribution from @janpetrov: #1676
- Fixed the issue that shared_embedding_table is not being set when loading Gemma #1799, thanks to the contribution from @mfuntowicz.
- Fixed an issue in ModelRunner #1815, thanks to the contribution from @Marks101.
- Fixed a FAST_BUILD issue, thanks to the support from @lkm2835 in #1851.
- Fixed benchmarks/cpp/README.md for #1562 and #1552.

- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.05-py3.
- The base Docker image for the TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.05-py3.

- Known issue: on Windows, importing the library in Python may fail with OSError: exception: access violation reading 0x0000000000000000. This issue is under investigation.

Currently, there are two key branches in the project:
We are updating the main
branch regularly with new features, bug fixes and performance optimizations. The rel
branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
Published by kaiyux 4 months ago
Hi,
We are very pleased to announce the 0.10.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Enhancements to the executor API.
- Added the trtllm-refit command. For more information, refer to examples/sample_weight_stripping/README.md.
- Added weight streaming, see docs/source/advanced/weight-streaming.md.
- The --multiple_profiles argument in the trtllm-build command builds more optimization profiles now for better performance.
- Optimized the applyBiasRopeUpdateKVCache kernel by avoiding re-computation.
- Reduced overheads between enqueue calls of TensorRT engines.
- Added two debug options (--visualize_network and --dry_run) to the trtllm-build command to visualize the TensorRT network before engine build.
- Updated ModelRunnerCpp so that it runs with the executor API for IFB-compatible models.
- Improved AllReduce by adding a heuristic: fall back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance.
- Additional benchmarking support in gptManagerBenchmark.
- Added Time To the First Token (TTFT) latency and Inter-Token Latency (ITL) metrics for gptManagerBenchmark.
- Added the --max_attention_window option to gptManagerBenchmark.

- The default tokens_per_block argument of the trtllm-build command is changed to 64 for better performance.
- Renamed GptModelConfig to ModelConfig.
- Unified the SchedulerPolicy with the same name in batch_scheduler and executor, and renamed it to CapacitySchedulerPolicy.
- Extended SchedulerPolicy to SchedulerConfig to enhance extensibility. The latter also introduces a chunk-based configuration called ContextChunkingPolicy.
- The input prompt is no longer included in the output of the generate() and generate_async() APIs. For example, when given a prompt of A B, the original generation result could be <s>A B C D E where only C D E is the actual output; now the result is C D E.
- Changed the default add_special_token in the TensorRT-LLM backend to True.
- Deprecated GptSession and TrtGptModelV1.

- Fixed an issue with gather_all_token_logits. (#1284)
- Fixed gpt_attention_plugin for enc-dec models. (#1343)

- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.03-py3.
- The base Docker image for the TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.03-py3.

Currently, there are two key branches in the project:
We are updating the main
branch regularly with new features, bug fixes and performance optimizations. The rel
branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
Published by kaiyux 6 months ago
Hi,
We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Multimodal support updates, see examples/multimodal.
- Supported early_stopping=False in beam search for the C++ Runtime.
- Supported running GptSession without OpenMPI. #1220

executor API:
- Added Python bindings, see documentation and examples in examples/bindings (a usage sketch follows below).
- Added advanced and multi-GPU examples for the Python binding of the executor C++ API, see examples/bindings/README.md.
- Added documents for the executor API, see docs/source/executor.md.
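As a rough illustration of the executor API Python bindings mentioned above, here is a sketch modeled on the examples/bindings samples; the engine path and token IDs are placeholders, and the class and argument names are assumptions that may vary between releases.

```python
# Hypothetical sketch of the executor API Python bindings.
# Engine path and token IDs are placeholders; names may differ by version.
import tensorrt_llm.bindings.executor as trtllm

# Create an executor from a prebuilt decoder-only engine.
executor = trtllm.Executor("./engine_dir", trtllm.ModelType.DECODER_ONLY,
                           trtllm.ExecutorConfig(max_beam_width=1))

# Enqueue a request with pre-tokenized input IDs and wait for the response.
request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=8)
request_id = executor.enqueue_request(request)
for response in executor.await_responses(request_id):
    if not response.has_error():
        print(response.result.output_token_ids)
```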
High-level API updates (refer to examples/high-level-api/README.md for guidance):
- Reused the QuantConfig used in the trtllm-build tool to support broader quantization features.
- Supported the LLM() API to accept engines built by the trtllm-build command.
- Added SamplingConfig, used in the LLM.generate or LLM.generate_async APIs, with support for beam search, a variety of penalties, and more features.
- Added support for the StreamingLLM feature; enable it by using LLM(streaming_llm=...).

- Migrated Qwen to the unified workflow, see examples/qwen/README.md for the latest commands.
- Updated the GPT workflow, see examples/gpt/README.md for the latest commands.
- Reworked an option of the trtllm-build command to generalize the feature better to more models; use trtllm-build --max_prompt_embedding_table_size instead.
- Changed the trtllm-build --world_size flag to the --auto_parallel flag; the option is used for the auto parallel planner only.
- AsyncLLMEngine is removed; the tensorrt_llm.GenerationExecutor class is refactored to work with both explicitly launching with mpirun at the application level and accepting an MPI communicator created by mpi4py.
- The examples/server examples are removed, see examples/app instead.
- Removed the model parameter from gptManagerBenchmark and gptSessionBenchmark.

- Fixed an issue when encoder_input_len_range is not 0, thanks to the contribution from @Eddie-Wang1120 in #992.
- Fixed the end_id issue for Qwen. #987
- Fixed a wrong head_size when importing the Gemma model from HuggingFace Hub, thanks for the contribution from @mfuntowicz in #1148.
- Fixed SamplingConfig tensors in ModelRunnerCpp. #1183
- Fixed the issue that examples/run.py only loads one line from --input_file.
- Fixed the issue that ModelRunnerCpp does not transfer SamplingConfig tensor fields correctly. #1183

- gptManagerBenchmark updates, see benchmarks/cpp/README.md.
- Updated gptDecoderBatch to support batched sampling.

- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.02-py3.
- The base Docker image for the TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.02-py3.
Currently, there are two key branches in the project:
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
Published by kaiyux 8 months ago
Hi,
We are very pleased to announce the 0.8.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Supported the case when the temperature parameter of sampling configuration should be 0.
- Supported combining repetition_penalty and presence_penalty. #274
- Supported frequency_penalty. #275
- Supported the masked_select and cumsum functions for modeling.
- Removed the LayerNorm and RMSNorm plugins and the corresponding build parameters.
- Removed maxNumSequences for the GPT manager.
- Fixed an issue seen when --gather_all_token_logits is enabled. #639
- Fixed gptManagerBenchmark. #649
- Fixed InferenceRequest. #701
- Raised the default freeGpuMemoryFraction parameter from 0.85 to 0.9 for higher throughput.
- Disabled the enable_trt_overlap argument for the GPT manager by default.
- Added the docs/source/new_workflow.md documentation.

Currently, there are two key branches in the project:
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
Published by kaiyux 10 months ago
Hi,
We are very pleased to announce the 0.7.1 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- GptManager updates.
- Added a Python class ModelRunnerCpp that wraps the C++ gptSession.
- Workflow updates around the trtllm-build command (already applied to blip2 and OPT).
- Added StoppingCriteria and LogitsProcessor in the Python generate API (thanks to the contribution from @zhang-ge-hao).

Currently, there are two key branches in the project:
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
Published by kaiyux 11 months ago
Hi,
We are very pleased to announce the 0.6.1 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Updated the sequence_length tensor to support proper lengths in beam search (when beam width > 1; see tensorrt_llm/batch_manager/GptManager.h).
- Added an option to exclude the input from the output (see excludeInputInOutput in GptManager).
- Added Python bindings (see pybind).
- Added GptSession::Config::ctxMicroBatchSize and GptSession::Config::genMicroBatchSize (see tensorrt_llm/runtime/gptSession.h).
- Added support for returning context and generation logits (see mComputeContextLogits and mComputeGenerationLogits in tensorrt_llm/runtime/gptModelConfig.h).
- Added support for logProbs and cumLogProbs (see "output_log_probs" and "cum_log_probs" in GptManager).
- Fixed "tensor names (host_max_kv_cache_length) in engine are not the same as expected in the main branch". #369
- Fixed a failure with world_size = 2 ("array split does not result in an equal division"). #374
- Fixed end_id handling for various models [C++ and Python].
- Addressed "what is the difference between max_batch_size in the engine builder and max_num_sequences in TrtGptModelOptionalParams?" #65

Currently, there are two key branches in the project:
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently. The exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
Published by juney-nvidia 12 months ago