TensorRT-LLM v0.11 Update #1969
Merged
TensorRT-LLM Release 0.11.0
Key Features and Enhancements
- … (see `examples/llama/README.md`).
- … see `examples/qwen/README.md`.
- … see `examples/phi/README.md`.
- … see `examples/gpt/README.md`.
- Added support for converting and running `distil-whisper/distil-large-v3`, thanks to the contribution from @IbrahimAmin1 in [feat]: Add Option to convert and run distil-whisper large-v3 #1337.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API (a small parsing sketch follows this list).
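Both new fields can be read straight off an iteration-stats record. A minimal sketch, assuming the stats are emitted as JSON lines; every field other than `numQueuedRequests` and `iterLatencyMilliSec` is a placeholder:

```python
import json

# Hypothetical iteration-stats record; only numQueuedRequests and
# iterLatencyMilliSec are named in the notes above, the other fields are made up.
stats_line = '{"iter": 42, "numQueuedRequests": 3, "iterLatencyMilliSec": 11.7}'

stats = json.loads(stats_line)
print(f'queued={stats["numQueuedRequests"]}, latency_ms={stats["iterLatencyMilliSec"]}')
```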
API Changes
- `trtllm-build` command
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see documents: `examples/whisper/README.md`.
  - `max_batch_size` in the `trtllm-build` command is switched to 256 by default.
  - `max_num_tokens` in the `trtllm-build` command is switched to 8192 by default.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - The default value of `max_seq_len` reads from the HuggingFace model config now.
- Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
- Refactored the `GptManager` API:
  - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
  - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
- Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- Removed the `ModelConfig` class; all of its options are moved to the `LLM` class.
- Refactored the `LLM` class (see the sketch after this list); please refer to `examples/high-level-api/README.md`.
  - Exposed `model` to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
  - Added a build cache that reuses built engines, enabled by setting `TLLM_HLAPI_BUILD_CACHE=1` or by passing `enable_build_cache=True` to the `LLM` class.
  - Exposed low-level options such as `BuildConfig`, `SchedulerConfig` and so on in the kwargs; ideally you should be able to configure details about the build and runtime phases.
- Refactored the `LLM.generate()` and `LLM.generate_async()` APIs.
  - Removed `SamplingConfig`.
  - Added `SamplingParams` with more extensive parameters, see `tensorrt_llm/hlapi/utils.py`. The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
  - Refactored the `LLM.generate()` output as `RequestOutput`, see `tensorrt_llm/hlapi/llm.py`.
- Updated the `apps` examples, especially by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs; please refer to `examples/apps/README.md` for details.
  - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
  - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- Introduced `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
- Introduced the `SpeculativeDecodingModule.h` base class for speculative decoding techniques.
- Removed `decodingMode.h`.
- `gptManagerBenchmark`
  - `api` in the `gptManagerBenchmark` command is `executor` by default now.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- Added a `bias` argument to the `LayerNorm` module, and supports non-bias layer normalization.
- Removed the `GptSession` Python bindings.
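As a rough illustration of the refactored high-level API above, here is a minimal sketch. The import path follows the `tensorrt_llm/hlapi/...` files referenced in these notes; the model name, the positional `generate()` call, and the `max_new_tokens` field name are assumptions, so check `examples/high-level-api/README.md` and `tensorrt_llm/hlapi/utils.py` for the exact parameter names.

```python
# Minimal sketch of the refactored LLM API, assuming the classes live in the
# tensorrt_llm.hlapi package referenced above (llm.py / utils.py).
from tensorrt_llm.hlapi import LLM, SamplingParams

# `model` accepts a HuggingFace model name or a local HuggingFace model /
# TensorRT-LLM checkpoint / TensorRT-LLM engine; enable_build_cache=True
# (or TLLM_HLAPI_BUILD_CACHE=1) reuses previously built engines.
llm = LLM(model="meta-llama/Llama-2-7b-hf",  # model name is only an example
          enable_build_cache=True)

# SamplingParams replaces the removed SamplingConfig; its fields are listed in
# tensorrt_llm/hlapi/utils.py. max_new_tokens here is an assumed field name.
params = SamplingParams(max_new_tokens=32)

# generate() returns RequestOutput objects (see tensorrt_llm/hlapi/llm.py).
for output in llm.generate(["What is the capital of France?"], params):
    print(output)
```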
Model Updates
- Supported Jais, see `examples/jais/README.md`.
- Supported DiT, see `examples/dit/README.md`.
- Supported Video NeVA, see the `Video NeVA` section in `examples/multimodal/README.md`.
- Supported Grok-1, see `examples/grok/README.md`.
- … see `examples/phi/README.md`.
Fixed Issues
- Fixed the wrong `top_k` type in `executor.py`, thanks to the contribution from @vonjackustc in Fix top_k type (float => int32) executor.py #1329.
- Fixed a `qkv_bias` shape issue for Qwen1.5-32B (when converting the Qwen 110B GPTQ checkpoint, the `qkv_bias` shape is not divisible by 3, #1589), thanks to the contribution from @Tlntin in fix up qkv.bias error when use qwen1.5-32b-gptq-int4 #1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in Fix the error of Ada traits for fpA_intB. #1583.
- Updated `examples/qwenvl/requirements.txt`, thanks to the contribution from @ngoanpv in Update requirements.txt #1248.
- Fixed rsLoRA scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in Fixed rslora scaling in lora_manager #1669.
- Fixed a `convert_hf_mpt_legacy` call failure when the function is called in other than global scope, thanks to the contribution from @bloodeagle40234 in Define hf_config explisitly for convert_hf_mpt_legacy #1534.
- Fixed `use_fp8_context_fmha` broken outputs (use_fp8_context_fmha broken outputs #1539).
- Fixed `quantize.py` failing to export important data to `config.json` (e.g. rotary scaling), thanks to the contribution from @janpetrov in quantize.py fails to export important data to config.json (eg rotary scaling) #1676.
- Fixed `shared_embedding_table` not being set when loading Gemma ([GEMMA] from_hugging_face not setting share_embedding_table to True leading to incapacity to load Gemma #1799), thanks to the contribution from @mfuntowicz.
- Fixed stop and bad words list contiguity for offsets in `ModelRunner` ([ModelRunner] Fix stop and bad words list contiguous for offsets #1815), thanks to the contribution from @Marks101.
- Added a `FAST_BUILD` comment at `#endif`, thanks to the support from @lkm2835 in Add FAST_BUILD comment at #endif #1851.
- Fixed `benchmarks/cpp/README.md` for issues #1562 (gptManagerBenchmark seems to go into a dead loop with GPU usage 0%) and #1552 (Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request).
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.05-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.05-py3`.
Known Issues
- On Windows, importing TensorRT-LLM may fail with `OSError: exception: access violation reading 0x0000000000000000`. This issue is under investigation.