diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md
index 7941b1f49ee8..7634cc0859ed 100644
--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -224,6 +224,6 @@ snakeviz expensive_function.prof
 
 Leverage VLLM_GC_DEBUG environment variable to debug GC costs.
 
--- VLLM_GC_DEBUG=1: enable GC debugger with gc.collect elpased times
+- VLLM_GC_DEBUG=1: enable GC debugger with gc.collect elapsed times
 - VLLM_GC_DEBUG='{"top_objects":5}': enable GC debugger to log top 5
   collected objects for each gc.collect
diff --git a/docs/design/io_processor_plugins.md b/docs/design/io_processor_plugins.md
index 2f4b17f191a5..91ab4deae71d 100644
--- a/docs/design/io_processor_plugins.md
+++ b/docs/design/io_processor_plugins.md
@@ -1,6 +1,6 @@
 # IO Processor Plugins
 
-IO Processor plugins are a feature that allows pre and post processing of the model input and output for pooling models. The idea is that users are allowed to pass a custom input to vLLM that is converted into one or more model prompts and fed to the model `encode` method. One potential use-case of such plugins is that of using vLLM for generating multi-modal data. Say users feed an image to vLLM and get an image in output.
+IO Processor plugins are a feature that allows pre- and post-processing of the model input and output for pooling models. The idea is that users can pass a custom input to vLLM that is converted into one or more model prompts and fed to the model's `encode` method. One potential use case for such plugins is using vLLM to generate multi-modal data: say, users feed an image to vLLM and get an image back as output.
 
 When performing an inference with IO Processor plugins, the prompt type is defined by the plugin and the same is valid for the final request output. vLLM does not perform any validation of input/output data, and it is up to the plugin to ensure the correct data is being fed to the model and returned to the user. As of now these plugins support only pooling models and can be triggered via the `encode` method in `LLM` and `AsyncLLM`, or in online serving mode via the `/pooling` endpoint.
diff --git a/docs/design/logits_processors.md b/docs/design/logits_processors.md
index acf7fc245462..8eadeb386fcf 100644
--- a/docs/design/logits_processors.md
+++ b/docs/design/logits_processors.md
@@ -411,7 +411,7 @@ Logits processor `update_state()` implementations should assume the following mo
 
     * **"Condense" the batch to be contiguous:** starting with the lowest-index empty slot (which was caused by a Remove), apply a Unidirectional Move from the current highest non-empty slot in the batch to fill the empty slot. Proceed with additional Unidirectional Move operations in order of increasing empty slot destination index and decreasing non-empty slot source index until the batch is contiguous
 
-    * **Shrink the batch:** a side-effect of condensing the batch is that empty slots resulting from Remove operations are grouped in a contiguous block at the end of the batch array. Thus, after condensing, update `BatchUpdate.batch_size` to reflect the number of non-empty slots
+    * **Shrink the batch:** a side effect of condensing the batch is that empty slots resulting from Remove operations are grouped in a contiguous block at the end of the batch array. Thus, after condensing, update `BatchUpdate.batch_size` to reflect the number of non-empty slots
 
 5. Reorder the batch for improved efficiency.
    Depending on the attention backend implementation and the current characteristics of the batch, zero or more Swap Move operations may be applied to reorder the batch
@@ -548,7 +548,7 @@ Built-in logits processors are always loaded when the vLLM engine starts. See th
 
 Review these logits processor implementations for guidance on writing built-in logits processors.
 
-Additionally, the following logits-processor-like functionalities are hard-coded into the sampler and do not yet utilize the programming model described above. Most of them will be refactored to use the aforemented logits processor programming model.
+Additionally, the following logits-processor-like functionalities are hard-coded into the sampler and do not yet utilize the programming model described above. Most of them will be refactored to use the aforementioned logits processor programming model.
 
 * Allowed token IDs
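The condense-and-shrink sequence touched by the logits_processors.md hunk above is compact enough to sketch in code. The following is a minimal illustration, not vLLM's implementation: it assumes a plain Python list in which `None` marks a slot emptied by a Remove, and it records hypothetical Unidirectional Move `(src, dst)` pairs such as a logits processor would replay in `update_state()`.

```python
# Minimal sketch of the "condense then shrink" model from
# docs/design/logits_processors.md. NOT vLLM's implementation: `batch`
# is a plain list where None marks a slot emptied by a Remove, and
# `moves` collects hypothetical Unidirectional Move (src, dst) pairs.

def condense_and_shrink(batch: list) -> tuple[list, list[tuple[int, int]], int]:
    moves: list[tuple[int, int]] = []
    empty = [i for i, req in enumerate(batch) if req is None]  # increasing
    tail = len(batch) - 1
    for dst in empty:
        # Find the current highest-index non-empty slot.
        while tail >= 0 and batch[tail] is None:
            tail -= 1
        if tail <= dst:
            break  # every slot at or past dst is already empty
        # Unidirectional Move: fill the lowest empty slot from the end.
        batch[dst], batch[tail] = batch[tail], None
        moves.append((tail, dst))
    # Shrink: the empty slots now form a contiguous block at the end,
    # so the new batch size is simply the count of non-empty slots.
    batch_size = sum(req is not None for req in batch)
    return batch, moves, batch_size


batch, moves, size = condense_and_shrink(["A", None, "C", None, "E"])
assert batch[:size] == ["A", "E", "C"] and size == 3 and moves == [(4, 1)]
```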
diff --git a/docs/features/disagg_prefill.md b/docs/features/disagg_prefill.md
index 3e8cb87e37d3..fd4f249f2ec6 100644
--- a/docs/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
@@ -91,6 +91,6 @@ Disaggregated prefilling is highly related to infrastructure, so vLLM relies on
 
 We recommend three ways of implementations:
 
--- **Fully-customized connector**: Implement your own `Connector`, and call third-party libraries to send and receive KV caches, and many many more (like editing vLLM's model input to perform customized prefilling, etc). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
+- **Fully-customized connector**: Implement your own `Connector`, and call third-party libraries to send and receive KV caches, and much more (like editing vLLM's model input to perform customized prefilling, etc.). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
 - **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
 - **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`.
diff --git a/docs/features/lora.md b/docs/features/lora.md
index 3a85b52d89b6..d42a3cef76bd 100644
--- a/docs/features/lora.md
+++ b/docs/features/lora.md
@@ -4,7 +4,7 @@ This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09
 
 LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].
 
-Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
+Adapters can be efficiently served on a per-request basis with minimal overhead. First we download the adapter(s) and save
 them locally with
 
 ```python
diff --git a/vllm/lora/ops/triton_ops/fused_moe_lora_op.py b/vllm/lora/ops/triton_ops/fused_moe_lora_op.py
index 893972144e99..e2dd47dbb4e6 100644
--- a/vllm/lora/ops/triton_ops/fused_moe_lora_op.py
+++ b/vllm/lora/ops/triton_ops/fused_moe_lora_op.py
@@ -154,7 +154,7 @@ def _fused_moe_lora_kernel(
         k_remaining = K - k * (BLOCK_SIZE_K * SPLIT_K)
         # pre-fetch lora weight
         b = tl.load(b_ptrs, mask=offs_k[:, None] < k_remaining, other=0.0)
-        # GDC wait waits for ALL programs in the the prior kernel to complete
+        # GDC wait waits for ALL programs in the prior kernel to complete
         # before continuing.
         if USE_GDC and not IS_PRIMARY:
            tl.extra.cuda.gdc_wait()
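For context on the lora.md hunk, the per-request flow it describes looks roughly like this. This is a sketch assembled from the surrounding documentation; the base model and adapter repo are illustrative stand-ins, not requirements:

```python
# Sketch of per-request LoRA serving following docs/features/lora.md.
# Model and adapter names are examples only; any SupportsLoRA base
# model with a matching adapter should work.
from huggingface_hub import snapshot_download

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# First download the adapter and save it locally.
adapter_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# The LoRARequest selects the adapter for this request only; requests
# made without one run against the base model.
outputs = llm.generate(
    "Write a SQL query that counts rows in the users table.",
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, adapter_path),
)
print(outputs[0].outputs[0].text)
```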