8 changes: 4 additions & 4 deletions _posts/2025-04-23-pytorch-2-7.md
@@ -41,13 +41,13 @@ This release is composed of 3262 commits from 457 contributors since PyTorch 2.6
<tr>
<td>
</td>
-  <td>FlexAttention LLM <span style="text-decoration:underline;">first token processing</span> on X86 CPUs
+  <td>FlexAttention LLM <span style="text-decoration:underline;">first token processing</span> on x86 CPUs
</td>
</tr>
<tr>
<td>
</td>
-  <td>FlexAttention LLM <span style="text-decoration:underline;">throughput mode optimization</span> on X86 CPUs
+  <td>FlexAttention LLM <span style="text-decoration:underline;">throughput mode optimization</span> on x86 CPUs
</td>
</tr>
<tr>
@@ -135,9 +135,9 @@ For more information regarding Intel GPU support, please refer to [Getting Start
See also the tutorials [here](https://pytorch.org/tutorials/prototype/inductor_windows.html) and [here](https://pytorch.org/tutorials/prototype/pt2e_quant_xpu_inductor.html).


-### [Prototype] FlexAttention LLM first token processing on X86 CPUs
+### [Prototype] FlexAttention LLM first token processing on x86 CPUs

-FlexAttention X86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific *scaled_dot_product_attention* operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.
+FlexAttention x86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific *scaled_dot_product_attention* operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.


### [Prototype] FlexAttention LLM throughput mode optimization
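As context for the first-token-processing paragraph touched by this diff, here is a minimal sketch of the usage pattern it describes: calling the FlexAttention API in place of *scaled_dot_product_attention* and compiling it with torch.compile so the TorchInductor C++ backend generates the x86 CPU kernels. The tensor shapes, the causal mask, and the variable names below are illustrative assumptions, not code from the blog post or this PR.

```python
# Illustrative sketch only: FlexAttention for LLM prefill (first token processing)
# on an x86 CPU, compiled so TorchInductor's C++ backend handles the kernel.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    # Causal masking, the common case for LLM first token (prefill) attention.
    return q_idx >= kv_idx

# Assumed batch / head / sequence / head-dim sizes; CPU tensors throughout.
B, H, S, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cpu")

# torch.compile is what routes flex_attention through the TorchInductor C++ backend.
compiled_flex_attention = torch.compile(flex_attention)
out = compiled_flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```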