⚡️ Speed up function stack_conds by 18%
#7
📄 18% (0.18x) speedup for `stack_conds` in `modules/prompt_parser.py`

⏱️ Runtime: 5.74 milliseconds → 4.86 milliseconds (best of 152 runs)

📝 Explanation and details
The optimization achieves an 18% speedup through two key changes:

1. **Generator expression for the max calculation.** Changed `max([x.shape[0] for x in tensors])` to `max(x.shape[0] for x in tensors)`, eliminating the intermediate list allocation for a small memory-efficiency gain (see the first sketch after this list).
2. **More efficient tensor padding.** Replaced the two-step `repeat` + `vstack` approach with a single `torch.cat` + `expand` operation:
   - Before: `last_vector.repeat([pad_size, 1])` creates a new tensor copy, then `torch.vstack` concatenates it.
   - After: `last_vector.expand(pad_size, -1)` creates a memory-efficient view (no data copy), then `torch.cat` concatenates directly.

   The `expand` operation is significantly faster than `repeat` because it creates a view that shares memory rather than copying data. This is especially effective when padding tensors with large differences in length: test cases show 20-42% speedups for scenarios requiring substantial padding (e.g. `test_stack_conds_large_scale_varied_lengths`, with a 21.6% improvement). The second sketch after this list shows the change in context.

The optimization maintains identical functionality while reducing both memory allocations and tensor operations, making it particularly effective for workloads with many tensors that must be padded to a common length.
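A minimal before/after sketch of the first change, with a hypothetical `tensors` input for illustration:

```python
import torch

tensors = [torch.zeros(3, 4), torch.zeros(5, 4)]  # hypothetical example inputs

# Before: builds a throwaway list just so max() can scan it
token_count = max([x.shape[0] for x in tensors])

# After: the generator feeds max() lazily, skipping the list allocation
token_count = max(x.shape[0] for x in tensors)
assert token_count == 5
```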
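And a hedged reconstruction of the optimized function as a whole, assuming the 2-D conditioning tensors the explanation implies; the actual body of `stack_conds` in `modules/prompt_parser.py` may differ in its details, and the example shapes are hypothetical:

```python
import torch

def stack_conds(tensors):
    # Pad shorter tensors to the longest length so torch.stack can combine them.
    token_count = max(x.shape[0] for x in tensors)  # change 1: generator, no list
    for i in range(len(tensors)):
        pad_size = token_count - tensors[i].shape[0]
        if pad_size > 0:
            last_vector = tensors[i][-1:]  # last row, shape (1, dim)
            # Change 2. Before (two steps, extra copy):
            #   repeated = last_vector.repeat([pad_size, 1])  # materializes pad_size rows
            #   tensors[i] = torch.vstack([tensors[i], repeated])
            # After (one step, single copy): expand() returns a zero-copy broadcast
            # view of shape (pad_size, dim); torch.cat materializes the result once.
            tensors[i] = torch.cat([tensors[i], last_vector.expand(pad_size, -1)])
    return torch.stack(tensors)

# Example: two conditioning tensors of different lengths are padded and stacked.
conds = [torch.randn(77, 768), torch.randn(154, 768)]
print(stack_conds(conds).shape)  # torch.Size([2, 154, 768])
```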
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-stack_conds-mh9z4ejy` and push.