Conversation
@codeflash-ai codeflash-ai bot commented Oct 28, 2025

📄 18% (0.18x) speedup for `stack_conds` in `modules/prompt_parser.py`

⏱️ Runtime : 5.74 milliseconds → 4.86 milliseconds (best of 152 runs)

📝 Explanation and details

The optimization achieves an 18% speedup through two key changes:

**1. Generator expression for max calculation:**
Changed `max([x.shape[0] for x in tensors])` to `max(x.shape[0] for x in tensors)` to eliminate the intermediate list allocation, providing a small memory-efficiency gain.

**2. More efficient tensor padding:**
Replaced the two-step `repeat` + `vstack` approach with a single `torch.cat` + `expand` operation:

- **Original:** `last_vector.repeat([pad_size, 1])` creates a new tensor copy, then `torch.vstack` concatenates
- **Optimized:** `last_vector.expand(pad_size, -1)` creates a memory-efficient view (no data copy), then `torch.cat` concatenates directly

The `expand` operation is significantly faster than `repeat` because it creates a view that shares memory rather than copying data. This is especially effective when padding tensors with large differences in length; the test cases show 20-42% speedups for scenarios requiring substantial padding (such as `test_stack_conds_large_scale_varied_lengths`, with a 21.6% improvement).
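For illustration, here is a minimal sketch (not from the PR; the shapes are made up) of the difference between the two operations:

```python
import torch

last_vector = torch.randn(1, 768)  # stand-in for the final row of a cond tensor
pad_size = 75

view = last_vector.expand(pad_size, -1)   # zero-copy view: every row aliases the same memory
copy = last_vector.repeat([pad_size, 1])  # materializes pad_size * 768 new floats

assert view.data_ptr() == last_vector.data_ptr()  # shares the original storage
assert view.stride(0) == 0                        # rows advance by 0 elements
assert torch.equal(view, copy)                    # values are identical either way
```

Note that `torch.cat` still copies the expanded view into the output buffer, but that single copy replaces the two copies (`repeat`, then `vstack`) performed by the original code.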

The optimization maintains identical functionality while reducing both memory allocations and tensor operations, making it particularly effective for workloads with many tensors requiring padding to a common length.
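Putting both changes together, the optimized function plausibly looks like the following sketch, reconstructed from the description above (the actual diff is not shown in this comment):

```python
import torch

def stack_conds(tensors):
    # Pad every cond tensor to the longest first dimension, then stack.
    token_count = max(x.shape[0] for x in tensors)  # generator: no intermediate list
    for i in range(len(tensors)):
        if tensors[i].shape[0] != token_count:
            last_vector = tensors[i][-1:]
            pad_size = token_count - tensors[i].shape[0]
            # expand() is a zero-copy view; torch.cat copies it exactly once
            tensors[i] = torch.cat([tensors[i], last_vector.expand(pad_size, -1)], dim=0)
    return torch.stack(tensors)
```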

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 27 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
import torch  # needed for tensor creation and manipulation
from modules.prompt_parser import stack_conds

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_stack_conds_basic_equal_shapes():
    # All tensors have the same shape, so stacking should be straightforward
    tensors = [torch.ones(3, 4), torch.zeros(3, 4), torch.full((3, 4), 2)]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 30.9μs -> 30.2μs (2.29% faster)

def test_stack_conds_basic_different_lengths():
    # Tensors have different first dimensions, should pad to longest
    tensors = [
        torch.ones(2, 4),           # shape (2,4)
        torch.zeros(3, 4),          # shape (3,4)
        torch.full((1, 4), 7)       # shape (1,4)
    ]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 60.1μs -> 50.0μs (20.2% faster)

def test_stack_conds_basic_single_tensor():
    # Only one tensor, should return a stack with one element
    tensor = torch.arange(5).unsqueeze(1)  # shape (5,1)
    codeflash_output = stack_conds([tensor]); result = codeflash_output # 10.9μs -> 10.7μs (2.15% faster)

def test_stack_conds_basic_last_vector_repeat():
    # The last vector should be repeated when padding
    t1 = torch.tensor([[1,2],[3,4]])     # shape (2,2)
    t2 = torch.tensor([[5,6]])           # shape (1,2)
    codeflash_output = stack_conds([t1, t2]); result = codeflash_output # 38.8μs -> 27.9μs (39.2% faster)

# -------------------- EDGE TEST CASES --------------------

def test_stack_conds_empty_list():
    # No tensors provided, should raise an error (max() of empty sequence)
    with pytest.raises(ValueError):
        stack_conds([]) # 1.76μs -> 1.64μs (7.38% faster)


def test_stack_conds_different_column_sizes():
    # Tensors with different column sizes, stacking should fail
    t1 = torch.ones(3, 4)
    t2 = torch.ones(3, 5)
    with pytest.raises(RuntimeError):
        stack_conds([t1, t2]) # 55.3μs -> 56.1μs (1.48% slower)

def test_stack_conds_non_tensor_input():
    # Non-tensor input should fail
    t1 = torch.ones(3, 4)
    t2 = [[1,2,3,4],[5,6,7,8],[9,10,11,12]]  # not a tensor
    with pytest.raises(AttributeError):
        stack_conds([t1, t2]) # 2.58μs -> 3.15μs (18.0% slower)

def test_stack_conds_tensor_with_negative_shape():
    # Torch does not allow negative shapes, but let's check for robustness
    # Should raise error on creation, not in function
    with pytest.raises(RuntimeError):
        torch.empty(-1, 3)

def test_stack_conds_tensor_with_one_row():
    # Tensor with one row, should pad correctly
    t1 = torch.tensor([[1,2,3]])
    t2 = torch.tensor([[4,5,6],[7,8,9]])
    codeflash_output = stack_conds([t1, t2]); result = codeflash_output # 50.9μs -> 35.8μs (42.3% faster)

def test_stack_conds_tensor_with_zero_columns():
    # Tensors with zero columns, should stack as empty
    t1 = torch.empty(2, 0)
    t2 = torch.empty(3, 0)
    codeflash_output = stack_conds([t1, t2]); result = codeflash_output # 35.1μs -> 24.8μs (41.5% faster)

# -------------------- LARGE SCALE TEST CASES --------------------

def test_stack_conds_large_number_of_tensors():
    # Test with 100 tensors of shape (10, 10)
    tensors = [torch.ones(10, 10)*i for i in range(100)]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 41.7μs -> 41.5μs (0.407% faster)
    # Check a few values
    for idx in [0, 50, 99]:
        assert torch.equal(result[idx], torch.ones(10, 10) * idx)

def test_stack_conds_large_length_padding():
    # Tensors with varying lengths, up to 1000 rows
    tensors = [
        torch.ones(1000, 5),
        torch.zeros(999, 5),
        torch.full((998, 5), 2)
    ]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 71.2μs -> 62.4μs (14.1% faster)

def test_stack_conds_large_wide_tensor():
    # Tensors with large column count, but reasonable row count
    tensors = [
        torch.ones(5, 512),
        torch.zeros(3, 512),
        torch.full((4, 512), 7)
    ]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 58.0μs -> 49.2μs (17.7% faster)

def test_stack_conds_large_tensor_memory_limit():
    # Keep memory well under 100MB: each float32 (1000, 30) tensor is ~120KB,
    # so 30 tensors total roughly 3.6MB
    tensors = [torch.ones(1000, 30) for _ in range(30)]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 238μs -> 238μs (0.006% faster)
    # Check values
    for i in range(30):
        assert torch.equal(result[i], torch.ones(1000, 30))
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
import torch
from modules.prompt_parser import stack_conds

# unit tests

# Basic Test Cases

def test_stack_conds_basic_equal_shapes():
    # All tensors have the same shape, so stack should be direct
    t1 = torch.ones((4, 3))
    t2 = torch.zeros((4, 3))
    t3 = torch.full((4, 3), 2)
    codeflash_output = stack_conds([t1, t2, t3]); result = codeflash_output # 28.9μs -> 28.8μs (0.226% faster)

def test_stack_conds_basic_different_shapes():
    # Tensors have different first dimensions
    t1 = torch.ones((2, 3))
    t2 = torch.zeros((4, 3))
    t3 = torch.full((3, 3), 2)
    codeflash_output = stack_conds([t1, t2, t3]); result = codeflash_output # 55.7μs -> 47.0μs (18.5% faster)

def test_stack_conds_basic_single_tensor():
    # Only one tensor, should return a stack of shape (1, ...)
    t1 = torch.arange(6).reshape(2, 3)
    codeflash_output = stack_conds([t1]); result = codeflash_output # 10.5μs -> 9.78μs (7.11% faster)

def test_stack_conds_basic_empty_list():
    # Should raise an error if input is empty
    with pytest.raises(ValueError):
        stack_conds([]) # 1.79μs -> 1.76μs (1.88% faster)

# Edge Test Cases

def test_stack_conds_edge_minimal_tensor():
    # Tensors with shape (1, 1)
    t1 = torch.tensor([[1]])
    t2 = torch.tensor([[2]])
    codeflash_output = stack_conds([t1, t2]); result = codeflash_output # 15.2μs -> 15.4μs (1.19% slower)

def test_stack_conds_edge_single_row_padding():
    # One tensor is (1, N), others larger
    t1 = torch.ones((1, 5))
    t2 = torch.zeros((3, 5))
    codeflash_output = stack_conds([t1, t2]); result = codeflash_output # 43.2μs -> 37.5μs (15.1% faster)

def test_stack_conds_edge_different_widths():
    # Should raise error if tensors have different widths
    t1 = torch.ones((2, 3))
    t2 = torch.zeros((2, 4))
    with pytest.raises(RuntimeError):
        stack_conds([t1, t2]) # 52.9μs -> 54.0μs (1.99% slower)


def test_stack_conds_edge_non_tensor_input():
    # Input contains non-tensor
    t1 = torch.ones((2, 3))
    t2 = [[1, 2, 3], [4, 5, 6]]
    with pytest.raises(AttributeError):
        stack_conds([t1, t2]) # 2.92μs -> 3.88μs (24.7% slower)



def test_stack_conds_large_scale_many_tensors():
    # 500 tensors of shape (10, 10)
    tensors = [torch.full((10, 10), i) for i in range(500)]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 161μs -> 163μs (1.05% slower)

def test_stack_conds_large_scale_varied_lengths():
    # 100 tensors with lengths ranging from 1 to 100, width 8
    tensors = [torch.full((i, 8), i) for i in range(1, 101)]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 769μs -> 632μs (21.6% faster)
    # Check that last row of each tensor is repeated correctly
    for idx, t in enumerate(tensors):
        assert torch.all(result[idx, t.shape[0]:] == t[-1])

def test_stack_conds_large_scale_max_size():
    # Test near the 100MB limit: 100 tensors of (100, 10)
    # Each float32 tensor: 100*10*4 bytes = 4KB, 100 tensors = 400KB
    tensors = [torch.full((100, 10), i, dtype=torch.float32) for i in range(100)]
    codeflash_output = stack_conds(tensors); result = codeflash_output # 57.5μs -> 57.1μs (0.764% faster)

def test_stack_conds_large_scale_all_need_padding():
    # All tensors have length < max, all need padding
    tensors = [torch.full((50, 5), i) for i in range(10)]
    # Add one tensor with length 100
    tensors.append(torch.full((100, 5), 999))
    codeflash_output = stack_conds(tensors); result = codeflash_output # 116μs -> 97.6μs (19.2% faster)
    # Check that tensors[0..9] have their last row repeated from 50 to 100
    for i in range(10):
        assert torch.all(result[i, 50:] == i)

def test_stack_conds_large_scale_randomized():
    # Random lengths and values, up to 500 tensors, max length 100
    import random
    random.seed(42)
    tensors = []
    for i in range(500):
        length = random.randint(1, 100)
        value = random.randint(-100, 100)
        tensors.append(torch.full((length, 7), value))
    codeflash_output = stack_conds(tensors); result = codeflash_output # 3.73ms -> 3.08ms (21.0% faster)
    # Check that each tensor's padding matches its last row
    for idx, t in enumerate(tensors):
        assert torch.equal(result[idx, :t.shape[0]], t)
        assert torch.all(result[idx, t.shape[0]:] == t[-1])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, run `git checkout codeflash/optimize-stack_conds-mh9z4ejy` and push.

Codeflash

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 28, 2025 02:55
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Oct 28, 2025