
Conversation


@codeflash-ai codeflash-ai bot commented Oct 28, 2025

📄 61% (0.61x) speedup for get_empty_batch_elements_indices in inference/core/workflows/execution_engine/v1/executor/execution_data_manager/step_input_assembler.py

⏱️ Runtime : 393 microseconds → 245 microseconds (best of 337 runs)

📝 Explanation and details

The optimized code replaces recursive function calls with an iterative approach using a stack, delivering a 60% speedup. Here's why this optimization is so effective:

Key Optimization: Recursive to Iterative Conversion

  • Original: Made recursive calls for every dict value, list element, and nested Batch, creating function call overhead and multiple intermediate result sets
  • Optimized: Uses a single stack to traverse the entire data structure iteratively, eliminating all recursive function calls
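The conversion can be sketched as follows. This is a minimal reconstruction, not the repository's exact code: it assumes a `Batch` type exposing `iter_with_indices()` (as in the test stubs further down), and the function names are illustrative.

```python
from typing import Any, Set


class Batch:
    """Minimal stand-in for the real Batch type."""

    def __init__(self, elements):
        self.elements = elements

    def iter_with_indices(self):
        for i, v in enumerate(self.elements):
            yield i, v


def get_empty_batch_elements_indices_recursive(value: Any) -> Set[int]:
    # Original shape: one recursive call (and one set union) per nested value.
    result: Set[int] = set()
    if isinstance(value, dict):
        for v in value.values():
            result = result.union(get_empty_batch_elements_indices_recursive(v))
    elif isinstance(value, list):
        for v in value:
            result = result.union(get_empty_batch_elements_indices_recursive(v))
    elif isinstance(value, Batch):
        for index, element in value.iter_with_indices():
            if isinstance(element, Batch):
                result = result.union(get_empty_batch_elements_indices_recursive(element))
            elif element is None:
                result.add(index)
    return result


def get_empty_batch_elements_indices_iterative(value: Any) -> Set[int]:
    # Optimized shape: one explicit stack, one result set, no recursion.
    result: Set[int] = set()
    stack = [value]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            stack.extend(current.values())
        elif isinstance(current, list):
            stack.extend(current)
        elif isinstance(current, Batch):
            for index, element in current.iter_with_indices():
                if isinstance(element, Batch):
                    stack.append(element)
                elif element is None:
                    result.add(index)
    return result


# Example: nested batch with None at inner[0] and outer[2]
inner = Batch([None, 1])
outer = Batch([inner, 2, None])
print(get_empty_batch_elements_indices_iterative(outer))  # {0, 2}
```

Note that indices from different batches collapse into one set (two sibling batches each with a `None` at index 0 contribute just `{0}`), which matches the behavior the generated tests below rely on.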

Performance Impact Analysis:

  1. Eliminates expensive set unions: The original code performed result.union(value_result) operations (5-6.2% of total time), creating new set objects repeatedly. The optimized version directly adds indices to a single result set.

  2. Reduces function call overhead: The line profiler shows the original made 2,251 recursive calls (lines with 1023+1228 hits), while the optimized version uses simple stack operations with no function call overhead.

  3. Better memory efficiency: Instead of creating intermediate result sets that get merged, the optimized version maintains one result set and one stack.

Test Case Performance Patterns:

  • Small/simple cases (basic batches): 30-40% slower, because the fixed cost of setting up the stack outweighs direct processing on tiny inputs
  • Medium complexity (nested lists/dicts): 2-20% faster, as stack efficiency starts to overcome the recursive overhead
  • Large-scale cases: 60-85% faster; the optimization shines with complex nested structures where recursive overhead dominates

The optimization is most beneficial for workloads with deeply nested or large collections of batches, where the original recursive approach created significant call stack and memory allocation overhead.
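A side benefit of the stack-based form, not benchmarked in this PR but worth noting for deeply nested inputs: it sidesteps Python's recursion limit. A hypothetical illustration with plain nested lists (helper names are made up for this sketch):

```python
import sys


def count_leaves_recursive(node):
    # Recursive descent: one Python frame per nesting level.
    if isinstance(node, list):
        return sum(count_leaves_recursive(child) for child in node)
    return 1


def count_leaves_iterative(node):
    # Stack-based descent: a constant number of Python frames regardless of depth.
    total, stack = 0, [node]
    while stack:
        current = stack.pop()
        if isinstance(current, list):
            stack.extend(current)
        else:
            total += 1
    return total


# Build nesting deeper than the default recursion limit (~1000 levels).
deep = 0
for _ in range(sys.getrecursionlimit() + 100):
    deep = [deep]

print(count_leaves_iterative(deep))  # 1
# count_leaves_recursive(deep) would raise RecursionError
```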

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 24 Passed |
| 🌀 Generated Regression Tests | 50 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 88.2% |
⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| workflows/unit_tests/execution_engine/executor/execution_data_manager/test_step_input_assembler.py::test_get_empty_batch_elements_indices_from_dict_of_batches | 4.25μs | 3.74μs | 13.7% ✅ |
| workflows/unit_tests/execution_engine/executor/execution_data_manager/test_step_input_assembler.py::test_get_empty_batch_elements_indices_from_list_of_batches | 4.67μs | 4.15μs | 12.5% ✅ |
| workflows/unit_tests/execution_engine/executor/execution_data_manager/test_step_input_assembler.py::test_get_empty_batch_elements_indices_from_non_batch_elements | 5.26μs | 5.71μs | -7.83% ⚠️ |
| workflows/unit_tests/execution_engine/executor/execution_data_manager/test_step_input_assembler.py::test_get_empty_batch_elements_indices_from_single_batch | 1.96μs | 2.15μs | -8.61% ⚠️ |
🌀 Generated Regression Tests and Runtime
from typing import Any, Set

# imports
import pytest  # used for our unit tests
from inference.core.workflows.execution_engine.v1.executor.execution_data_manager.step_input_assembler import \
    get_empty_batch_elements_indices

# --- Function to test ---
# Simulate the relevant classes and types for the tests

class DynamicBatchIndex(int):
    """A simple subclass of int to simulate DynamicBatchIndex."""
    pass

class Batch:
    """
    Simulates a Batch object that can contain elements, possibly other Batches.
    Provides iter_with_indices() to iterate over (index, value) pairs.
    """
    def __init__(self, elements):
        self.elements = elements

    def iter_with_indices(self):
        # Return (DynamicBatchIndex(index), value) for each element
        for i, v in enumerate(self.elements):
            yield DynamicBatchIndex(i), v

# --- Unit tests ---

# ----------- BASIC TEST CASES -----------

def test_batch_with_no_none_elements():
    """Batch with all non-None elements should return empty set."""
    batch = Batch([1, 2, 3])
    codeflash_output = get_empty_batch_elements_indices(batch) # 506ns -> 809ns (37.5% slower)

def test_batch_with_some_none_elements():
    """Batch with some None elements should return their indices."""
    batch = Batch([None, 2, None])
    codeflash_output = get_empty_batch_elements_indices(batch) # 519ns -> 767ns (32.3% slower)

def test_batch_with_all_none_elements():
    """Batch with all elements None should return all indices."""
    batch = Batch([None, None, None])
    codeflash_output = get_empty_batch_elements_indices(batch) # 513ns -> 761ns (32.6% slower)

def test_empty_batch():
    """Empty Batch should return empty set."""
    batch = Batch([])
    codeflash_output = get_empty_batch_elements_indices(batch) # 493ns -> 763ns (35.4% slower)

def test_list_of_batches():
    """List containing multiple batches, some with None elements."""
    batch1 = Batch([None, 1])
    batch2 = Batch([2, None])
    codeflash_output = get_empty_batch_elements_indices([batch1, batch2]); result = codeflash_output # 1.65μs -> 1.46μs (13.7% faster)

def test_dict_of_batches():
    """Dict containing batches as values."""
    batch1 = Batch([None, 1])
    batch2 = Batch([2, None])
    codeflash_output = get_empty_batch_elements_indices({'a': batch1, 'b': batch2}); result = codeflash_output # 1.78μs -> 1.72μs (3.49% faster)

def test_nested_batch():
    """Batch containing another Batch as element."""
    inner = Batch([None, 1])
    outer = Batch([inner, 2, None])
    # Should find None at inner[0] and outer[2]
    codeflash_output = get_empty_batch_elements_indices(outer); result = codeflash_output # 496ns -> 722ns (31.3% slower)

# ----------- EDGE TEST CASES -----------

def test_batch_with_mixed_types():
    """Batch with None, int, str, and Batch as elements."""
    inner = Batch([None, "a"])
    batch = Batch([None, 1, "x", inner])
    # Should find None at batch[0], inner[0]
    codeflash_output = get_empty_batch_elements_indices(batch); result = codeflash_output # 508ns -> 801ns (36.6% slower)

def test_empty_list():
    """Empty list should return empty set."""
    codeflash_output = get_empty_batch_elements_indices([]) # 633ns -> 851ns (25.6% slower)

def test_empty_dict():
    """Empty dict should return empty set."""
    codeflash_output = get_empty_batch_elements_indices({}) # 854ns -> 1.13μs (24.6% slower)

def test_non_batch_non_iterable_value():
    """Non-Batch, non-iterable value should return empty set."""
    codeflash_output = get_empty_batch_elements_indices(123) # 618ns -> 856ns (27.8% slower)
    codeflash_output = get_empty_batch_elements_indices("abc") # 370ns -> 447ns (17.2% slower)
    codeflash_output = get_empty_batch_elements_indices(None) # 211ns -> 271ns (22.1% slower)

def test_list_with_none_and_batch():
    """List with None and a Batch containing None."""
    batch = Batch([None, 2])
    value = [None, batch]
    # Should find None in batch[0], but not top-level None (since only Batch elements are counted)
    codeflash_output = get_empty_batch_elements_indices(value); result = codeflash_output # 1.63μs -> 1.42μs (14.3% faster)

def test_dict_with_nested_list_and_batch():
    """Dict with nested list containing batches."""
    batch1 = Batch([None])
    batch2 = Batch([None, None])
    value = {'x': [batch1, batch2]}
    codeflash_output = get_empty_batch_elements_indices(value); result = codeflash_output # 2.24μs -> 1.92μs (16.5% faster)

def test_batch_with_nested_batch_all_none():
    """Batch with a nested Batch where all elements are None."""
    inner = Batch([None, None])
    outer = Batch([inner])
    codeflash_output = get_empty_batch_elements_indices(outer); result = codeflash_output # 497ns -> 750ns (33.7% slower)

def test_batch_with_deeply_nested_batches():
    """Batch with multiple levels of nested batches containing None."""
    deep = Batch([None])
    mid = Batch([deep, None])
    top = Batch([mid, 3])
    # Should find None at mid[1] and deep[0]
    codeflash_output = get_empty_batch_elements_indices(top); result = codeflash_output # 497ns -> 742ns (33.0% slower)

def test_batch_with_duplicate_none_indices():
    """Batch with nested batches that have overlapping None indices."""
    inner1 = Batch([None])
    inner2 = Batch([None])
    outer = Batch([inner1, inner2])
    codeflash_output = get_empty_batch_elements_indices(outer); result = codeflash_output # 488ns -> 751ns (35.0% slower)

# ----------- LARGE SCALE TEST CASES -----------

def test_large_batch_no_none():
    """Large batch with no None elements."""
    batch = Batch([i for i in range(1000)])
    codeflash_output = get_empty_batch_elements_indices(batch) # 517ns -> 739ns (30.0% slower)

def test_large_batch_some_none():
    """Large batch with some None elements at regular intervals."""
    elements = [None if i % 100 == 0 else i for i in range(1000)]
    batch = Batch(elements)
    expected = {DynamicBatchIndex(i) for i in range(0, 1000, 100)}
    codeflash_output = get_empty_batch_elements_indices(batch) # 518ns -> 757ns (31.6% slower)

def test_large_list_of_batches():
    """Large list of batches, each with one None at a specific index."""
    batches = [Batch([i, None]) for i in range(1000)]
    # Should find None at index 1 for each batch, but only indices are returned (not batch index)
    codeflash_output = get_empty_batch_elements_indices(batches); result = codeflash_output # 150μs -> 80.4μs (86.9% faster)

def test_large_dict_of_batches():
    """Large dict of batches, each with None at index 0."""
    batches = {str(i): Batch([None, i]) for i in range(1000)}
    codeflash_output = get_empty_batch_elements_indices(batches); result = codeflash_output # 152μs -> 82.2μs (85.0% faster)

def test_large_nested_batches():
    """Batch containing 1000 batches, each with None at index 0."""
    inner_batches = [Batch([None, i]) for i in range(1000)]
    outer_batch = Batch(inner_batches)
    codeflash_output = get_empty_batch_elements_indices(outer_batch); result = codeflash_output # 563ns -> 853ns (34.0% slower)
    # Should find None at each inner batch's index in outer_batch
    expected = {DynamicBatchIndex(i) for i in range(1000)}

def test_large_batch_all_none():
    """Large batch with all elements None."""
    batch = Batch([None] * 1000)
    expected = {DynamicBatchIndex(i) for i in range(1000)}
    codeflash_output = get_empty_batch_elements_indices(batch) # 571ns -> 810ns (29.5% slower)

# ----------- ADDITIONAL EDGE CASES -----------

def test_batch_with_non_batch_iterable():
    """Batch containing a list (not a Batch) with None."""
    batch = Batch([[None, 1], None])
    # Only None at index 1 should be counted, since [None, 1] is not a Batch
    codeflash_output = get_empty_batch_elements_indices(batch) # 511ns -> 762ns (32.9% slower)

def test_batch_with_dict_element():
    """Batch containing a dict (not a Batch) with None."""
    batch = Batch([{'a': None}, None])
    # Only None at index 1 should be counted, since {'a': None} is not a Batch
    codeflash_output = get_empty_batch_elements_indices(batch) # 515ns -> 760ns (32.2% slower)

def test_batch_with_empty_batch_element():
    """Batch containing an empty Batch."""
    empty = Batch([])
    batch = Batch([empty, None])
    # None at index 1, empty batch at index 0 has no None elements
    codeflash_output = get_empty_batch_elements_indices(batch) # 511ns -> 749ns (31.8% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Any, Set

# imports
import pytest
from inference.core.workflows.execution_engine.v1.executor.execution_data_manager.step_input_assembler import \
    get_empty_batch_elements_indices

# --- Minimal stubs for Batch and DynamicBatchIndex for testing purposes ---

class DynamicBatchIndex(int):
    """A simple subclass of int for identity."""
    pass

class Batch:
    """
    A minimal Batch class for testing.
    Holds a list of elements, exposes iter_with_indices().
    """
    def __init__(self, elements):
        self.elements = elements

    def iter_with_indices(self):
        """Yields (DynamicBatchIndex, element) for each element."""
        for idx, el in enumerate(self.elements):
            yield DynamicBatchIndex(idx), el

# unit tests

# ----------------- BASIC TEST CASES -----------------

def test_empty_batch_returns_all_indices():
    # All elements are None, should return all indices
    b = Batch([None, None, None])
    codeflash_output = get_empty_batch_elements_indices(b) # 552ns -> 932ns (40.8% slower)

def test_batch_no_empty_elements():
    # No elements are None, should return empty set
    b = Batch([1, 2, 3])
    codeflash_output = get_empty_batch_elements_indices(b) # 528ns -> 784ns (32.7% slower)

def test_batch_some_empty_elements():
    # Some elements are None, should return their indices
    b = Batch([None, 2, None, 4])
    codeflash_output = get_empty_batch_elements_indices(b) # 517ns -> 762ns (32.2% slower)

def test_list_of_batches():
    # List of batches, some have None elements
    b1 = Batch([None, 1])
    b2 = Batch([2, None])
    codeflash_output = get_empty_batch_elements_indices([b1, b2]); result = codeflash_output # 1.64μs -> 1.60μs (2.50% faster)

def test_dict_of_batches():
    # Dict of batches, some have None elements
    b1 = Batch([None, 1])
    b2 = Batch([2, None])
    d = {'a': b1, 'b': b2}
    codeflash_output = get_empty_batch_elements_indices(d); result = codeflash_output # 1.88μs -> 1.78μs (5.33% faster)

def test_nested_list_and_dict():
    # Nested list/dict structures containing batches
    b1 = Batch([None, 1])
    b2 = Batch([2, None])
    structure = {'first': [b1], 'second': {'deep': b2}}
    codeflash_output = get_empty_batch_elements_indices(structure); result = codeflash_output # 2.47μs -> 2.28μs (8.52% faster)

def test_batch_with_nested_batch():
    # Batch contains another batch as element
    inner = Batch([None, 2])
    outer = Batch([inner, 3, None])
    # Should find None in inner (index 0) and None in outer (index 2)
    codeflash_output = get_empty_batch_elements_indices(outer); result = codeflash_output # 429ns -> 749ns (42.7% slower)

# ----------------- EDGE TEST CASES -----------------

def test_empty_batch():
    # Batch with no elements
    b = Batch([])
    codeflash_output = get_empty_batch_elements_indices(b) # 492ns -> 732ns (32.8% slower)

def test_empty_list():
    # Empty list
    codeflash_output = get_empty_batch_elements_indices([]) # 592ns -> 920ns (35.7% slower)

def test_empty_dict():
    # Empty dict
    codeflash_output = get_empty_batch_elements_indices({}) # 826ns -> 1.09μs (24.6% slower)

def test_none_input():
    # Input is None (not a Batch), should return empty set
    codeflash_output = get_empty_batch_elements_indices(None) # 487ns -> 774ns (37.1% slower)

def test_non_batch_non_collection_input():
    # Input is a scalar, not a batch or collection
    codeflash_output = get_empty_batch_elements_indices(42) # 625ns -> 854ns (26.8% slower)
    codeflash_output = get_empty_batch_elements_indices("string") # 341ns -> 445ns (23.4% slower)
    codeflash_output = get_empty_batch_elements_indices(3.14) # 214ns -> 282ns (24.1% slower)

def test_batch_with_all_nested_batches_empty():
    # Batch contains only batches, all of which are empty
    b1 = Batch([])
    b2 = Batch([])
    outer = Batch([b1, b2])
    codeflash_output = get_empty_batch_elements_indices(outer) # 469ns -> 718ns (34.7% slower)

def test_batch_with_mixed_types():
    # Batch contains None, batch, and scalar
    inner = Batch([None])
    outer = Batch([None, inner, 5])
    # Should find None at outer[0] and inner[0]
    codeflash_output = get_empty_batch_elements_indices(outer) # 484ns -> 735ns (34.1% slower)

def test_deeply_nested_structures():
    # Deeply nested mix of dicts, lists, and batches
    b = Batch([None, 2, None])
    structure = {'a': [Batch([None]), {'b': b}]}
    codeflash_output = get_empty_batch_elements_indices(structure); result = codeflash_output # 2.71μs -> 2.25μs (20.4% faster)

def test_batch_with_duplicate_none_indices():
    # Different batches at different nesting with same index None
    b1 = Batch([None, 1])
    b2 = Batch([None, 2])
    structure = [b1, b2]
    # Both have None at index 0, but indices are not distinguished by batch
    codeflash_output = get_empty_batch_elements_indices(structure) # 1.48μs -> 1.51μs (1.79% slower)

def test_batch_with_non_iterable_elements():
    # Batch contains elements that are not iterable, including None
    b = Batch([None, 1, "foo", 3.14])
    codeflash_output = get_empty_batch_elements_indices(b) # 488ns -> 749ns (34.8% slower)

# ----------------- LARGE SCALE TEST CASES -----------------

def test_large_batch():
    # Large batch with 1000 elements, every 10th is None
    elements = [None if i % 10 == 0 else i for i in range(1000)]
    b = Batch(elements)
    expected = {DynamicBatchIndex(i) for i in range(0, 1000, 10)}
    codeflash_output = get_empty_batch_elements_indices(b) # 513ns -> 779ns (34.1% slower)

def test_large_nested_structure():
    # Dict of 10 lists, each with 10 batches of 10 elements, every 7th element is None
    structure = {}
    for d in range(10):
        structure[d] = []
        for l in range(10):
            elements = [None if i % 7 == 0 else i for i in range(10)]
            structure[d].append(Batch(elements))
    # All None indices are multiples of 7 in 0..9, i.e., 0 and 7
    expected = {DynamicBatchIndex(0), DynamicBatchIndex(7)}
    codeflash_output = get_empty_batch_elements_indices(structure); result = codeflash_output # 19.2μs -> 11.8μs (63.2% faster)

def test_large_mixed_nesting():
    # List of 100 batches, each with 10 elements, every 3rd is None
    batches = [Batch([None if i % 3 == 0 else i for i in range(10)]) for _ in range(100)]
    expected = {DynamicBatchIndex(i) for i in range(0, 10, 3)}
    codeflash_output = get_empty_batch_elements_indices(batches); result = codeflash_output # 16.3μs -> 9.66μs (68.9% faster)

def test_performance_large_deep_nesting():
    # 10-level nested lists, each containing a batch of 10 elements, first element None
    nested = Batch([None] + [1]*9)
    for _ in range(10):
        nested = [nested]
    # Should still find index 0 as None
    codeflash_output = get_empty_batch_elements_indices(nested) # 3.19μs -> 1.72μs (86.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-get_empty_batch_elements_indices-mh9v16vh` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 28, 2025 01:00
@codeflash-ai codeflash-ai bot added labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Oct 28, 2025