@codeflash-ai codeflash-ai bot commented Oct 27, 2025

📄 27% (0.27x) speedup for MicrosoftSQLServerSinkBlockV1._validate_data in inference/enterprise/workflows/enterprise_blocks/sinks/microsoft_sql_server/v1.py

⏱️ Runtime: 435 microseconds → 343 microseconds (best of 121 runs)

📝 Explanation and details

The optimization removes unnecessary set() conversions when comparing dictionary keys for consistency validation.

Key change: In the _validate_data method, the original code converted data[0].keys() and item.keys() to sets before comparison:

first_keys = set(data[0].keys())
if set(item.keys()) != first_keys:

The optimized version directly compares the dictionary key views:

first_keys = data[0].keys()
if item.keys() != first_keys:
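
For context, the loop around this comparison might look roughly like the sketch below. It is reconstructed from the error messages the generated tests match against, not copied from the actual source: the function name `check_consistent_keys` is hypothetical, it covers only the list branch, and the real messages may be longer than the matched substrings.

```python
from typing import Any, Dict, List


def check_consistent_keys(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the list branch of _validate_data."""
    if not data:
        raise ValueError("Empty list provided for insert operation")
    if not all(isinstance(item, dict) for item in data):
        raise ValueError("All items in data list must be dictionaries")

    # Key views are compared directly; no intermediate set objects are built.
    first_keys = data[0].keys()
    for index, item in enumerate(data[1:], start=1):
        if item.keys() != first_keys:
            raise ValueError(
                f"Dictionary at index {index} has different keys "
                "than the first dictionary"
            )
    return data
```

Calling `check_consistent_keys([{"a": 1, "b": 2}, {"b": 3, "a": 4}])` returns the list unchanged, because the two dicts hold the same keys even though they were inserted in a different order.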

Why this is faster: Dictionary key views (.keys()) are set-like and can be compared directly, without building intermediate set objects. Each set() call allocates a new set and hashes and copies every key into it, whereas in CPython comparing two key views checks their lengths and then tests each key of one view for membership in the other dict, with no extra allocation.
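
The difference can be checked locally with a small `timeit` sketch such as the one below. The helper names `check_with_sets` and `check_with_views` are illustrative and not part of the codebase, and absolute timings will vary by machine and Python version.

```python
import timeit

# 1000 rows with identical keys, mirroring the large-scale generated tests.
rows = [{"id": i, "val": i * 2} for i in range(1000)]


def check_with_sets(data):
    # Original approach: build a set for every dictionary.
    first_keys = set(data[0].keys())
    return all(set(item.keys()) == first_keys for item in data)


def check_with_views(data):
    # Optimized approach: compare the key views directly.
    first_keys = data[0].keys()
    return all(item.keys() == first_keys for item in data)


# Both variants must agree before their timings are worth comparing.
assert check_with_sets(rows) == check_with_views(rows)

if __name__ == "__main__":
    runs = 200
    for fn in (check_with_sets, check_with_views):
        seconds = timeit.timeit(lambda: fn(rows), number=runs)
        print(f"{fn.__name__}: {seconds / runs * 1e6:.1f} µs per call")
```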

Performance gains are most significant for:

  • Large lists with many dictionaries (28-30% speedup in large-scale tests)
  • Lists with multiple small dictionaries (20-29% speedup in multi-dict scenarios)
  • Error cases that require key comparison before failing (19-30% speedup in validation failures)

The optimization maintains identical behavior for this use case: dictionary key views are set-like, so comparing two of them with == or != depends only on which keys are present, not on insertion order. This matches the semantics of the original set comparison.
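
A short sketch of those equality semantics, using made-up dictionaries `a`, `b`, and `c`:

```python
a = {"x": 1, "y": 2}
b = {"y": 3, "x": 4}  # same keys as a, different insertion order
c = {"x": 1, "z": 2}  # one key differs from a

# Iteration order differs, but set-like equality ignores it.
assert list(a.keys()) != list(b.keys())
assert a.keys() == b.keys()

# The view comparison agrees with the original set comparison in both cases.
assert (a.keys() == b.keys()) == (set(a.keys()) == set(b.keys()))
assert (a.keys() != c.keys()) == (set(a.keys()) != set(c.keys()))

# A key view also compares equal to a plain set holding the same keys.
assert a.keys() == {"x", "y"}
```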

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 44 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

# imports
import pytest  # used for our unit tests
from inference.enterprise.workflows.enterprise_blocks.sinks.microsoft_sql_server.v1 import \
    MicrosoftSQLServerSinkBlockV1

# unit tests

@pytest.fixture
def block():
    # Fixture to create a block instance for testing
    return MicrosoftSQLServerSinkBlockV1(None, None)

# --- Basic Test Cases ---

def test_single_dict_returns_list(block):
    # Should wrap a single dict in a list
    data = {"a": 1, "b": 2}
    codeflash_output = block._validate_data(data); result = codeflash_output # 650ns -> 576ns (12.8% faster)

def test_list_of_dicts_returns_same_list(block):
    # Should return the list unchanged if all dicts have same keys
    data = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
    codeflash_output = block._validate_data(data); result = codeflash_output # 3.88μs -> 3.01μs (28.9% faster)

def test_list_of_one_dict_returns_same_list(block):
    # Should return the list unchanged if only one dict
    data = [{"x": 10, "y": 20}]
    codeflash_output = block._validate_data(data); result = codeflash_output # 1.46μs -> 1.32μs (10.6% faster)

# --- Edge Test Cases ---

def test_empty_list_raises(block):
    # Should raise ValueError for empty list
    with pytest.raises(ValueError, match="Empty list provided for insert operation"):
        block._validate_data([]) # 1.17μs -> 946ns (23.6% faster)

def test_list_with_non_dict_raises(block):
    # Should raise ValueError if any item is not a dict
    data = [{"a": 1}, 42, {"a": 2}]
    with pytest.raises(ValueError, match="All items in data list must be dictionaries"):
        block._validate_data(data) # 2.29μs -> 1.92μs (19.5% faster)

def test_list_with_none_raises(block):
    # Should raise ValueError if any item is None
    data = [{"a": 1}, None]
    with pytest.raises(ValueError, match="All items in data list must be dictionaries"):
        block._validate_data(data) # 2.01μs -> 1.88μs (6.99% faster)

def test_list_with_different_keys_raises(block):
    # Should raise ValueError if dicts have different keys
    data = [{"a": 1, "b": 2}, {"a": 3, "c": 4}]
    with pytest.raises(ValueError, match="Dictionary at index 1 has different keys than the first dictionary"):
        block._validate_data(data) # 4.06μs -> 3.27μs (24.1% faster)

def test_list_with_extra_key_raises(block):
    # Should raise ValueError if one dict has extra keys
    data = [{"a": 1}, {"a": 2, "b": 3}]
    with pytest.raises(ValueError, match="Dictionary at index 1 has different keys than the first dictionary"):
        block._validate_data(data) # 3.47μs -> 2.86μs (21.3% faster)

def test_list_with_missing_key_raises(block):
    # Should raise ValueError if one dict has missing keys
    data = [{"a": 1, "b": 2}, {"a": 3}]
    with pytest.raises(ValueError, match="Dictionary at index 1 has different keys than the first dictionary"):
        block._validate_data(data) # 3.45μs -> 2.83μs (22.0% faster)

def test_dict_with_empty_keys(block):
    # Should accept a dict with no keys
    data = {}
    codeflash_output = block._validate_data(data); result = codeflash_output # 479ns -> 390ns (22.8% faster)

def test_list_of_dicts_with_empty_dicts(block):
    # Should accept list of empty dicts (keys are the same: empty)
    data = [{}, {}]
    codeflash_output = block._validate_data(data); result = codeflash_output # 3.04μs -> 2.59μs (17.5% faster)

def test_dict_with_non_string_keys(block):
    # Should accept dicts with non-string keys
    data = {1: "one", 2: "two"}
    codeflash_output = block._validate_data(data); result = codeflash_output # 484ns -> 433ns (11.8% faster)

def test_list_of_dicts_with_non_string_keys(block):
    # Should accept list of dicts with non-string keys (if keys match)
    data = [{1: "a", 2: "b"}, {1: "x", 2: "y"}]
    codeflash_output = block._validate_data(data); result = codeflash_output # 3.27μs -> 2.60μs (25.9% faster)

def test_list_with_nested_dicts(block):
    # Should accept dicts with nested dict values if keys match
    data = [{"a": {"x": 1}, "b": 2}, {"a": {"y": 2}, "b": 3}]
    codeflash_output = block._validate_data(data); result = codeflash_output # 2.90μs -> 2.40μs (21.1% faster)

def test_list_with_dicts_with_unhashable_values(block):
    # Should accept dicts with unhashable values (like lists)
    data = [{"a": [1, 2], "b": 2}, {"a": [3, 4], "b": 3}]
    codeflash_output = block._validate_data(data); result = codeflash_output # 2.89μs -> 2.24μs (29.1% faster)

# --- Large Scale Test Cases ---

def test_large_list_of_dicts(block):
    # Should handle a large list of dicts with identical keys
    data = [{"id": i, "val": i * 2} for i in range(1000)]
    codeflash_output = block._validate_data(data); result = codeflash_output # 128μs -> 99.6μs (28.7% faster)

def test_large_list_with_different_keys_raises(block):
    # Should raise ValueError if one dict in large list has different keys
    data = [{"id": i, "val": i * 2} for i in range(999)]
    data.append({"id": 1000})  # missing 'val'
    with pytest.raises(ValueError, match="Dictionary at index 999 has different keys than the first dictionary"):
        block._validate_data(data) # 129μs -> 99.0μs (30.5% faster)

def test_large_list_with_non_dict_raises(block):
    # Should raise ValueError if one item in large list is not a dict
    data = [{"id": i, "val": i * 2} for i in range(999)]
    data.append("not a dict")
    with pytest.raises(ValueError, match="All items in data list must be dictionaries"):
        block._validate_data(data) # 28.3μs -> 26.9μs (5.11% faster)

def test_large_list_of_empty_dicts(block):
    # Should accept large list of empty dicts
    data = [{} for _ in range(1000)]
    codeflash_output = block._validate_data(data); result = codeflash_output # 112μs -> 86.3μs (30.1% faster)


def test_input_is_not_dict_or_list(block):
    # Input is neither a dict nor a list. The function does not explicitly handle
    # this case (it appears to return None), which may be a bug. No assertion is
    # made here; the output is only recorded for comparison.
    codeflash_output = block._validate_data("not a dict or list"); result = codeflash_output # 680ns -> 629ns (8.11% faster)

def test_input_is_tuple(block):
    # Input is a tuple of dicts, which is neither a dict nor a list; output is only recorded
    codeflash_output = block._validate_data(({"a": 1}, {"a": 2})); result = codeflash_output # 678ns -> 520ns (30.4% faster)

def test_input_is_set(block):
    # Input is a set, which is neither a dict nor a list; output is only recorded.
    # Note: frozenset({"a": 1}) keeps only the dict's keys, so this set holds one element.
    codeflash_output = block._validate_data({frozenset({"a": 1}), frozenset({"a": 2})}); result = codeflash_output # 558ns -> 482ns (15.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from inference.enterprise.workflows.enterprise_blocks.sinks.microsoft_sql_server.v1 import \
    MicrosoftSQLServerSinkBlockV1

# unit tests

@pytest.fixture
def block():
    # Fixture to create a fresh block instance for each test
    return MicrosoftSQLServerSinkBlockV1()

# ------------------------- BASIC TEST CASES -------------------------

def test_single_dict_returns_list(block):
    # Should wrap a single dict in a list
    data = {"a": 1, "b": 2}
    codeflash_output = block._validate_data(data); result = codeflash_output

def test_list_of_dicts_with_same_keys(block):
    # Should return the list unchanged if all dicts have same keys
    data = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
    codeflash_output = block._validate_data(data); result = codeflash_output

def test_list_of_one_dict(block):
    # Should accept a list with a single dict
    data = [{"x": 42}]
    codeflash_output = block._validate_data(data); result = codeflash_output

def test_list_of_dicts_with_empty_dicts_same_keys(block):
    # Should accept list of empty dicts (all have same keys: none)
    data = [{} for _ in range(3)]
    codeflash_output = block._validate_data(data); result = codeflash_output

# ------------------------- EDGE TEST CASES -------------------------

def test_empty_list_raises(block):
    # Should raise ValueError for empty list
    with pytest.raises(ValueError, match="Empty list provided for insert operation"):
        block._validate_data([])

def test_list_with_non_dict_raises(block):
    # Should raise ValueError if any item is not a dict
    data = [{"a": 1}, "not a dict", {"a": 2}]
    with pytest.raises(ValueError, match="All items in data list must be dictionaries"):
        block._validate_data(data)

def test_list_with_different_keys_raises(block):
    # Should raise ValueError if dicts have different keys
    data = [{"a": 1, "b": 2}, {"a": 3, "c": 4}]
    with pytest.raises(ValueError, match="Dictionary at index 1 has different keys than the first dictionary"):
        block._validate_data(data)

def test_list_with_extra_key_in_second_dict(block):
    # Should raise ValueError if one dict has extra key
    data = [{"a": 1}, {"a": 2, "b": 3}]
    with pytest.raises(ValueError, match="Dictionary at index 1 has different keys than the first dictionary"):
        block._validate_data(data)

def test_list_with_missing_key_in_second_dict(block):
    # Should raise ValueError if one dict is missing a key
    data = [{"a": 1, "b": 2}, {"a": 3}]
    with pytest.raises(ValueError, match="Dictionary at index 1 has different keys than the first dictionary"):
        block._validate_data(data)

def test_input_is_not_dict_or_list(block):
    # Inputs that are neither dict nor list are not explicitly handled; per the
    # implementation they appear to return None implicitly. No assertion is made
    # here; each output is only recorded for comparison.
    codeflash_output = block._validate_data("string")
    codeflash_output = block._validate_data(123)
    codeflash_output = block._validate_data(None)

def test_list_with_nested_dicts(block):
    # Should accept list of dicts even if values are dicts themselves, as long as keys match
    data = [{"a": {"nested": 1}}, {"a": {"nested": 2}}]
    codeflash_output = block._validate_data(data); result = codeflash_output

def test_list_of_dicts_with_different_key_order(block):
    # Should accept dicts with same keys in different order
    data = [{"a": 1, "b": 2}, {"b": 3, "a": 4}]
    codeflash_output = block._validate_data(data); result = codeflash_output

# ------------------------- LARGE SCALE TEST CASES -------------------------

def test_large_list_of_dicts_same_keys(block):
    # Should handle a large list of dicts with same keys efficiently
    data = [{"a": i, "b": i * 2} for i in range(1000)]
    codeflash_output = block._validate_data(data); result = codeflash_output

def test_large_list_of_dicts_with_one_different_key(block):
    # Should raise ValueError if one dict has a different key in a large list
    data = [{"a": i, "b": i * 2} for i in range(999)]
    data.append({"a": 1000, "c": 2000})  # last dict has different keys
    with pytest.raises(ValueError, match="Dictionary at index 999 has different keys than the first dictionary"):
        block._validate_data(data)

def test_large_list_all_empty_dicts(block):
    # Should accept large list of empty dicts (all have same keys: none)
    data = [{} for _ in range(1000)]
    codeflash_output = block._validate_data(data); result = codeflash_output

def test_large_list_with_non_dict_in_middle(block):
    # Should raise ValueError if a non-dict is present in a large list
    data = [{"x": i} for i in range(500)] + ["not a dict"] + [{"x": i} for i in range(499)]
    with pytest.raises(ValueError, match="All items in data list must be dictionaries"):
        block._validate_data(data)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-MicrosoftSQLServerSinkBlockV1._validate_data-mh9qo9t9 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 27, 2025 22:58
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Oct 27, 2025