Conversation

codeflash-ai bot commented Oct 28, 2025

📄 9% (0.09x) speedup for sort_risk in gs_quant/risk/core.py

⏱️ Runtime : 18.4 milliseconds → 16.9 milliseconds (best of 165 runs)
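(For reference: (18.4 − 16.9) / 16.9 ≈ 0.089, the headline 9% figure; measured against the original runtime instead, (18.4 − 16.9) / 18.4 ≈ 8%, the figure used in the explanation below.)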

📝 Explanation and details

The optimized code achieves an **8% speedup** through several key algorithmic and data structure improvements:

**Primary Optimizations:**

1. **Reduced list indexing overhead in `sort_values`**: The original code created a sparse list `fns` with `len(columns)` elements but only populated specific indices. The optimized version uses compact lists `indices` and `fns` that are aligned by position, eliminating unnecessary `None` checks and sparse array access during sorting.

2. **Simplified sort key function**: The optimized `cmp` function uses `zip(indices, fns)` to iterate through aligned indices and functions directly, avoiding repeated lookups into the sparse `fns` array. This reduces the per-row computational overhead during sorting.

3. **Efficient DataFrame construction**: Replaced `pd.DataFrame.from_records()` with the direct `pd.DataFrame()` constructor, which is faster for array-like data structures.

4. **Numpy array conversion optimization**: Added a `.tolist()` conversion for numpy arrays, as Python's `sorted()` works more efficiently with native Python lists than with numpy arrays. A minimal sketch of all four changes follows this list.
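To make the list above concrete, here is a minimal, self-contained sketch of the optimized pattern. The names `indices`, `fns` and `cmp` mirror the description; the signature, the placeholder `str` key functions and the usage at the bottom are illustrative assumptions, not the actual gs_quant implementation:

import pandas as pd

def sort_values(values, columns, by):
    # Compact, position-aligned lists instead of one sparse fns list of
    # len(columns) entries: only the 'by' columns that exist get an entry.
    indices = [columns.index(c) for c in by if c in columns]
    fns = [str] * len(indices)  # placeholder key fns; the real code picks per-column fns

    def cmp(row):
        # zip over the aligned lists: no sparse lookups, no None checks
        return tuple(fn(row[i]) for i, fn in zip(indices, fns))

    # sorted() iterates native Python lists faster than numpy arrays
    if hasattr(values, 'tolist'):
        values = values.tolist()

    # direct pd.DataFrame() constructor instead of pd.DataFrame.from_records()
    return pd.DataFrame(sorted(values, key=cmp), columns=list(columns))

# usage: rows as tuples, sorted by 'date' then 'time'
rows = [('2024-01-02', 2), ('2024-01-01', 1)]
print(sort_values(rows, ('date', 'time'), ('date', 'time')))

The point of the shape change is that `cmp` does len(by) units of work per row, with no conditional branches and no sparse-array indexing.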

**Performance Impact by Test Case:**

- **Large-scale tests show the biggest gains**: 14.1-22.9% speedup on the 500-1000 row datasets, where the sorting overhead reduction becomes most pronounced
- **Small datasets**: Minimal impact (0-2% variation), as expected since the overhead is less significant
- **Edge cases**: Some regressions on very small or empty inputs (the empty-DataFrame case runs 35.5% slower) due to the added `.tolist()` conversion check, but this is offset by gains on realistic data sizes

The optimization particularly excels with larger datasets where the cumulative effect of reduced per-row sorting overhead compounds significantly.
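Note that `sorted()` with a `key=` argument evaluates the key exactly once per element, so per-invocation savings in `cmp` scale linearly with row count; this is consistent with the gains concentrating in the large-scale tests.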

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 17 Passed |
| ⏪ Replay Tests | 1 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
# imports
import pandas as pd
import pytest  # used for our unit tests

from gs_quant.risk.core import sort_risk

__risk_columns = ('date', 'time', 'mkt_type', 'mkt_asset', 'mkt_class', 'mkt_point')

# unit tests

# 1. Basic Test Cases

def test_sort_risk_basic_numeric_sort():
    # Test that rows are sorted by date and then by time
    df = pd.DataFrame([
        {'date': '2024-01-02', 'time': 2, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': 5},
        {'date': '2024-01-01', 'time': 3, 'mkt_type': 'B', 'mkt_asset': 'Y', 'mkt_class': 'bar', 'mkt_point': 1},
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': 2},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 669μs -> 669μs (0.058% faster)
    # Should sort by date, then time, then mkt_type, etc.
    # sort_risk sets 'date' as the index when present (see the .loc[d] usage further down)
    expected_dates = ['2024-01-01', '2024-01-01', '2024-01-02']
    expected_times = [1, 3, 2]
    assert list(sorted_df.index) == expected_dates
    assert list(sorted_df['time']) == expected_times

def test_sort_risk_basic_string_sort():
    # Sorting by string columns
    df = pd.DataFrame([
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'B', 'mkt_asset': 'Y', 'mkt_class': 'foo', 'mkt_point': 2},
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'bar', 'mkt_point': 1},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 636μs -> 642μs (1.08% slower)

def test_sort_risk_basic_custom_by():
    # Sorting by custom column order
    df = pd.DataFrame([
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'B', 'mkt_asset': 'Y', 'mkt_class': 'foo', 'mkt_point': 2},
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'bar', 'mkt_point': 1},
    ])
    codeflash_output = sort_risk(df, by=('mkt_type', 'date')); sorted_df = codeflash_output # 659μs -> 674μs (2.16% slower)

def test_sort_risk_basic_missing_sort_columns():
    # If some sort columns are missing, should sort by available ones
    df = pd.DataFrame([
        {'date': '2024-01-02', 'foo': 1},
        {'date': '2024-01-01', 'foo': 2},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 520μs -> 516μs (0.715% faster)

# 2. Edge Test Cases

def test_sort_risk_empty_dataframe():
    # Should handle empty DataFrame gracefully
    df = pd.DataFrame(columns=['date', 'time', 'mkt_type', 'mkt_asset', 'mkt_class', 'mkt_point'])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 485μs -> 752μs (35.5% slower)

def test_sort_risk_single_row():
    # Should handle DataFrame with a single row
    df = pd.DataFrame([{'date': '2024-01-01', 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': 2}])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 637μs -> 636μs (0.134% faster)

def test_sort_risk_no_sort_columns():
    # If none of the sort columns are present, should retain original column order
    df = pd.DataFrame([
        {'foo': 1, 'bar': 2},
        {'foo': 2, 'bar': 1},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 316μs -> 317μs (0.212% slower)

def test_sort_risk_nonstandard_types():
    # Should handle mixed types in sort columns
    df = pd.DataFrame([
        {'date': '2024-01-01', 'time': None, 'mkt_type': 1, 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': '2pt'},
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'Y', 'mkt_class': 'bar', 'mkt_point': 1},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 646μs -> 657μs (1.60% slower)

def test_sort_risk_duplicate_rows():
    # Should preserve duplicates and sort correctly
    df = pd.DataFrame([
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': 2},
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': 2},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 629μs -> 631μs (0.280% slower)

def test_sort_risk_index_column_not_date():
    # If 'date' column is missing, should not set index
    df = pd.DataFrame([
        {'foo': 1, 'bar': 2},
        {'foo': 2, 'bar': 1},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 315μs -> 314μs (0.246% faster)

def test_sort_risk_column_order_preserved():
    # Columns not in 'by' should be preserved after sorted columns
    df = pd.DataFrame([
        {'date': '2024-01-01', 'extra': 99, 'time': 1, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': 2},
        {'date': '2024-01-02', 'extra': 88, 'time': 2, 'mkt_type': 'B', 'mkt_asset': 'Y', 'mkt_class': 'bar', 'mkt_point': 1},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 668μs -> 666μs (0.230% faster)

def test_sort_risk_nan_values():
    # Should handle NaN values in sort columns
    import numpy as np
    df = pd.DataFrame([
        {'date': '2024-01-01', 'time': np.nan, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': 2},
        {'date': '2024-01-01', 'time': 1, 'mkt_type': 'B', 'mkt_asset': 'Y', 'mkt_class': 'bar', 'mkt_point': 1},
    ])
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 638μs -> 651μs (2.01% slower)

# 3. Large Scale Test Cases

def test_sort_risk_large_scale_sorted():
    # Test with 1000 rows, already sorted
    df = pd.DataFrame({
        'date': ['2024-01-01'] * 1000,
        'time': list(range(1000)),
        'mkt_type': ['A'] * 1000,
        'mkt_asset': ['X'] * 1000,
        'mkt_class': ['foo'] * 1000,
        'mkt_point': list(range(1000))
    })
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 2.90ms -> 2.46ms (17.9% faster)

def test_sort_risk_large_scale_reverse():
    # Test with 1000 rows, reverse sorted
    df = pd.DataFrame({
        'date': ['2024-01-01'] * 1000,
        'time': list(reversed(range(1000))),
        'mkt_type': ['A'] * 1000,
        'mkt_asset': ['X'] * 1000,
        'mkt_class': ['foo'] * 1000,
        'mkt_point': list(reversed(range(1000)))
    })
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 2.16ms -> 1.78ms (21.6% faster)

def test_sort_risk_large_scale_random():
    # Test with 500 rows, random order
    import random
    times = list(range(500))
    random.shuffle(times)
    df = pd.DataFrame({
        'date': ['2024-01-01'] * 500,
        'time': times,
        'mkt_type': ['A'] * 500,
        'mkt_asset': ['X'] * 500,
        'mkt_class': ['foo'] * 500,
        'mkt_point': times
    })
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 1.52ms -> 1.33ms (14.1% faster)

def test_sort_risk_large_scale_multiple_dates():
    # Test with 1000 rows, 10 dates, shuffled
    import random
    dates = ['2024-01-%02d' % d for d in range(1, 11)]
    rows = []
    for i in range(100):
        for d in dates:
            rows.append({'date': d, 'time': i, 'mkt_type': 'A', 'mkt_asset': 'X', 'mkt_class': 'foo', 'mkt_point': i})
    random.shuffle(rows)
    df = pd.DataFrame(rows)
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 2.36ms -> 1.98ms (19.0% faster)
    # Should be sorted by date, then time
    sorted_dates = sorted(dates)
    # Check that for each date, times are sorted
    for d in sorted_dates:
        times = list(sorted_df.loc[d]['time'])
        assert times == sorted(times)

def test_sort_risk_large_scale_performance():
    # Performance: Should not take excessive time for 1000 rows
    import time
    df = pd.DataFrame({
        'date': ['2024-01-01'] * 1000,
        'time': list(reversed(range(1000))),
        'mkt_type': ['A'] * 1000,
        'mkt_asset': ['X'] * 1000,
        'mkt_class': ['foo'] * 1000,
        'mkt_point': list(range(1000))
    })
    start = time.time()
    codeflash_output = sort_risk(df); sorted_df = codeflash_output # 2.20ms -> 1.79ms (22.9% faster)
    duration = time.time() - start
    assert duration < 1.0  # generous bound: 1000 rows should sort well under a second
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
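To reproduce the regression tests locally, something along these lines should work, assuming the block above is saved as test_sort_risk.py (a hypothetical filename) in an environment with gs_quant installed:

import pytest

# run the generated regression tests; -q keeps the output terse
pytest.main(["-q", "test_sort_risk.py"])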
⏪ Replay Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_pytest_gs_quanttestapitest_content_py_gs_quanttestanalyticstest_workspace_py_gs_quanttesttimeseriest__replay_test_0.py::test_gs_quant_risk_core_sort_risk | 470μs | 476μs | -1.21% ⚠️ |

To edit these changes, run `git checkout codeflash/optimize-sort_risk-mhaz1i43` and push.

Codeflash

codeflash-ai bot requested a review from mashraf-222 on Oct 28, 2025 at 19:40
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Oct 28, 2025