codeflash-ai bot commented on Oct 28, 2025

📄 11% (0.11x) speedup for `build_exposure_df` in `gs_quant/markets/portfolio_manager_utils.py`

⏱️ Runtime: 93.0 milliseconds → 83.5 milliseconds (best of 76 runs)

📝 Explanation and details

The optimized code achieves an 11% speedup through two key vectorization improvements:

**1. Vectorized Column Multiplication (Primary Optimization)**
The original code used a loop to multiply each sensitivity column by notional values:

```python
for column in columns:
    universe_sensitivities_df[column] = universe_sensitivities_df[column] * notional_df['Notional']
```

The optimized version uses vectorized NumPy operations:

```python
notional_values = notional_df['Notional'].values
universe_sensitivities_df.loc[:, columns] = universe_sensitivities_df[columns].values * notional_values[:, None]
```

This eliminates the Python loop overhead and leverages NumPy's efficient broadcasting, which is particularly beneficial for larger datasets, as shown in the test results.
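As a minimal, self-contained sketch of the broadcasting at work (hypothetical toy data, not taken from the PR):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 assets x 2 factor columns
sens = pd.DataFrame({'F1': [1.0, 2.0, 3.0], 'F2': [4.0, 5.0, 6.0]})
notional = np.array([10.0, 20.0, 30.0])

# notional[:, None] reshapes (3,) -> (3, 1); multiplying against the
# (3, 2) sensitivity array broadcasts each asset's notional across all
# factor columns in one NumPy operation, with no per-column Python loop.
scaled = sens[['F1', 'F2']].values * notional[:, None]
print(scaled)
# [[ 10.  40.]
#  [ 40. 100.]
#  [ 90. 180.]]
```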

**2. Improved DataFrame Concatenation Pattern**
Instead of chaining `.agg("sum").to_frame().rename().T`, the optimized code pre-creates the aggregated row with the correct name:

```python
total_row = universe_sensitivities_df.agg("sum")
total_row.name = "Total Factor Category Exposure"
universe_sensitivities_df = pd.concat([universe_sensitivities_df, total_row.to_frame().T])
```
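For illustration, a quick equivalence check of the two patterns on toy data (the `rename` arguments here are an assumption about the original chained call):

```python
import pandas as pd

df = pd.DataFrame({'F1': [1.0, 2.0], 'F2': [3.0, 4.0]})

# Original chained pattern: sum -> frame -> rename -> transpose
old = pd.concat([df, df.agg("sum").to_frame()
                 .rename(columns={0: "Total Factor Category Exposure"}).T])

# Optimized pattern: name the Series up front, then transpose once
total_row = df.agg("sum")
total_row.name = "Total Factor Category Exposure"
new = pd.concat([df, total_row.to_frame().T])

assert old.equals(new)  # identical result, fewer intermediate objects
```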

**Performance Impact by Test Case:**

- **Large-scale scenarios** see the biggest gains (20-270% faster), where the vectorization benefits compound
- **Small datasets** show modest improvements or slight regressions due to vectorization overhead
- **Error cases** are slower because the additional array setup runs before the exception is raised

The optimizations particularly excel when processing many factors and assets simultaneously, making this well-suited for portfolio analysis workloads with substantial data volumes.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 36 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
```python
from collections import namedtuple

import pandas as pd
# imports
import pytest  # used for our unit tests
from gs_quant.markets.portfolio_manager_utils import build_exposure_df

# unit tests

# Helper for factor categories
FactorCategory = namedtuple('FactorCategory', ['name', 'id'])

# ---------- BASIC TEST CASES ----------

def test_basic_single_factor_empty_factor_data_by_name():
    # Test with one asset, one factor, by_name True, factor_data empty
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]}, index=[0])
    universe_sensitivities_df = pd.DataFrame({'Factor1': [50]}, index=[0])
    factor_categories = [FactorCategory(name='Factor1', id='F1')]
    factor_data = pd.DataFrame([])  # empty
    by_name = True

    # Run function
    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.12ms -> 2.20ms (3.80% slower)

def test_basic_multiple_factors_empty_factor_data_by_id():
    # Test with two assets, two factors, by_name False, factor_data empty
    notional_df = pd.DataFrame({'Asset Name': ['A', 'B'], 'Notional': [100, 200]}, index=[0, 1])
    universe_sensitivities_df = pd.DataFrame({'F1': [50, 60], 'F2': [30, 40]}, index=[0, 1])
    factor_categories = [FactorCategory(name='Factor1', id='F1'), FactorCategory(name='Factor2', id='F2')]
    factor_data = pd.DataFrame([])
    by_name = False

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.39ms -> 2.22ms (7.56% faster)

def test_basic_with_factor_data_by_name():
    # Test with factor_data present, by_name True
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]}, index=[0])
    universe_sensitivities_df = pd.DataFrame({'Factor1': [50]}, index=[0])
    factor_categories = [FactorCategory(name='Equity', id='EQ')]
    factor_data = pd.DataFrame({'name': ['Factor1'], 'factorCategory': ['Equity']})
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 3.70ms -> 3.80ms (2.73% slower)

def test_basic_with_factor_data_by_id():
    # Test with factor_data present, by_name False
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]}, index=[0])
    universe_sensitivities_df = pd.DataFrame({'F1': [50]}, index=[0])
    factor_categories = [FactorCategory(name='Equity', id='EQ')]
    factor_data = pd.DataFrame({'identifier': ['F1'], 'factorCategoryId': ['EQ']})
    by_name = False

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 3.68ms -> 3.85ms (4.37% slower)

# ---------- EDGE TEST CASES ----------

def test_empty_notional_df():
    # notional_df empty
    notional_df = pd.DataFrame({'Asset Name': [], 'Notional': []})
    universe_sensitivities_df = pd.DataFrame({'Factor1': []})
    factor_categories = [FactorCategory(name='Factor1', id='F1')]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.07ms -> 2.15ms (3.95% slower)

def test_empty_universe_sensitivities_df():
    # universe_sensitivities_df empty
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]})
    universe_sensitivities_df = pd.DataFrame({})
    factor_categories = []
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 1.68ms -> 1.79ms (6.18% slower)

def test_empty_factor_categories():
    # factor_categories empty
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]})
    universe_sensitivities_df = pd.DataFrame({'Factor1': [50]})
    factor_categories = []
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 1.94ms -> 2.07ms (6.45% slower)

def test_empty_factor_data_with_factor_categories():
    # factor_data empty but factor_categories present
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]})
    universe_sensitivities_df = pd.DataFrame({'Factor1': [50], 'Factor2': [60]})
    factor_categories = [FactorCategory(name='Factor1', id='F1')]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.26ms -> 2.22ms (1.68% faster)

def test_factor_data_with_missing_columns():
    # factor_data missing required columns
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]})
    universe_sensitivities_df = pd.DataFrame({'Factor1': [50]})
    factor_categories = [FactorCategory(name='Equity', id='EQ')]
    factor_data = pd.DataFrame({'wrong_name': ['Factor1'], 'wrong_category': ['Equity']})
    by_name = True

    # Should raise KeyError when trying to set index
    with pytest.raises(KeyError):
        build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name) # 254μs -> 580μs (56.1% slower)

def test_mismatched_factor_names():
    # universe_sensitivities_df columns not matching factor_data
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]})
    universe_sensitivities_df = pd.DataFrame({'UnknownFactor': [50]})
    factor_categories = [FactorCategory(name='Equity', id='EQ')]
    factor_data = pd.DataFrame({'name': ['Factor1'], 'factorCategory': ['Equity']})
    by_name = True

    # Should raise KeyError when looking up factor name
    with pytest.raises(KeyError):
        build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name) # 408μs -> 851μs (52.0% slower)

def test_zero_sensitivity():
    # Sensitivity is zero
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [100]})
    universe_sensitivities_df = pd.DataFrame({'Factor1': [0]})
    factor_categories = [FactorCategory(name='Factor1', id='F1')]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.09ms -> 2.22ms (5.56% slower)

def test_negative_notional_and_sensitivity():
    # Negative notional and negative sensitivity
    notional_df = pd.DataFrame({'Asset Name': ['A'], 'Notional': [-100]})
    universe_sensitivities_df = pd.DataFrame({'Factor1': [-50]})
    factor_categories = [FactorCategory(name='Factor1', id='F1')]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.13ms -> 2.20ms (3.18% slower)

def test_nan_notional_and_sensitivity():
    # NaN values in notional and sensitivity
    notional_df = pd.DataFrame({'Asset Name': ['A', 'B'], 'Notional': [100, float('nan')]})
    universe_sensitivities_df = pd.DataFrame({'Factor1': [50, float('nan')]})
    factor_categories = [FactorCategory(name='Factor1', id='F1')]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.19ms -> 2.27ms (3.82% slower)

# ---------- LARGE SCALE TEST CASES ----------

def test_large_scale_many_assets_factors():
    # 500 assets, 10 factors, factor_data present
    n_assets = 500
    n_factors = 10
    asset_names = [f'A{i}' for i in range(n_assets)]
    notional_values = [i*10 for i in range(n_assets)]
    notional_df = pd.DataFrame({'Asset Name': asset_names, 'Notional': notional_values})

    factor_names = [f'Factor{i}' for i in range(n_factors)]
    sensitivities = [[(j+1)*10 for _ in range(n_factors)] for j in range(n_assets)]
    universe_sensitivities_df = pd.DataFrame(sensitivities, columns=factor_names)

    factor_categories = [FactorCategory(name=f'Cat{i}', id=f'C{i}') for i in range(n_factors)]
    factor_data = pd.DataFrame({'name': factor_names, 'factorCategory': [f'Cat{i}' for i in range(n_factors)]})
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 5.05ms -> 4.20ms (20.2% faster)
    # Check that all columns are present
    for i in range(n_factors):
        pass

def test_large_scale_many_factors_empty_factor_data():
    # 100 assets, 50 factors, factor_data empty
    n_assets = 100
    n_factors = 50
    asset_names = [f'A{i}' for i in range(n_assets)]
    notional_values = [100 for _ in range(n_assets)]
    notional_df = pd.DataFrame({'Asset Name': asset_names, 'Notional': notional_values})

    factor_names = [f'F{i}' for i in range(n_factors)]
    sensitivities = [[i for i in range(n_factors)] for _ in range(n_assets)]
    universe_sensitivities_df = pd.DataFrame(sensitivities, columns=factor_names)

    factor_categories = [FactorCategory(name=f'F{i}', id=f'F{i}') for i in range(n_factors)]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 8.85ms -> 2.39ms (269% faster)
    for i in range(n_factors):
        pass
    # Check exposures for asset 0
    for i in range(n_factors):
        pass
    # Total row
    for i in range(n_factors):
        pass

def test_large_scale_all_zero_sensitivities():
    # Large scale, all sensitivities zero
    n_assets = 200
    n_factors = 5
    notional_df = pd.DataFrame({'Asset Name': [f'A{i}' for i in range(n_assets)], 'Notional': [100]*n_assets})
    universe_sensitivities_df = pd.DataFrame({f'F{i}': [0]*n_assets for i in range(n_factors)})
    factor_categories = [FactorCategory(name=f'F{i}', id=f'F{i}') for i in range(n_factors)]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.91ms -> 2.30ms (26.2% faster)

    for i in range(n_factors):
        pass

def test_large_scale_nan_in_some_rows():
    # Large scale, some NaNs in notional and sensitivities
    n_assets = 300
    n_factors = 3
    notional = [100 if i % 10 != 0 else float('nan') for i in range(n_assets)]
    sensitivities = [[50 if j % 2 == 0 else float('nan') for j in range(n_factors)] for i in range(n_assets)]
    notional_df = pd.DataFrame({'Asset Name': [f'A{i}' for i in range(n_assets)], 'Notional': notional})
    universe_sensitivities_df = pd.DataFrame(sensitivities, columns=[f'F{i}' for i in range(n_factors)])
    factor_categories = [FactorCategory(name=f'F{i}', id=f'F{i}') for i in range(n_factors)]
    factor_data = pd.DataFrame([])
    by_name = True

    codeflash_output = build_exposure_df(notional_df.copy(), universe_sensitivities_df.copy(), factor_categories, factor_data.copy(), by_name); result = codeflash_output # 2.80ms -> 2.76ms (1.36% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from collections import namedtuple

import pandas as pd
# imports
import pytest
from gs_quant.markets.portfolio_manager_utils import build_exposure_df

# unit tests

# Helper for factor categories
FactorCat = namedtuple("FactorCat", ["name", "id"])

# --- BASIC TEST CASES ---

def test_basic_single_factor_no_factor_data_by_name():
    # Single asset, single factor, no factor_data, by_name True
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"Equity": [50]})
    factor_categories = [FactorCat("Equity", 1)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 2.11ms -> 2.22ms (5.22% slower)

def test_basic_multiple_factors_no_factor_data_by_id():
    # Multiple assets, multiple factors, no factor_data, by_name False
    notional_df = pd.DataFrame({"Asset Name": ["A", "B"], "Notional": [1000, 2000]})
    sens_df = pd.DataFrame({1: [50, 60], 2: [20, 30]})
    factor_categories = [FactorCat("Equity", 1), FactorCat("Credit", 2)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=False); result = codeflash_output # 2.51ms -> 2.26ms (11.2% faster)

def test_basic_with_factor_data_by_name():
    # With factor_data, by_name True
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"EquityFactor": [50], "CreditFactor": [20]})
    factor_categories = [FactorCat("Equity", 1), FactorCat("Credit", 2)]
    factor_data = pd.DataFrame({
        "name": ["EquityFactor", "CreditFactor"],
        "identifier": [101, 102],
        "factorCategory": ["Equity", "Credit"],
        "factorCategoryId": [1, 2]
    })
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 3.93ms -> 3.90ms (0.990% faster)

def test_basic_with_factor_data_by_id():
    # With factor_data, by_name False
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({101: [50], 102: [20]})
    factor_categories = [FactorCat("Equity", 1), FactorCat("Credit", 2)]
    factor_data = pd.DataFrame({
        "name": ["EquityFactor", "CreditFactor"],
        "identifier": [101, 102],
        "factorCategory": ["Equity", "Credit"],
        "factorCategoryId": [1, 2]
    })
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=False); result = codeflash_output # 4.30ms -> 4.10ms (4.91% faster)

# --- EDGE TEST CASES ---

def test_empty_notional_df():
    # Notional df is empty
    notional_df = pd.DataFrame({"Asset Name": [], "Notional": []})
    sens_df = pd.DataFrame({"Equity": []})
    factor_categories = [FactorCat("Equity", 1)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 2.07ms -> 2.17ms (4.78% slower)

def test_empty_sensitivities_df():
    # Sensitivities df is empty
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame()
    factor_categories = []
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 1.71ms -> 1.78ms (4.40% slower)

def test_empty_factor_categories():
    # factor_categories is empty
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"Equity": [50]})
    factor_categories = []
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 1.98ms -> 2.08ms (4.67% slower)

def test_empty_factor_data_with_nonempty_categories():
    # factor_data empty, factor_categories non-empty
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"Equity": [50], "Credit": [20]})
    factor_categories = [FactorCat("Equity", 1)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 2.26ms -> 2.24ms (1.04% faster)

def test_factor_data_missing_factor():
    # factor_data missing a factor present in sensitivities
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"EquityFactor": [50], "CreditFactor": [20]})
    factor_categories = [FactorCat("Equity", 1), FactorCat("Credit", 2)]
    factor_data = pd.DataFrame({
        "name": ["EquityFactor"],
        "identifier": [101],
        "factorCategory": ["Equity"],
        "factorCategoryId": [1]
    })
    # Should raise KeyError when trying to map missing factor
    with pytest.raises(KeyError):
        build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True) # 576μs -> 820μs (29.7% slower)

def test_factor_categories_not_in_sensitivities():
    # factor_categories contains a category not present in sensitivities
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"Equity": [50]})
    factor_categories = [FactorCat("Credit", 2)]
    factor_data = pd.DataFrame()
    # Should raise KeyError for missing category
    with pytest.raises(KeyError):
        build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True) # 398μs -> 682μs (41.6% slower)

def test_negative_notional_and_sensitivity():
    # Negative notional and sensitivity
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [-1000]})
    sens_df = pd.DataFrame({"Equity": [-50]})
    factor_categories = [FactorCat("Equity", 1)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 2.13ms -> 2.22ms (4.17% slower)

def test_zero_sensitivity():
    # Zero sensitivity
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"Equity": [0]})
    factor_categories = [FactorCat("Equity", 1)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 2.10ms -> 2.22ms (5.03% slower)

def test_zero_notional():
    # Zero notional
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [0]})
    sens_df = pd.DataFrame({"Equity": [50]})
    factor_categories = [FactorCat("Equity", 1)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 2.12ms -> 2.19ms (3.24% slower)

def test_duplicate_factor_categories():
    # Duplicate factor categories in input
    notional_df = pd.DataFrame({"Asset Name": ["A"], "Notional": [1000]})
    sens_df = pd.DataFrame({"Equity": [50]})
    factor_categories = [FactorCat("Equity", 1), FactorCat("Equity", 1)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 2.17ms -> 2.22ms (2.17% slower)

# --- LARGE SCALE TEST CASES ---

def test_large_scale_many_assets_and_factors():
    # 100 assets, 10 factors
    assets = [f"A{i}" for i in range(100)]
    notional = [1000 + i for i in range(100)]
    notional_df = pd.DataFrame({"Asset Name": assets, "Notional": notional})
    factor_names = [f"F{i}" for i in range(10)]
    sens_data = {name: [i*10 for i in range(100)] for name in factor_names}
    sens_df = pd.DataFrame(sens_data)
    factor_categories = [FactorCat(name, i) for i, name in enumerate(factor_names)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 3.50ms -> 2.26ms (54.7% faster)
    for name in factor_names:
        pass

def test_large_scale_with_factor_data():
    # 50 assets, 5 factors, with factor_data
    assets = [f"A{i}" for i in range(50)]
    notional = [1000 + i for i in range(50)]
    notional_df = pd.DataFrame({"Asset Name": assets, "Notional": notional})
    factor_names = [f"F{i}" for i in range(5)]
    sens_data = {name: [i*10 for i in range(50)] for name in factor_names}
    sens_df = pd.DataFrame(sens_data)
    factor_categories = [FactorCat(name, i) for i, name in enumerate(factor_names)]
    factor_data = pd.DataFrame({
        "name": factor_names,
        "identifier": list(range(100, 105)),
        "factorCategory": factor_names,
        "factorCategoryId": list(range(5))
    })
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 4.29ms -> 3.91ms (9.71% faster)
    # MultiIndex columns
    for name in factor_names:
        pass

def test_large_scale_empty_factor_categories():
    # 100 assets, 10 factors, empty factor_categories
    assets = [f"A{i}" for i in range(100)]
    notional = [1000 + i for i in range(100)]
    notional_df = pd.DataFrame({"Asset Name": assets, "Notional": notional})
    factor_names = [f"F{i}" for i in range(10)]
    sens_data = {name: [i*10 for i in range(100)] for name in factor_names}
    sens_df = pd.DataFrame(sens_data)
    factor_categories = []
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 3.27ms -> 2.15ms (51.9% faster)
    # Should include all factor columns
    for name in factor_names:
        pass

def test_large_scale_empty_factor_data():
    # 100 assets, 10 factors, factor_data empty
    assets = [f"A{i}" for i in range(100)]
    notional = [1000 + i for i in range(100)]
    notional_df = pd.DataFrame({"Asset Name": assets, "Notional": notional})
    factor_names = [f"F{i}" for i in range(10)]
    sens_data = {name: [i*10 for i in range(100)] for name in factor_names}
    sens_df = pd.DataFrame(sens_data)
    factor_categories = [FactorCat(name, i) for i, name in enumerate(factor_names)]
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 3.51ms -> 2.29ms (53.2% faster)
    # Should not raise and should include all columns
    for name in factor_names:
        pass

def test_large_scale_empty_inputs():
    # All inputs empty
    notional_df = pd.DataFrame({"Asset Name": [], "Notional": []})
    sens_df = pd.DataFrame()
    factor_categories = []
    factor_data = pd.DataFrame()
    codeflash_output = build_exposure_df(notional_df, sens_df, factor_categories, factor_data, by_name=True); result = codeflash_output # 1.60ms -> 1.68ms (4.62% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, run `git checkout codeflash/optimize-build_exposure_df-mhb51lmn` and push.

codeflash-ai bot requested a review from mashraf-222 on October 28, 2025 at 22:28
codeflash-ai bot added the ⚡️ codeflash and 🎯 Quality: High labels on Oct 28, 2025