@codeflash-ai codeflash-ai bot commented Oct 28, 2025

📄 140% (1.40x) speedup for cashflows_handler in gs_quant/risk/result_handlers.py

⏱️ Runtime : 60.4 milliseconds → 25.1 milliseconds (best of 17 runs)

📝 Explanation and details

The optimization achieves a 140% speedup through three key changes:

1. Fast Date Parsing with Fallback Strategy

  • Replaced the lambda that called dt.datetime.strptime() with a dedicated __str_to_date_fast() function
  • Uses direct string splitting and dt.date(int(year), int(month), int(day)) for the common 'YYYY-MM-DD' format
  • Falls back to strptime() only when the fast path fails
  • This optimization is most effective for large datasets with many date fields
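
The fast path with fallback described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the name `str_to_date_fast` (without the leading underscores) and the exact pass-through rule for non-string values are assumptions based on the bullets and on the test comments in this report.

```python
import datetime as dt


def str_to_date_fast(value):
    """Parse a 'YYYY-MM-DD' string quickly, falling back to strptime.

    Hypothetical sketch: non-string inputs (None, date objects) are
    returned unchanged, which is an assumption drawn from the tests.
    """
    if not isinstance(value, str):
        return value
    try:
        # Fast path: split on '-' and build the date directly.
        year, month, day = value.split('-')
        return dt.date(int(year), int(month), int(day))
    except ValueError:
        # Slow path: let strptime parse the value or raise ValueError.
        return dt.datetime.strptime(value, '%Y-%m-%d').date()
```

A string such as '06/01/2024' fails the unpacking in the fast path, reaches `strptime`, and still raises `ValueError`, which matches what the invalid-date tests expect. Note that `int()` tolerates surrounding whitespace, so a fast path written this way may accept a few inputs that `strptime` would reject.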

2. Generator to List Comprehension Conversion

  • Changed records = ([row.get(field_from)...] for row in result) (generator) to records = [[row.get(field_from)...] for row in result] (list)
  • Eliminates the overhead of generator evaluation during DataFrame construction
  • Provides better memory locality for subsequent operations
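
To make the difference concrete, here is a small pure-Python sketch of the two ways of building `records`. The `FIELD_MAP` tuple is a hypothetical three-field subset chosen for illustration; the real handler maps more fields and passes the result to a DataFrame constructor, which is not shown here.

```python
# Hypothetical mapping (source key -> output column), shortened for illustration.
FIELD_MAP = (
    ('currency', 'currency'),
    ('payDate', 'payment_date'),
    ('payAmount', 'payment_amount'),
)


def build_records_lazy(rows):
    # Generator expression: rows are produced one at a time, on demand.
    return ([row.get(src) for src, _ in FIELD_MAP] for row in rows)


def build_records_eager(rows):
    # List comprehension: all rows are materialized up front.
    return [[row.get(src) for src, _ in FIELD_MAP] for row in rows]


rows = [
    {'currency': 'USD', 'payDate': '2024-06-01', 'payAmount': 100.0},
    {'currency': 'JPY'},  # missing keys become None via dict.get
]
lazy = build_records_lazy(rows)
eager = build_records_eager(rows)
```

Both forms yield the same rows. The list has a known length and can be iterated repeatedly, whereas the generator is consumed after a single pass, so a consumer that needs the size or a second pass has to materialize it first.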

3. Pandas-Style Apply Instead of Map

  • Replaced df[dt_col].map(lambda x: ...) with df[dt_col].apply(__str_to_date_fast)
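  • On a pandas Series, apply and map with a plain callable do comparable element-wise work, so most of this gain likely comes from the cheaper parser rather than from the method switch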
  • Eliminates lambda function call overhead

Performance Impact by Test Case:

  • Large datasets see the biggest gains: 345% faster for 1000 cashflows, 262% faster for 500 varied dates
  • Small datasets see modest improvements: 1-8% faster for basic cases
  • Edge cases with non-string dates are slightly slower (0.5-7%) due to the additional isinstance check, but this is negligible compared to the gains on typical string date inputs
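
One way to check the date-parsing part of these numbers in isolation is a small `timeit` micro-benchmark. The sketch below is an assumption-laden illustration: the helper names are hypothetical, it measures only the two parsing strategies (not `cashflows_handler` itself), and the timings will vary by machine.

```python
import datetime as dt
import timeit


def parse_strptime(s):
    # Baseline: format-string parsing, as in the original lambda.
    return dt.datetime.strptime(s, '%Y-%m-%d').date()


def parse_split(s):
    # Fast path: split the string and construct the date directly.
    year, month, day = s.split('-')
    return dt.date(int(year), int(month), int(day))


# 1000 valid 'YYYY-MM-DD' strings with varied days.
dates = [f'2024-06-{(i % 28) + 1:02d}' for i in range(1000)]


def bench(func, repeat=5):
    # Best-of-N wall time for parsing the whole list once.
    return min(timeit.repeat(lambda: [func(s) for s in dates],
                             number=1, repeat=repeat))


if __name__ == '__main__':
    print(f'strptime: {bench(parse_strptime) * 1e3:.2f} ms')
    print(f'split:    {bench(parse_split) * 1e3:.2f} ms')
```

Taking the minimum over several repeats mirrors the "best of N runs" convention used in this report and reduces noise from other processes.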

The optimizations are particularly effective for financial data processing where date parsing is a bottleneck and datasets contain hundreds to thousands of records.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 23 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import datetime as dt

# imports
import pytest
from gs_quant.risk.result_handlers import cashflows_handler


# Minimal stubs for dependencies (used only for typing, not for mocking)
class InstrumentBase:
    pass

class RiskKey:
    def __init__(self, key=None):
        self.key = key

class DataFrameWithInfo(list):
    def __init__(self, records=None, risk_key=None, request_id=None):
        super().__init__(records if records is not None else [])
        self.risk_key = risk_key
        self.request_id = request_id
        self.columns = []
    def __getitem__(self, item):
        # Support for accessing columns by name
        if isinstance(item, str):
            idx = self.columns.index(item)
            return [row[idx] for row in self]
        return super().__getitem__(item)
    def __setitem__(self, item, value):
        # Support for setting columns by name
        if isinstance(item, str):
            idx = self.columns.index(item)
            for i, v in enumerate(value):
                self[i][idx] = v
        else:
            super().__setitem__(item, value)
    def map(self, func):
        # For compatibility with pandas-like .map in tests
        return [func(x) for x in self]

# ------------------ UNIT TESTS ------------------

# Basic Test Cases

def test_empty_cashflows():
    # Test with empty cashflows list
    risk_key = RiskKey('basic')
    result = {'cashflows': []}
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 175μs -> 173μs (1.59% faster)


def test_multiple_cashflows_varied_fields():
    # Test with multiple cashflows, some fields None
    risk_key = RiskKey('basic')
    result = {
        'cashflows': [
            {
                'currency': 'EUR',
                'payDate': '2024-07-01',
                'setDate': '2024-06-28',
                'accStart': '2024-06-01',
                'accEnd': '2024-06-30',
                'payAmount': 200.0,
                'notional': 2000.0,
                'paymentType': 'Floating',
                'index': 'EURIBOR',
                'indexTerm': '6M',
                'dayCountFraction': 0.083,
                'spread': None,
                'rate': 0.03,
                'discountFactor': 0.97
            },
            {
                'currency': 'GBP',
                'payDate': '2024-08-01',
                'setDate': '2024-07-30',
                'accStart': '2024-07-01',
                'accEnd': '2024-07-31',
                'payAmount': 300.0,
                'notional': 3000.0,
                'paymentType': 'Fixed',
                'index': None,
                'indexTerm': None,
                'dayCountFraction': 0.084,
                'spread': 0.02,
                'rate': None,
                'discountFactor': 0.96
            }
        ]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 1.07ms -> 1.05ms (2.71% faster)

def test_request_id_preservation():
    # Test that request_id is preserved in output
    risk_key = RiskKey('basic')
    request_id = 'REQ-1234'
    result = {'cashflows': []}
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase(), request_id=request_id); df = codeflash_output # 167μs -> 162μs (3.44% faster)

# Edge Test Cases

def test_missing_fields_in_cashflow():
    # Test cashflow dict with missing keys
    risk_key = RiskKey('edge')
    result = {
        'cashflows': [{
            'currency': 'JPY',  # Only currency provided
        }]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 865μs -> 906μs (4.51% slower)
    # All columns except 'currency' should be None
    for col in df.columns:
        if col == 'currency':
            pass
        else:
            pass

def test_invalid_date_format():
    # Test with invalid date string (should raise ValueError)
    risk_key = RiskKey('edge')
    result = {
        'cashflows': [{
            'currency': 'USD',
            'payDate': '06/01/2024',  # Wrong format
            'setDate': '2024-05-30',
            'accStart': '2024-05-01',
            'accEnd': '2024-05-31',
            'payAmount': 100.0,
            'notional': 1000.0,
            'paymentType': 'Fixed',
            'index': 'LIBOR',
            'indexTerm': '3M',
            'dayCountFraction': 0.083,
            'spread': 0.01,
            'rate': 0.02,
            'discountFactor': 0.98
        }]
    }
    with pytest.raises(ValueError):
        cashflows_handler(result, risk_key, InstrumentBase()) # 437μs -> 453μs (3.59% slower)

def test_date_objects_pass_through():
    # Test with date objects instead of strings (should pass through)
    risk_key = RiskKey('edge')
    result = {
        'cashflows': [{
            'currency': 'USD',
            'payDate': dt.date(2024, 6, 1),
            'setDate': dt.date(2024, 5, 30),
            'accStart': dt.date(2024, 5, 1),
            'accEnd': dt.date(2024, 5, 31),
            'payAmount': 100.0,
            'notional': 1000.0,
            'paymentType': 'Fixed',
            'index': 'LIBOR',
            'indexTerm': '3M',
            'dayCountFraction': 0.083,
            'spread': 0.01,
            'rate': 0.02,
            'discountFactor': 0.98
        }]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 949μs -> 954μs (0.499% slower)

def test_none_dates():
    # Test with None for date fields (should remain None)
    risk_key = RiskKey('edge')
    result = {
        'cashflows': [{
            'currency': 'USD',
            'payDate': None,
            'setDate': None,
            'accStart': None,
            'accEnd': None,
            'payAmount': 100.0,
            'notional': 1000.0,
            'paymentType': 'Fixed',
            'index': 'LIBOR',
            'indexTerm': '3M',
            'dayCountFraction': 0.083,
            'spread': 0.01,
            'rate': 0.02,
            'discountFactor': 0.98
        }]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 886μs -> 956μs (7.33% slower)
    for col in ['payment_date', 'set_date', 'accrual_start_date', 'accrual_end_date']:
        pass

def test_unexpected_extra_fields():
    # Cashflow dict contains unexpected extra fields (should ignore them)
    risk_key = RiskKey('edge')
    result = {
        'cashflows': [{
            'currency': 'USD',
            'payDate': '2024-06-01',
            'setDate': '2024-05-30',
            'accStart': '2024-05-01',
            'accEnd': '2024-05-31',
            'payAmount': 100.0,
            'notional': 1000.0,
            'paymentType': 'Fixed',
            'index': 'LIBOR',
            'indexTerm': '3M',
            'dayCountFraction': 0.083,
            'spread': 0.01,
            'rate': 0.02,
            'discountFactor': 0.98,
            'unexpected_field': 'SHOULD_BE_IGNORED'
        }]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 997μs -> 966μs (3.28% faster)

def test_missing_cashflows_key():
    # Test with missing 'cashflows' key (should raise KeyError)
    risk_key = RiskKey('edge')
    result = {}
    with pytest.raises(KeyError):
        cashflows_handler(result, risk_key, InstrumentBase()) # 944ns -> 985ns (4.16% slower)

# Large Scale Test Cases

def test_large_number_of_cashflows():
    # Test with 1000 cashflows
    n = 1000
    risk_key = RiskKey('large')
    result = {
        'cashflows': [{
            'currency': 'USD',
            'payDate': '2024-06-01',
            'setDate': '2024-05-30',
            'accStart': '2024-05-01',
            'accEnd': '2024-05-31',
            'payAmount': float(i),
            'notional': float(i * 1000),
            'paymentType': 'Fixed' if i % 2 == 0 else 'Floating',
            'index': 'LIBOR' if i % 2 == 0 else 'EURIBOR',
            'indexTerm': '3M',
            'dayCountFraction': 0.083,
            'spread': 0.01,
            'rate': 0.02,
            'discountFactor': 0.98
        } for i in range(n)]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 14.8ms -> 3.31ms (345% faster)

def test_large_cashflows_varied_dates():
    # Test with 500 cashflows with varied dates (all valid 'YYYY-MM-DD' strings)
    n = 500
    risk_key = RiskKey('large')
    result = {
        'cashflows': [{
            'currency': 'USD',
            'payDate': f'2024-06-{(i%28)+1:02d}',
            'setDate': f'2024-05-{(i%28)+1:02d}',
            'accStart': f'2024-05-{(i%28)+1:02d}',
            'accEnd': f'2024-05-{(i%28)+2:02d}',
            'payAmount': float(i),
            'notional': float(i * 1000),
            'paymentType': 'Fixed',
            'index': 'LIBOR',
            'indexTerm': '3M',
            'dayCountFraction': 0.083,
            'spread': 0.01,
            'rate': 0.02,
            'discountFactor': 0.98
        } for i in range(n)]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 8.07ms -> 2.23ms (262% faster)
    # Check date conversion for a few rows
    for i in [0, n//2, n-1]:
        pass

def test_large_cashflows_missing_fields():
    # Test with 1000 cashflows, half missing some fields
    n = 1000
    risk_key = RiskKey('large')
    result = {
        'cashflows': [{
            'currency': 'USD' if i % 2 == 0 else None,
            'payDate': '2024-06-01' if i % 2 == 0 else None,
            'setDate': '2024-05-30' if i % 2 == 0 else None,
            'accStart': '2024-05-01' if i % 2 == 0 else None,
            'accEnd': '2024-05-31' if i % 2 == 0 else None,
            'payAmount': float(i) if i % 2 == 0 else None,
            'notional': float(i * 1000) if i % 2 == 0 else None,
            'paymentType': 'Fixed' if i % 2 == 0 else None,
            'index': 'LIBOR' if i % 2 == 0 else None,
            'indexTerm': '3M' if i % 2 == 0 else None,
            'dayCountFraction': 0.083 if i % 2 == 0 else None,
            'spread': 0.01 if i % 2 == 0 else None,
            'rate': 0.02 if i % 2 == 0 else None,
            'discountFactor': 0.98 if i % 2 == 0 else None
        } for i in range(n)]
    }
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 8.46ms -> 2.81ms (201% faster)
    # Check that odd-indexed rows have None for all fields
    for i in [1, 3, n-1]:
        for col in df.columns:
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import datetime as dt
from typing import Iterable, Optional

# imports
import pytest  # used for our unit tests
from gs_quant.risk.result_handlers import cashflows_handler


# Minimal stub classes to allow the function to run in isolation
class InstrumentBase:
    pass

class RiskKey:
    def __init__(self, key: str = "default"):
        self.key = key

class DataFrameWithInfo(list):
    # Inherit from list for basic tabular structure, add columns/risk_key/request_id attributes
    def __init__(self, iterable=None, risk_key=None, request_id=None):
        super().__init__(iterable if iterable is not None else [])
        self.risk_key = risk_key
        self.request_id = request_id
        self.columns = []
    def __getitem__(self, key):
        # Allow column access by name (returns a list of values for that column)
        if key in self.columns:
            idx = self.columns.index(key)
            return [row[idx] for row in self]
        raise KeyError(key)
    def __setitem__(self, key, values):
        # Allow setting entire column by name
        if key not in self.columns:
            raise KeyError(key)
        idx = self.columns.index(key)
        for i, row in enumerate(self):
            row[idx] = values[i]
    def map_column(self, key, func):
        # Helper for mapping a function to a column
        idx = self.columns.index(key)
        for i, row in enumerate(self):
            row[idx] = func(row[idx])

# unit tests

# Helper function to build a single cashflow dict
def make_cashflow(**kwargs):
    # Fill all possible fields with defaults unless specified
    fields = {
        'currency': 'USD',
        'payDate': '2024-06-01',
        'setDate': '2024-05-30',
        'accStart': '2024-05-01',
        'accEnd': '2024-05-31',
        'payAmount': 100.0,
        'notional': 1000.0,
        'paymentType': 'Interest',
        'index': 'LIBOR',
        'indexTerm': '3M',
        'dayCountFraction': 0.0833,
        'spread': 0.01,
        'rate': 0.025,
        'discountFactor': 0.99
    }
    fields.update(kwargs)
    return fields

# 1. Basic Test Cases

def test_basic_single_cashflow():
    # Test with a single cashflow entry
    result = {'cashflows': [make_cashflow()]}
    risk_key = RiskKey("basic")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 1.03ms -> 976μs (4.99% faster)
    # Columns should match
    expected_columns = [
        'currency', 'payment_date', 'set_date', 'accrual_start_date', 'accrual_end_date',
        'payment_amount', 'notional', 'payment_type', 'floating_rate_option',
        'floating_rate_designated_maturity', 'day_count_fraction', 'spread', 'rate', 'discount_factor'
    ]

def test_basic_multiple_cashflows():
    # Test with multiple cashflow entries
    result = {'cashflows': [
        make_cashflow(payAmount=100.0, currency='USD'),
        make_cashflow(payAmount=200.0, currency='EUR', payDate='2024-07-01', setDate='2024-06-30', accStart='2024-06-01', accEnd='2024-06-30')
    ]}
    risk_key = RiskKey("multi")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase(), request_id="req123"); df = codeflash_output # 1.01ms -> 935μs (7.56% faster)


def test_empty_cashflows_list():
    # Test with empty cashflows list
    result = {'cashflows': []}
    risk_key = RiskKey("empty")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 178μs -> 175μs (1.63% faster)

def test_missing_cashflows_key():
    # Test with missing 'cashflows' key
    result = {}
    risk_key = RiskKey("missing_key")
    with pytest.raises(KeyError):
        cashflows_handler(result, risk_key, InstrumentBase()) # 1.00μs -> 894ns (12.0% faster)

def test_invalid_date_format():
    # Test with invalid date format (should raise ValueError)
    result = {'cashflows': [
        make_cashflow(payDate='06/01/2024')  # wrong format
    ]}
    risk_key = RiskKey("bad_date")
    with pytest.raises(ValueError):
        cashflows_handler(result, risk_key, InstrumentBase()) # 461μs -> 470μs (1.88% slower)

def test_non_string_date():
    # Test with date fields already as date objects (should pass through)
    result = {'cashflows': [
        make_cashflow(payDate=dt.date(2024, 6, 1), setDate=dt.date(2024, 5, 30),
                      accStart=dt.date(2024, 5, 1), accEnd=dt.date(2024, 5, 31))
    ]}
    risk_key = RiskKey("date_obj")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 923μs -> 974μs (5.15% slower)

def test_null_values_in_cashflow():
    # Test with None values in cashflow fields
    result = {'cashflows': [
        make_cashflow(currency=None, payAmount=None)
    ]}
    risk_key = RiskKey("nulls")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 1.02ms -> 949μs (7.13% faster)

def test_unexpected_extra_fields():
    # Test with extra fields in cashflow dict (should be ignored)
    result = {'cashflows': [
        dict(make_cashflow(), extra_field='SHOULD_BE_IGNORED')
    ]}
    risk_key = RiskKey("extra")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 1.01ms -> 960μs (5.04% faster)


def test_incorrect_field_types():
    # Test with incorrect types (e.g., string for notional)
    result = {'cashflows': [
        make_cashflow(notional='not_a_number')
    ]}
    risk_key = RiskKey("bad_type")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 1.08ms -> 1.04ms (4.12% faster)

# 3. Large Scale Test Cases



def test_large_scale_missing_fields():
    # Test with some cashflows missing different fields
    result = {'cashflows': [
        make_cashflow() if i % 2 == 0 else {'currency': 'EUR', 'payDate': '2024-06-01'}
        for i in range(100)
    ]}
    risk_key = RiskKey("partial")
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 1.95ms -> 1.25ms (55.7% faster)
    # Odd rows should have many None fields
    for i in range(1, 100, 2):
        for col in df.columns:
            if col not in ['currency', 'payment_date']:
                pass

def test_large_scale_performance():
    # Test performance for 999 cashflows (should complete quickly)
    import time
    result = {'cashflows': [
        make_cashflow(payAmount=i)
        for i in range(999)
    ]}
    risk_key = RiskKey("perf")
    start = time.time()
    codeflash_output = cashflows_handler(result, risk_key, InstrumentBase()); df = codeflash_output # 14.9ms -> 3.43ms (334% faster)
    elapsed = time.time() - start
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-cashflows_handler-mhb1xgoa and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 28, 2025 21:01
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash and 🎯 Quality: High labels Oct 28, 2025