Conversation
codeflash-ai bot commented Oct 27, 2025

📄 56% (0.56x) speedup for `LSTMwRecDropout.forward` in `stanza/models/common/packed_lstm.py`

⏱️ Runtime : 180 milliseconds → 116 milliseconds (best of 50 runs)

📝 Explanation and details

The optimized code achieves a **55% speedup** through several key performance improvements:

**1. Reduced Attribute Lookups in Loops**
The optimization caches frequently accessed attributes (`self.num_layers`, `self.num_directions`, `self.cells`, etc.) as local variables before the main loops. This eliminates repeated attribute lookups during the hot-path execution, reducing overhead in the nested loops that process each layer and direction.
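A minimal sketch of this pattern, simplified rather than taken from `packed_lstm.py` (the class body and loop are illustrative):

```python
import torch.nn as nn

class Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.num_layers, self.num_directions = 2, 2
        self.cells = nn.ModuleList(
            nn.LSTMCell(8, 8) for _ in range(self.num_layers * self.num_directions)
        )

    def loop_original(self):
        # self.* attributes are re-resolved on every iteration of the hot loop
        for l in range(self.num_layers):
            for d in range(self.num_directions):
                cell = self.cells[l * self.num_directions + d]  # cell(...) would run here

    def loop_optimized(self):
        # hoist the lookups once; the loop body then touches only fast locals
        num_layers, num_directions = self.num_layers, self.num_directions
        cells = self.cells
        for l in range(num_layers):
            for d in range(num_directions):
                cell = cells[l * num_directions + d]  # cell(...) would run here
```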

**2. Optimized State Management in `rnn_loop`**

- **Eliminated redundant `unsqueeze(0)` operations**: The original code called `unsqueeze(0)` on each state update within the loop. The optimized version uses `split(1, 0)`, which already returns tensors with the correct dimension, removing unnecessary tensor operations.
- **More efficient tensor slicing**: Changed from `x[st:st+bs]` to `x[st:end]` with a pre-calculated `end = st + bs`, reducing repeated arithmetic in the inner loop. Both changes are sketched below.
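A hedged illustration of both bullets; the tensor shapes are assumptions, not taken from the stanza source:

```python
import torch

h0 = torch.randn(4, 3, 5)  # assumed (num_states, batch, hidden)

# original pattern: index each state, then restore the leading dim
old_states = [h0[i].unsqueeze(0) for i in range(h0.size(0))]

# optimized pattern: split(1, 0) already yields (1, batch, hidden) views
new_states = list(h0.split(1, 0))
assert all(torch.equal(a, b) for a, b in zip(old_states, new_states))

# slicing: compute the end index once per step instead of inside the slice
x, st, bs = torch.randn(10, 5), 2, 4
end = st + bs
chunk = x[st:end]  # rather than x[st:st + bs] in the inner loop
assert torch.equal(chunk, x[st:st + bs])
```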

**3. Reduced Generator Expression Overhead**
The optimized version pre-computes `hx_is_not_none = hx is not None` and creates the generator expressions outside the critical path, avoiding repeated conditional checks and generator creation during each cell computation.
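As a rough sketch of this hoisting (the helper name, its arguments, and the state layout are illustrative, not the real `rnn_loop` signature):

```python
import torch

def init_states(hx, num_cells, batch, hidden):
    hx_is_not_none = hx is not None  # evaluated once, not per cell
    if hx_is_not_none:
        h_iter = (hx[0][i] for i in range(num_cells))
        c_iter = (hx[1][i] for i in range(num_cells))
    else:
        zero = torch.zeros(batch, hidden)
        h_iter = (zero for _ in range(num_cells))
        c_iter = (zero for _ in range(num_cells))
    # the generators are built a single time and consumed as the cells run
    return list(zip(h_iter, c_iter))

states = init_states(None, num_cells=4, batch=2, hidden=8)
assert len(states) == 4
```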

**4. Better Memory Access Patterns**
The optimized code groups related operations more efficiently, such as computing the `h` and `c` states together and applying the recurrent dropout mask in a single operation, leading to better CPU cache utilization.
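A small sketch of the grouped update; the cell size and dropout rate are assumptions:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(8, 16)
inp = torch.randn(4, 8)

h, c = cell(inp)  # h and c come back together from the cell
keep = 0.5        # assumed recurrent-dropout keep probability
mask = torch.bernoulli(torch.full_like(h, keep)) / keep  # inverted-dropout mask
h = h * mask      # one vectorized masking op over the whole hidden state
```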

**Performance Impact by Test Case:**

- **Large batch tests** (like `test_forward_large_batch` with batch size 128) benefit most from the reduced attribute lookups
- **Multi-layer tests** see significant gains from the optimized state management
- **Bidirectional tests** benefit from both the reduced overhead and the better memory access patterns
- **Edge cases** with small sequences still see improvements, but with diminished relative gains

The line profiler shows the critical `rnn_loop` call time reduced from 214 ms to 142 ms (a 33% improvement), which drives the overall speedup since this call accounts for 98% of the execution time.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 47 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
# function to test
import torch
import torch.nn as nn
from stanza.models.common.packed_lstm import LSTMwRecDropout
from torch.nn.utils.rnn import (PackedSequence, pack_padded_sequence,
                                pad_packed_sequence)

# unit tests

# ----------- BASIC TEST CASES -----------

def test_forward_basic_single_layer_unidirectional():
    # Test with a single layer, unidirectional, no dropout
    input_size = 3
    hidden_size = 5
    num_layers = 1
    batch_size = 2
    seq_lengths = [4, 2]  # first sequence: length 4, second: length 2

    # Create random input
    x = torch.randn(batch_size, max(seq_lengths), input_size)
    # Pack the sequence
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)

    # Instantiate LSTMwRecDropout
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    # Forward pass
    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_basic_multi_layer_bidirectional():
    # Test with two layers, bidirectional, no dropout
    input_size = 4
    hidden_size = 6
    num_layers = 2
    batch_size = 3
    seq_lengths = [5, 3, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)

    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_basic_with_initial_hidden():
    # Test with provided initial hidden and cell states
    input_size = 2
    hidden_size = 4
    num_layers = 1
    batch_size = 2
    seq_lengths = [3, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)

    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    # Initial hidden and cell state
    h0 = torch.randn(num_layers, batch_size, hidden_size)
    c0 = torch.randn(num_layers, batch_size, hidden_size)
    hx = (h0, c0)

    output, (h_n, c_n) = lstm.forward(packed, hx=hx)

# ----------- EDGE TEST CASES -----------


def test_forward_edge_single_time_step():
    # Test with one time step
    input_size = 2
    hidden_size = 3
    num_layers = 1
    batch_size = 1
    seq_lengths = [1]

    x = torch.randn(batch_size, 1, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_edge_all_sequences_same_length():
    # All sequences same length
    input_size = 3
    hidden_size = 4
    num_layers = 2
    batch_size = 4
    seq_lengths = [5, 5, 5, 5]

    x = torch.randn(batch_size, 5, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_edge_dropout_and_rec_dropout():
    # Test with dropout and recurrent dropout enabled
    input_size = 2
    hidden_size = 3
    num_layers = 2
    batch_size = 2
    seq_lengths = [4, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, dropout=0.5, rec_dropout=0.5)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_edge_bidirectional_and_dropout():
    # Test with bidirectional and dropout
    input_size = 3
    hidden_size = 4
    num_layers = 2
    batch_size = 2
    seq_lengths = [3, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True, dropout=0.3)

    output, (h_n, c_n) = lstm.forward(packed)


def test_forward_edge_non_sorted_sequences():
    # Sequences not sorted by length, enforce_sorted=False
    input_size = 2
    hidden_size = 3
    num_layers = 1
    batch_size = 3
    seq_lengths = [2, 3, 1]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

# ----------- LARGE SCALE TEST CASES -----------

def test_forward_large_batch_and_sequence():
    # Large batch and sequence length, but within memory limits
    input_size = 8
    hidden_size = 16
    num_layers = 2
    batch_size = 32
    seq_lengths = [50] * batch_size  # All sequences of length 50

    x = torch.randn(batch_size, 50, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_large_hidden_size():
    # Large hidden size, but within 100MB tensor limit
    input_size = 4
    hidden_size = 128  # 128*32*2*4 bytes = 32KB per layer
    num_layers = 2
    batch_size = 16
    seq_lengths = [20] * batch_size

    x = torch.randn(batch_size, 20, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_large_bidirectional():
    # Large batch, bidirectional, multi-layer
    input_size = 6
    hidden_size = 64
    num_layers = 3
    batch_size = 10
    seq_lengths = [30] * batch_size

    x = torch.randn(batch_size, 30, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_large_random_lengths():
    # Large batch with random sequence lengths
    input_size = 5
    hidden_size = 32
    num_layers = 2
    batch_size = 50
    seq_lengths = [torch.randint(5, 20, (1,)).item() for _ in range(batch_size)]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
# function to test
import torch
import torch.nn as nn
from stanza.models.common.packed_lstm import LSTMwRecDropout
from torch.nn.utils.rnn import (PackedSequence, pack_padded_sequence,
                                pad_packed_sequence)

# unit tests

# ----------- BASIC TEST CASES -----------




def test_forward_basic_with_hx():
    # Provide initial hidden/cell state
    input_size, hidden_size, num_layers = 2, 3, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    batch = 2
    hx = (
        torch.zeros(num_layers * lstm.num_directions, batch, hidden_size),
        torch.zeros(num_layers * lstm.num_directions, batch, hidden_size)
    )
    data = torch.tensor([
        [1., 2.], [3., 4.], [5., 6.], [7., 8.],  # seq 1
        [9., 10.], [11., 12.], [13., 14.], [15., 16.]  # seq 2
    ])
    lengths = torch.tensor([4, 4])
    packed = pack_padded_sequence(data.view(2, 4, 2), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed, hx=hx)
    # States shape
    for s in states:
        pass

# ----------- EDGE TEST CASES -----------


def test_forward_edge_one_step_sequence():
    # Test with batch of sequences, one of which is length 1
    input_size, hidden_size, num_layers = 2, 3, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    data = torch.tensor([
        [1., 2.], [3., 4.], [5., 6.],   # seq 1
        [7., 8.], [9., 10.], [11., 12.],  # seq 2
        [13., 14.], [15., 16.], [17., 18.] # seq 3
    ])
    lengths = torch.tensor([3, 3, 1])
    packed = pack_padded_sequence(data.view(3, 3, 2), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    # States shape
    for s in states:
        pass

def test_forward_edge_dropout_and_rec_dropout():
    # Test with nonzero dropout and recurrent dropout
    input_size, hidden_size, num_layers = 2, 3, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, dropout=0.5, rec_dropout=0.5)
    data = torch.tensor([
        [1., 2.], [3., 4.], [5., 6.], [7., 8.]
    ])
    lengths = torch.tensor([4])
    packed = pack_padded_sequence(data.view(1, 4, 2), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_edge_large_hidden_size():
    # Test with large hidden size
    input_size, hidden_size, num_layers = 2, 128, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    data = torch.randn(5, input_size)
    lengths = torch.tensor([5])
    packed = pack_padded_sequence(data.view(1, 5, input_size), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_edge_large_num_layers():
    # Test with large number of layers
    input_size, hidden_size, num_layers = 2, 8, 8
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    data = torch.randn(4, input_size)
    lengths = torch.tensor([4])
    packed = pack_padded_sequence(data.view(1, 4, input_size), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_edge_bidirectional_multi_layer():
    # Test with bidirectional and multiple layers
    input_size, hidden_size, num_layers = 2, 4, 3
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, bidirectional=True)
    data = torch.randn(5, input_size)
    lengths = torch.tensor([5])
    packed = pack_padded_sequence(data.view(1, 5, input_size), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

# ----------- LARGE SCALE TEST CASES -----------

def test_forward_large_batch():
    # Test with large batch size
    input_size, hidden_size, num_layers = 8, 16, 2
    batch = 128
    seq_len = 10
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_large_seq_len():
    # Test with large sequence length
    input_size, hidden_size, num_layers = 4, 8, 1
    batch = 4
    seq_len = 250
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    output, states = lstm.forward(packed)

def test_forward_large_hidden_and_layers():
    # Test with large hidden size and number of layers, but keep under 100MB
    input_size, hidden_size, num_layers = 16, 32, 4
    batch = 32
    seq_len = 16
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, dropout=0.2, rec_dropout=0.2)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_large_bidirectional():
    # Test with large batch, bidirectional
    input_size, hidden_size, num_layers = 8, 16, 2
    batch = 64
    seq_len = 16
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, bidirectional=True)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_large_varied_lengths():
    # Test with large batch and varied sequence lengths
    input_size, hidden_size, num_layers = 8, 16, 2
    batch = 100
    seq_lens = torch.randint(1, 20, (batch,))
    max_len = seq_lens.max().item()
    data = torch.zeros(batch, max_len, input_size)
    for i, l in enumerate(seq_lens):
        data[i, :l] = torch.randn(l, input_size)
    packed = pack_padded_sequence(data, seq_lens, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    output, states = lstm.forward(packed)
    for s in states:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-LSTMwRecDropout.forward-mh9mo6l1` and push.

Codeflash

codeflash-ai bot requested a review from mashraf-222 October 27, 2025 21:06
codeflash-ai bot added the ⚡️ codeflash and 🎯 Quality: Medium labels Oct 27, 2025