Conversation
codeflash-ai bot commented Oct 27, 2025

📄 56% (0.56x) speedup for `LSTMwRecDropout.forward` in `stanza/models/common/packed_lstm.py`

⏱️ Runtime : 180 milliseconds → 116 milliseconds (best of 50 runs)

📝 Explanation and details

The optimized code achieves a **55% speedup** through several key performance improvements:

**1. Reduced Attribute Lookups in Loops**
The optimization caches frequently accessed attributes (`self.num_layers`, `self.num_directions`, `self.cells`, etc.) as local variables before the main loops. This eliminates repeated attribute lookups during the hot-path execution, reducing overhead in the nested loops that process each layer and direction.
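A minimal sketch of this pattern, simplified rather than taken from `packed_lstm.py` (the class body and loop are illustrative):

```python
import torch.nn as nn

class Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.num_layers, self.num_directions = 2, 2
        self.cells = nn.ModuleList(
            nn.LSTMCell(8, 8) for _ in range(self.num_layers * self.num_directions)
        )

    def loop_original(self):
        # self.* attributes are re-resolved on every iteration of the hot loop
        for l in range(self.num_layers):
            for d in range(self.num_directions):
                cell = self.cells[l * self.num_directions + d]  # cell(...) would run here

    def loop_optimized(self):
        # hoist the lookups once; the loop body then touches only fast locals
        num_layers, num_directions = self.num_layers, self.num_directions
        cells = self.cells
        for l in range(num_layers):
            for d in range(num_directions):
                cell = cells[l * num_directions + d]  # cell(...) would run here
```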

**2. Optimized State Management in `rnn_loop`**

- **Eliminated redundant `unsqueeze(0)` operations**: The original code called `unsqueeze(0)` on each state update within the loop. The optimized version uses `split(1, 0)`, which already returns tensors with the correct dimension, removing unnecessary tensor operations.
- **More efficient tensor slicing**: Changed from `x[st:st+bs]` to `x[st:end]` with a pre-calculated `end = st + bs`, reducing repeated arithmetic in the inner loop. Both changes are sketched below.
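A hedged illustration of both bullets; the tensor shapes are assumptions, not taken from the stanza source:

```python
import torch

h0 = torch.randn(4, 3, 5)  # assumed (num_states, batch, hidden)

# original pattern: index each state, then restore the leading dim
old_states = [h0[i].unsqueeze(0) for i in range(h0.size(0))]

# optimized pattern: split(1, 0) already yields (1, batch, hidden) views
new_states = list(h0.split(1, 0))
assert all(torch.equal(a, b) for a, b in zip(old_states, new_states))

# slicing: compute the end index once per step instead of inside the slice
x, st, bs = torch.randn(10, 5), 2, 4
end = st + bs
chunk = x[st:end]  # rather than x[st:st + bs] in the inner loop
assert torch.equal(chunk, x[st:st + bs])
```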

**3. Reduced Generator Expression Overhead**
The optimized version pre-computes `hx_is_not_none = hx is not None` and creates the generator expressions outside the critical path, avoiding repeated conditional checks and generator creation during each cell computation.
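As a rough sketch of this hoisting (the helper name, its arguments, and the state layout are illustrative, not the real `rnn_loop` signature):

```python
import torch

def init_states(hx, num_cells, batch, hidden):
    hx_is_not_none = hx is not None  # evaluated once, not per cell
    if hx_is_not_none:
        h_iter = (hx[0][i] for i in range(num_cells))
        c_iter = (hx[1][i] for i in range(num_cells))
    else:
        zero = torch.zeros(batch, hidden)
        h_iter = (zero for _ in range(num_cells))
        c_iter = (zero for _ in range(num_cells))
    # the generators are built a single time and consumed as the cells run
    return list(zip(h_iter, c_iter))

states = init_states(None, num_cells=4, batch=2, hidden=8)
assert len(states) == 4
```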

**4. Better Memory Access Patterns**
The optimized code groups related operations more efficiently, such as computing the `h` and `c` states together and applying the recurrent dropout mask in a single operation, leading to better CPU cache utilization.
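A small sketch of the grouped update; the cell size and dropout rate are assumptions:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(8, 16)
inp = torch.randn(4, 8)

h, c = cell(inp)  # h and c come back together from the cell
keep = 0.5        # assumed recurrent-dropout keep probability
mask = torch.bernoulli(torch.full_like(h, keep)) / keep  # inverted-dropout mask
h = h * mask      # one vectorized masking op over the whole hidden state
```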

**Performance Impact by Test Case:**

- **Large batch tests** (like `test_forward_large_batch` with batch size 128) benefit most from the reduced attribute lookups
- **Multi-layer tests** see significant gains from the optimized state management
- **Bidirectional tests** benefit from both the reduced overhead and the better memory access patterns
- **Edge cases** with small sequences still see improvements, but with diminished relative gains

The line profiler shows the critical `rnn_loop` call time reduced from 214 ms to 142 ms (a 33% improvement), which drives the overall speedup since this call accounts for 98% of the execution time.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 47 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
# function to test
import torch
import torch.nn as nn
from stanza.models.common.packed_lstm import LSTMwRecDropout
from torch.nn.utils.rnn import (PackedSequence, pack_padded_sequence,
                                pad_packed_sequence)

# unit tests

# ----------- BASIC TEST CASES -----------

def test_forward_basic_single_layer_unidirectional():
    # Test with a single layer, unidirectional, no dropout
    input_size = 3
    hidden_size = 5
    num_layers = 1
    batch_size = 2
    seq_lengths = [4, 2]  # first sequence: length 4, second: length 2

    # Create random input
    x = torch.randn(batch_size, max(seq_lengths), input_size)
    # Pack the sequence
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)

    # Instantiate LSTMwRecDropout
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    # Forward pass
    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_basic_multi_layer_bidirectional():
    # Test with two layers, bidirectional, no dropout
    input_size = 4
    hidden_size = 6
    num_layers = 2
    batch_size = 3
    seq_lengths = [5, 3, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)

    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_basic_with_initial_hidden():
    # Test with provided initial hidden and cell states
    input_size = 2
    hidden_size = 4
    num_layers = 1
    batch_size = 2
    seq_lengths = [3, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)

    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    # Initial hidden and cell state
    h0 = torch.randn(num_layers, batch_size, hidden_size)
    c0 = torch.randn(num_layers, batch_size, hidden_size)
    hx = (h0, c0)

    output, (h_n, c_n) = lstm.forward(packed, hx=hx)

# ----------- EDGE TEST CASES -----------


def test_forward_edge_single_time_step():
    # Test with one time step
    input_size = 2
    hidden_size = 3
    num_layers = 1
    batch_size = 1
    seq_lengths = [1]

    x = torch.randn(batch_size, 1, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_edge_all_sequences_same_length():
    # All sequences same length
    input_size = 3
    hidden_size = 4
    num_layers = 2
    batch_size = 4
    seq_lengths = [5, 5, 5, 5]

    x = torch.randn(batch_size, 5, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_edge_dropout_and_rec_dropout():
    # Test with dropout and recurrent dropout enabled
    input_size = 2
    hidden_size = 3
    num_layers = 2
    batch_size = 2
    seq_lengths = [4, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, dropout=0.5, rec_dropout=0.5)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_edge_bidirectional_and_dropout():
    # Test with bidirectional and dropout
    input_size = 3
    hidden_size = 4
    num_layers = 2
    batch_size = 2
    seq_lengths = [3, 2]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True, dropout=0.3)

    output, (h_n, c_n) = lstm.forward(packed)


def test_forward_edge_non_sorted_sequences():
    # Sequences not sorted by length, enforce_sorted=False
    input_size = 2
    hidden_size = 3
    num_layers = 1
    batch_size = 3
    seq_lengths = [2, 3, 1]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

# ----------- LARGE SCALE TEST CASES -----------

def test_forward_large_batch_and_sequence():
    # Large batch and sequence length, but within memory limits
    input_size = 8
    hidden_size = 16
    num_layers = 2
    batch_size = 32
    seq_lengths = [50] * batch_size  # All sequences of length 50

    x = torch.randn(batch_size, 50, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_large_hidden_size():
    # Large hidden size, but within 100MB tensor limit
    input_size = 4
    hidden_size = 128  # 128*32*2*4 bytes = 32KB per layer
    num_layers = 2
    batch_size = 16
    seq_lengths = [20] * batch_size

    x = torch.randn(batch_size, 20, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_large_bidirectional():
    # Large batch, bidirectional, multi-layer
    input_size = 6
    hidden_size = 64
    num_layers = 3
    batch_size = 10
    seq_lengths = [30] * batch_size

    x = torch.randn(batch_size, 30, input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)

    output, (h_n, c_n) = lstm.forward(packed)

def test_forward_large_random_lengths():
    # Large batch with random sequence lengths
    input_size = 5
    hidden_size = 32
    num_layers = 2
    batch_size = 50
    seq_lengths = [torch.randint(5, 20, (1,)).item() for _ in range(batch_size)]

    x = torch.randn(batch_size, max(seq_lengths), input_size)
    packed = pack_padded_sequence(x, seq_lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, batch_first=True)

    output, (h_n, c_n) = lstm.forward(packed)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
# function to test
import torch
import torch.nn as nn
from stanza.models.common.packed_lstm import LSTMwRecDropout
from torch.nn.utils.rnn import (PackedSequence, pack_padded_sequence,
                                pad_packed_sequence)

# unit tests

# ----------- BASIC TEST CASES -----------




def test_forward_basic_with_hx():
    # Provide initial hidden/cell state
    input_size, hidden_size, num_layers = 2, 3, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    batch = 2
    hx = (
        torch.zeros(num_layers * lstm.num_directions, batch, hidden_size),
        torch.zeros(num_layers * lstm.num_directions, batch, hidden_size)
    )
    data = torch.tensor([
        [1., 2.], [3., 4.], [5., 6.], [7., 8.],  # seq 1
        [9., 10.], [11., 12.], [13., 14.], [15., 16.]  # seq 2
    ])
    lengths = torch.tensor([4, 4])
    packed = pack_padded_sequence(data.view(2, 4, 2), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed, hx=hx)
    # States shape
    for s in states:
        pass

# ----------- EDGE TEST CASES -----------


def test_forward_edge_one_step_sequence():
    # Test with batch of sequences, one of which is length 1
    input_size, hidden_size, num_layers = 2, 3, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    data = torch.tensor([
        [1., 2.], [3., 4.], [5., 6.],   # seq 1
        [7., 8.], [9., 10.], [11., 12.],  # seq 2
        [13., 14.], [15., 16.], [17., 18.] # seq 3
    ])
    lengths = torch.tensor([3, 3, 1])
    packed = pack_padded_sequence(data.view(3, 3, 2), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    # States shape
    for s in states:
        pass

def test_forward_edge_dropout_and_rec_dropout():
    # Test with nonzero dropout and recurrent dropout
    input_size, hidden_size, num_layers = 2, 3, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, dropout=0.5, rec_dropout=0.5)
    data = torch.tensor([
        [1., 2.], [3., 4.], [5., 6.], [7., 8.]
    ])
    lengths = torch.tensor([4])
    packed = pack_padded_sequence(data.view(1, 4, 2), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_edge_large_hidden_size():
    # Test with large hidden size
    input_size, hidden_size, num_layers = 2, 128, 1
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    data = torch.randn(5, input_size)
    lengths = torch.tensor([5])
    packed = pack_padded_sequence(data.view(1, 5, input_size), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_edge_large_num_layers():
    # Test with large number of layers
    input_size, hidden_size, num_layers = 2, 8, 8
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    data = torch.randn(4, input_size)
    lengths = torch.tensor([4])
    packed = pack_padded_sequence(data.view(1, 4, input_size), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_edge_bidirectional_multi_layer():
    # Test with bidirectional and multiple layers
    input_size, hidden_size, num_layers = 2, 4, 3
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, bidirectional=True)
    data = torch.randn(5, input_size)
    lengths = torch.tensor([5])
    packed = pack_padded_sequence(data.view(1, 5, input_size), lengths, batch_first=True, enforce_sorted=False)
    output, states = lstm.forward(packed)
    for s in states:
        pass

# ----------- LARGE SCALE TEST CASES -----------

def test_forward_large_batch():
    # Test with large batch size
    input_size, hidden_size, num_layers = 8, 16, 2
    batch = 128
    seq_len = 10
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_large_seq_len():
    # Test with large sequence length
    input_size, hidden_size, num_layers = 4, 8, 1
    batch = 4
    seq_len = 250
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    output, states = lstm.forward(packed)

def test_forward_large_hidden_and_layers():
    # Test with large hidden size and number of layers, but keep under 100MB
    input_size, hidden_size, num_layers = 16, 32, 4
    batch = 32
    seq_len = 16
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, dropout=0.2, rec_dropout=0.2)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_large_bidirectional():
    # Test with large batch, bidirectional
    input_size, hidden_size, num_layers = 8, 16, 2
    batch = 64
    seq_len = 16
    data = torch.randn(batch, seq_len, input_size)
    lengths = torch.full((batch,), seq_len, dtype=torch.long)
    packed = pack_padded_sequence(data, lengths, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers, bidirectional=True)
    output, states = lstm.forward(packed)
    for s in states:
        pass

def test_forward_large_varied_lengths():
    # Test with large batch and varied sequence lengths
    input_size, hidden_size, num_layers = 8, 16, 2
    batch = 100
    seq_lens = torch.randint(1, 20, (batch,))
    max_len = seq_lens.max().item()
    data = torch.zeros(batch, max_len, input_size)
    for i, l in enumerate(seq_lens):
        data[i, :l] = torch.randn(l, input_size)
    packed = pack_padded_sequence(data, seq_lens, batch_first=True, enforce_sorted=False)
    lstm = LSTMwRecDropout(input_size, hidden_size, num_layers)
    output, states = lstm.forward(packed)
    for s in states:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-LSTMwRecDropout.forward-mh9mo6l1` and push.

Codeflash

codeflash-ai bot requested a review from mashraf-222 October 27, 2025 21:06
codeflash-ai bot added the ⚡️ codeflash and 🎯 Quality: Medium labels Oct 27, 2025