⚡️ Speed up method LSTMwRecDropout.forward by 56%
#263
📄 **56% (0.56x) speedup** for `LSTMwRecDropout.forward` in `stanza/models/common/packed_lstm.py`

⏱️ **Runtime:** 180 milliseconds → 116 milliseconds (best of 50 runs)

📝 **Explanation and details**
The optimized code achieves a ~55% speedup through several key performance improvements:

**1. Reduced Attribute Lookups in Loops**

The optimization caches frequently accessed attributes (`self.num_layers`, `self.num_directions`, `self.cells`, etc.) as local variables before the main loops. This eliminates repeated attribute lookups on the hot path, reducing overhead in the nested loops that process each layer and direction.

**2. Optimized State Management in `rnn_loop`**

- Fewer `unsqueeze(0)` operations: the original code called `unsqueeze(0)` on each state update within the loop. The optimized version uses `split(1, 0)`, which already returns tensors with the correct leading dimension, removing unnecessary tensor operations.
- Simplified slicing: `x[st:st+bs]` becomes `x[st:end]` with a pre-calculated `end = st + bs`, reducing repeated arithmetic in the inner loop.

**3. Reduced Generator Expression Overhead**

The optimized version pre-computes `hx_is_not_none = hx is not None` and creates the generator expressions outside the critical path, avoiding repeated conditional checks and generator creation during each cell computation.

**4. Better Memory Access Patterns**

The optimized code groups related operations more efficiently, such as computing the `h` and `c` states together and applying the recurrent dropout mask in a single operation, leading to better CPU cache utilization.

**Performance Impact by Test Case**

Tests with large batches (e.g. `test_forward_large_batch` with a batch size of 128) benefit most from the reduced attribute lookups. The line profiler shows the critical `rnn_loop` call time reduced from 214 ms to 142 ms (a 33% improvement), which drives the overall speedup since this call accounts for 98% of the execution time.

✅ **Correctness verification report**

🌀 Generated Regression Tests and Runtime
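The attribute-caching and slice-bound patterns from items 1 and 2 above can be sketched in isolation. This is a minimal, self-contained illustration of the looping idiom, not the actual stanza code; the class and method names here are hypothetical stand-ins:

```python
# Illustrative sketch of two of the loop-level optimizations described
# above: hoisting attribute lookups into locals and pre-computing the
# slice end outside the hot expression. TinyRNN is a made-up stand-in.
class TinyRNN:
    def __init__(self, num_layers, num_directions):
        self.num_layers = num_layers
        self.num_directions = num_directions
        self.cells = [[(l, d) for d in range(num_directions)]
                      for l in range(num_layers)]

    def forward_naive(self, x):
        out = []
        for l in range(self.num_layers):          # attribute lookup each iteration
            for d in range(self.num_directions):  # and again here
                st = l * self.num_directions + d
                bs = 1
                out.append((self.cells[l][d], x[st:st + bs]))  # st + bs recomputed
        return out

    def forward_cached(self, x):
        # Hoist the attribute lookups out of the nested loops.
        num_layers = self.num_layers
        num_directions = self.num_directions
        cells = self.cells
        out = []
        for l in range(num_layers):
            for d in range(num_directions):
                st = l * num_directions + d
                bs = 1
                end = st + bs                     # pre-computed slice bound
                out.append((cells[l][d], x[st:end]))
        return out

rnn = TinyRNN(2, 2)
data = list(range(8))
assert rnn.forward_naive(data) == rnn.forward_cached(data)
```

Both variants produce identical results; the cached version simply does less per-iteration work, which is what the optimization exploits inside the nested layer/direction loops.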
To edit these changes, run `git checkout codeflash/optimize-LSTMwRecDropout.forward-mh9mo6l1` and push.
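The `split(1, 0)` point in item 2 above can also be illustrated concretely. This sketch uses NumPy for portability rather than PyTorch; `torch.Tensor.split(1, 0)` behaves analogously on tensors:

```python
import numpy as np

# Why split-style state unpacking avoids per-step re-expansion:
# splitting along axis 0 keeps the leading axis, whereas plain indexing
# drops it and forces an unsqueeze-like re-expansion on every update.
state = np.arange(24).reshape(2, 3, 4)  # (num_states, batch, hidden)

# "unsqueeze" style: indexing drops the leading axis, so it must be re-added.
h_unsq = state[0][None, ...]            # (3, 4) re-expanded to (1, 3, 4)

# "split" style: each piece already has the leading axis intact.
h_split, c_split = np.split(state, 2, axis=0)  # each is (1, 3, 4)

assert h_unsq.shape == h_split.shape == (1, 3, 4)
assert (h_unsq == h_split).all()
```

The two forms are numerically identical; the split form just skips one tensor op per state update, which adds up inside the per-timestep loop.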