⚡️ Speed up function extract_sentences by 44%
#275
📄 44% (0.44x) speedup for `extract_sentences` in `stanza/utils/datasets/ner/convert_he_iahlt.py`
⏱️ Runtime: 4.54 milliseconds → 3.14 milliseconds (best of 329 runs)
📝 Explanation and details
The optimized code achieves a 44% speedup through several targeted micro-optimizations that reduce overhead in critical hot paths:

Key optimizations:
- Precompiled regex pattern - `_RE_ENTITY_SPLIT = re.compile(r"([()])")` eliminates repeated regex compilation overhead. The line profiler shows this saves significant time in the entity parsing loop.
- Batched print output - In `output_entities`, instead of printing each entity individually (57.2% of original time), entities are collected and printed once with `print("\n".join(entities))`. This reduces I/O overhead from multiple print calls to a single call.
- String optimization with `partition()` - Replaced `piece.split("=", maxsplit=1)[1]` with `_, _, entity = piece.partition("=")` for faster single-delimiter splitting.
- Early filtering - Added `if "Entity=" not in misc: continue` to skip expensive splitting when no entities are present, avoiding unnecessary work on non-entity words.
- Method localization - Stored `words.append` as `append_word` to avoid repeated attribute lookups in tight loops, reducing per-iteration overhead.
- Optimized list operations - Used `current_entity.pop()` instead of `current_entity[:-1]` slicing, which is more efficient for stack-like operations.

Performance characteristics:
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-extract_sentences-mh9sqvn0` and push.