⚡️ Speed up function `extract_sentences` by 44% #275

codeflash-ai · 2025-10-27T23:56:34Z

📄 44% (0.44x) speedup for `extract_sentences` in `stanza/utils/datasets/ner/convert_he_iahlt.py`

⏱️ Runtime : 4.54 milliseconds → 3.14 milliseconds (best of 329 runs)
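
The headline percentage can be reproduced from the two runtimes above. This is a minimal sketch that assumes the speedup is measured relative to the optimized runtime; that definition is inferred from the reported numbers and is not stated in this PR. The variable names are illustrative only.

```python
# Runtimes reported in the PR, in milliseconds.
original_ms = 4.54
optimized_ms = 3.14

# Assumed definition: time saved, relative to the optimized runtime.
speedup = (original_ms - optimized_ms) / optimized_ms

# The same saving expressed as a share of the original runtime.
runtime_reduction = (original_ms - optimized_ms) / original_ms

print(f"speedup: {speedup:.1%}, runtime reduction: {runtime_reduction:.1%}")
```

Under that assumption, the speedup is a little over 44%, while the wall-clock time itself drops by roughly 31%.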

📝 Explanation and details

The optimized code achieves a 44% speedup through several targeted micro-optimizations that reduce overhead in critical hot paths:

**Key optimizations:**

1. **Precompiled regex pattern** - `_RE_ENTITY_SPLIT = re.compile(r"([()])")` is compiled once at module level, so the entity parsing loop no longer pays the per-call pattern lookup (or recompilation on a cache miss) inside the `re` module. The line profiler shows this saves significant time in the entity parsing loop.
2. **Batched print output** - In `output_entities`, instead of printing each entity individually (57.2% of original time), entities are collected and printed once with `print("\n".join(entities))`. This reduces I/O overhead from multiple print calls to a single call.
3. **String optimization with `partition()`** - Replaced `piece.split("=", maxsplit=1)[1]` with `_, _, entity = piece.partition("=")`, which splits on the first delimiter without building an intermediate list.
4. **Early filtering** - Added `if "Entity=" not in misc: continue` to skip splitting when no entities are present, avoiding unnecessary work on non-entity words.
5. **Method localization** - Stored `words.append` as `append_word` to avoid repeated attribute lookups in tight loops, reducing per-iteration overhead.
6. **Optimized list operations** - Used `current_entity.pop()` instead of `current_entity[:-1]` slicing. `pop()` removes the last element in place, whereas the slice copies the list.
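
The parsing-side optimizations listed above can be illustrated in isolation. This is a minimal sketch, not the code from `convert_he_iahlt.py`: only `_RE_ENTITY_SPLIT` and the `append_word` alias are taken from the description, while `entity_field`, `entity_tokens`, `close_innermost`, and `tag_plain_words` are hypothetical helpers invented for illustration.

```python
import re

# Compiled once at import time; the pattern is quoted from the PR description.
_RE_ENTITY_SPLIT = re.compile(r"([()])")


def entity_field(misc):
    """Return the value of the Entity= field of a MISC string, or None."""
    # Early filter: skip the "|" split entirely for words with no entity.
    if misc is None or "Entity=" not in misc:
        return None
    for piece in misc.split("|"):
        if piece.startswith("Entity="):
            # partition() splits on the first "=" and returns a 3-tuple.
            _, _, entity = piece.partition("=")
            return entity
    return None


def entity_tokens(entity):
    """Split an entity value into bracket and label tokens, dropping empties."""
    return [tok for tok in _RE_ENTITY_SPLIT.split(entity) if tok]


def close_innermost(stack):
    """Close the innermost open entity on a stack of labels."""
    # pop() mutates the list in place; stack[:-1] would allocate a copy.
    return stack.pop()


def tag_plain_words(texts):
    """Tag every word as "O", binding the append method once (hypothetical)."""
    words = []
    append_word = words.append  # one attribute lookup instead of one per word
    for text in texts:
        append_word((text, "O"))
    return words


print(entity_field("SomeField=foo|Entity=(LOC)LOC|OtherField=bar"))
print(entity_tokens(")PER)TITLE"))
```

Each helper isolates one change so its effect can be timed separately, for example with `timeit`. The real function interleaves these steps inside a single loop.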

**Performance characteristics:**

- Most effective on documents with many non-entity words (benefits from early filtering)
- Particularly good for documents with frequent entity annotations (benefits from batched printing)
- The regex precompilation helps most when processing complex nested entities
- All test cases show consistent speedups, with larger documents seeing proportionally better gains due to reduced per-iteration overhead
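
The batched printing mentioned above can be checked for output equivalence with a small harness. This is a sketch under assumptions: `print_each`, `print_batched`, and `captured` are hypothetical names, and the body of the real `output_entities` is not shown in this PR description.

```python
import io
from contextlib import redirect_stdout


def print_each(entities):
    """Baseline: one print call per entity."""
    for entity in entities:
        print(entity)


def print_batched(entities):
    """Batched: join first, then issue a single print call."""
    # Guard for the empty case: print("\n".join([])) would emit a lone
    # newline, which the per-entity loop does not.
    if entities:
        print("\n".join(entities))


def captured(fn, entities):
    """Run fn(entities) and return everything it wrote to stdout."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        fn(entities)
    return buf.getvalue()


sample = ["PER", "LOC", "TITLE"]
print(captured(print_each, sample) == captured(print_batched, sample))
```

The empty-list guard is a design choice of this sketch. Whether the optimized `output_entities` needs one depends on whether it can be called with no entities, which the description does not say.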

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 35 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import re

# imports
import pytest
from stanza.utils.datasets.ner.convert_he_iahlt import extract_sentences

# --- Test helpers: minimal mock classes for doc/sentence/word ---

class MockWord:
    def __init__(self, text, misc):
        self.text = text
        self.misc = misc

class MockSentence:
    def __init__(self, words, sent_id="1"):
        self.words = words
        self.sent_id = sent_id

class MockDoc:
    def __init__(self, sentences):
        self.sentences = sentences

# --- Basic Test Cases ---

def test_single_sentence_no_entities():
    # Sentence: "Hello world ." (no entities)
    words = [MockWord("Hello", None), MockWord("world", None), MockWord(".", None)]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # All words should be labeled "O"
    expected = [[("Hello", "O"), ("world", "O"), (".", "O")]]
    codeflash_output = extract_sentences(doc)

def test_single_sentence_single_entity():
    # Sentence: "Barack Obama visited Paris ." (Barack Obama = PER)
    words = [
        MockWord("Barack", "Entity=(PER"),
        MockWord("Obama", "Entity=)PER"),
        MockWord("visited", None),
        MockWord("Paris", None),
        MockWord(".", None)
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Only "Barack" and "Obama" are tagged as PER
    expected = [[
        ("Barack", "B-PER"),
        ("Obama", "I-PER"),
        ("visited", "O"),
        ("Paris", "O"),
        (".", "O")
    ]]
    codeflash_output = extract_sentences(doc)

def test_single_sentence_entity_at_end():
    # Sentence: "He lives in Paris" (Paris = LOC)
    words = [
        MockWord("He", None),
        MockWord("lives", None),
        MockWord("in", None),
        MockWord("Paris", "Entity=(LOC)LOC"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    expected = [[
        ("He", "O"),
        ("lives", "O"),
        ("in", "O"),
        ("Paris", "B-LOC")
    ]]
    codeflash_output = extract_sentences(doc)

def test_multiple_sentences():
    # Two sentences, one with entity, one without
    words1 = [
        MockWord("John", "Entity=(PER)PER"),
        MockWord("runs", None)
    ]
    words2 = [
        MockWord("The", None),
        MockWord("dog", None)
    ]
    doc = MockDoc([MockSentence(words1, sent_id="1"), MockSentence(words2, sent_id="2")])
    expected = [
        [("John", "B-PER"), ("runs", "O")],
        [("The", "O"), ("dog", "O")]
    ]
    codeflash_output = extract_sentences(doc)

def test_entity_within_sentence():
    # Sentence: "The city of Paris is beautiful ." (Paris = LOC)
    words = [
        MockWord("The", None),
        MockWord("city", None),
        MockWord("of", None),
        MockWord("Paris", "Entity=(LOC)LOC"),
        MockWord("is", None),
        MockWord("beautiful", None),
        MockWord(".", None)
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    expected = [[
        ("The", "O"),
        ("city", "O"),
        ("of", "O"),
        ("Paris", "B-LOC"),
        ("is", "O"),
        ("beautiful", "O"),
        (".", "O")
    ]]
    codeflash_output = extract_sentences(doc)

# --- Edge Test Cases ---

def test_nested_entities():
    # Sentence: "The [President [Barack Obama]]" (President=TITLE, Barack Obama=PER, nested)
    words = [
        MockWord("The", None),
        MockWord("President", "Entity=(TITLE"),
        MockWord("Barack", "Entity=(PER"),
        MockWord("Obama", "Entity=)PER)TITLE"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Should label outermost entity only
    expected = [[
        ("The", "O"),
        ("President", "B-TITLE"),
        ("Barack", "I-TITLE"),
        ("Obama", "I-TITLE"),
    ]]
    codeflash_output = extract_sentences(doc)

def test_adjacent_entities():
    # Sentence: "Barack Obama and Angela Merkel" (Barack Obama=PER, Angela Merkel=PER)
    words = [
        MockWord("Barack", "Entity=(PER"),
        MockWord("Obama", "Entity=)PER"),
        MockWord("and", None),
        MockWord("Angela", "Entity=(PER"),
        MockWord("Merkel", "Entity=)PER"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    expected = [[
        ("Barack", "B-PER"),
        ("Obama", "I-PER"),
        ("and", "O"),
        ("Angela", "B-PER"),
        ("Merkel", "I-PER"),
    ]]
    codeflash_output = extract_sentences(doc)

def test_entity_without_close():
    # Sentence: "Barack Obama is president" (Barack = PER, missing close)
    words = [
        MockWord("Barack", "Entity=(PER"),
        MockWord("Obama", None),
        MockWord("is", None),
        MockWord("president", None)
    ]
    sentence = MockSentence(words, sent_id="42")
    doc = MockDoc([sentence])
    # Should skip this sentence due to assertion error (unclosed entity)
    # extract_sentences returns only sentences that do not raise
    codeflash_output = extract_sentences(doc)

def test_entity_with_extra_close():
    # Sentence: "Barack )PER Obama" (close without open)
    words = [
        MockWord("Barack", "Entity=)PER"),
        MockWord("Obama", None)
    ]
    sentence = MockSentence(words, sent_id="99")
    doc = MockDoc([sentence])
    # Should skip this sentence due to assertion error (close without open)
    codeflash_output = extract_sentences(doc)

def test_entity_with_wrong_close():
    # Sentence: "Barack (PER )LOC" (open PER, close LOC)
    words = [
        MockWord("Barack", "Entity=(PER)LOC"),
    ]
    sentence = MockSentence(words, sent_id="100")
    doc = MockDoc([sentence])
    # Should skip this sentence due to assertion error (closed wrong entity)
    codeflash_output = extract_sentences(doc)

def test_entity_with_multiple_annotations():
    # Sentence: "Barack Obama" (Barack = PER, Obama = LOC)
    words = [
        MockWord("Barack", "Entity=(PER)PER"),
        MockWord("Obama", "Entity=(LOC)LOC"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Only the first entity is tracked, so both should be B-<entity>
    expected = [[
        ("Barack", "B-PER"),
        ("Obama", "B-LOC"),
    ]]
    codeflash_output = extract_sentences(doc)

def test_entity_with_misc_other_fields():
    # Sentence: "Paris" (LOC) but misc has extra fields
    words = [
        MockWord("Paris", "SomeField=foo|Entity=(LOC)LOC|OtherField=bar"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    expected = [[("Paris", "B-LOC")]]
    codeflash_output = extract_sentences(doc)

def test_empty_sentence():
    # Sentence with no words
    sentence = MockSentence([])
    doc = MockDoc([sentence])
    expected = [[]]
    codeflash_output = extract_sentences(doc)

def test_sentence_with_only_misc_fields():
    # Sentence: "Hello" (misc but no Entity)
    words = [MockWord("Hello", "SomeField=bar")]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    expected = [[("Hello", "O")]]
    codeflash_output = extract_sentences(doc)

def test_sentence_with_none_misc_and_entity():
    # Sentence: "Hello" (misc=None)
    words = [MockWord("Hello", None)]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    expected = [[("Hello", "O")]]
    codeflash_output = extract_sentences(doc)

# --- Large Scale Test Cases ---

def test_many_sentences_and_words():
    # 100 sentences, each with 10 words, no entities
    sentences = []
    for i in range(100):
        words = [MockWord(f"word{j}", None) for j in range(10)]
        sentences.append(MockSentence(words, sent_id=str(i)))
    doc = MockDoc(sentences)
    expected = [[(f"word{j}", "O") for j in range(10)] for _ in range(100)]
    codeflash_output = extract_sentences(doc)

def test_long_sentence_with_entity():
    # Sentence: 100 words, entity from word 10 to word 20
    words = []
    for i in range(100):
        if i == 10:
            words.append(MockWord(f"word{i}", "Entity=(LONGENT"))
        elif i == 20:
            words.append(MockWord(f"word{i}", "Entity=)LONGENT"))
        else:
            words.append(MockWord(f"word{i}", None))
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    expected = []
    for i in range(100):
        if i == 10:
            expected.append((f"word{i}", "B-LONGENT"))
        elif 10 < i <= 20:
            expected.append((f"word{i}", "I-LONGENT"))
        else:
            expected.append((f"word{i}", "O"))
    codeflash_output = extract_sentences(doc)

def test_large_number_of_entities():
    # Sentence: 50 single-word entities, each followed by one non-entity word
    words = []
    expected = []
    for i in range(50):
        words.append(MockWord(f"entity{i}_1", "Entity=(E)E"))
        words.append(MockWord(f"entity{i}_2", None))
        expected.append((f"entity{i}_1", "B-E"))
        expected.append((f"entity{i}_2", "O"))
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    codeflash_output = extract_sentences(doc)

def test_large_sentence_with_nested_entities():
    # Sentence: 100 words, nested entity from 10-30 (A), inside that from 15-25 (B)
    words = []
    for i in range(100):
        if i == 10:
            words.append(MockWord(f"word{i}", "Entity=(A"))
        elif i == 15:
            words.append(MockWord(f"word{i}", "Entity=(B"))
        elif i == 25:
            words.append(MockWord(f"word{i}", "Entity=)B"))
        elif i == 30:
            words.append(MockWord(f"word{i}", "Entity=)A"))
        else:
            words.append(MockWord(f"word{i}", None))
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Only outermost entity is labeled
    expected = []
    for i in range(100):
        if i == 10:
            expected.append((f"word{i}", "B-A"))
        elif 10 < i <= 30:
            expected.append((f"word{i}", "I-A"))
        else:
            expected.append((f"word{i}", "O"))
    codeflash_output = extract_sentences(doc)

def test_large_sentence_with_all_entities():
    # Sentence: 100 words, each word is its own entity
    words = []
    expected = []
    for i in range(100):
        words.append(MockWord(f"word{i}", "Entity=(E)E"))
        expected.append((f"word{i}", "B-E"))
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    codeflash_output = extract_sentences(doc)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import re

# imports
import pytest
from stanza.utils.datasets.ner.convert_he_iahlt import extract_sentences


# Mocks for sentence and word objects
class MockWord:
    def __init__(self, text, misc=None):
        self.text = text
        self.misc = misc

class MockSentence:
    def __init__(self, words, sent_id="1"):
        self.words = words
        self.sent_id = sent_id

class MockDoc:
    def __init__(self, sentences):
        self.sentences = sentences

# -------------------- UNIT TESTS --------------------

# 1. BASIC TEST CASES

def test_single_sentence_no_entities():
    # Sentence: "Hello world"
    words = [MockWord("Hello"), MockWord("world")]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # No entities, should return 'O' for each word
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_single_entity_entire_sentence():
    # Sentence: "Barack Obama"
    words = [
        MockWord("Barack", "Entity=(PERSON)"),
        MockWord("Obama", "Entity=)"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Should tag Barack as B-PERSON, Obama as I-PERSON
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_multiple_entities_non_overlapping():
    # Sentence: "Barack Obama visited Paris"
    words = [
        MockWord("Barack", "Entity=(PERSON)"),
        MockWord("Obama", "Entity=)"),
        MockWord("visited"),
        MockWord("Paris", "Entity=(LOCATION)Entity=)"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_sentence_with_mixed_entities_and_none():
    # Sentence: "Alice went to Wonderland"
    words = [
        MockWord("Alice", "Entity=(PERSON)"),
        MockWord("went"),
        MockWord("to"),
        MockWord("Wonderland", "Entity=(LOCATION)Entity=)"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_multiple_sentences():
    # Two sentences, one with entity, one without
    s1 = MockSentence([MockWord("John", "Entity=(PERSON)"), MockWord("Smith", "Entity=)")])
    s2 = MockSentence([MockWord("Hello"), MockWord("world")])
    doc = MockDoc([s1, s2])
    codeflash_output = extract_sentences(doc); result = codeflash_output

# 2. EDGE TEST CASES

def test_entity_with_nested_parentheses():
    # Sentence: "John (CEO) Smith"
    words = [
        MockWord("John", "Entity=(PERSON)"),
        MockWord("(CEO)", "Entity=(TITLE)Entity=)"),
        MockWord("Smith", "Entity=)"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Should tag John as B-PERSON, (CEO) as B-TITLE, Smith as I-PERSON
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_entity_with_missing_close():
    # Sentence: "Barack Obama"
    words = [
        MockWord("Barack", "Entity=(PERSON)"),
        MockWord("Obama"),  # missing close
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Should not raise, but entity should persist
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_entity_with_extra_close():
    # Sentence: "Barack Obama"
    words = [
        MockWord("Barack", "Entity=(PERSON)"),
        MockWord("Obama", "Entity=)Entity=)"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Should raise assertion error and skip the sentence
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_entity_with_wrong_close():
    # Sentence: "Barack Obama"
    words = [
        MockWord("Barack", "Entity=(PERSON)"),
        MockWord("Obama", "Entity=(LOCATION)Entity=)"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Should raise assertion error and skip the sentence
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_entity_with_no_misc_field():
    # Sentence: "Hello world"
    words = [MockWord("Hello", None), MockWord("world", None)]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_entity_with_multiple_openings():
    # Sentence: "John Smith"
    words = [
        MockWord("John", "Entity=(PERSON)(TITLE)"),
        MockWord("Smith", "Entity=)Entity=)"),
    ]
    sentence = MockSentence(words)
    doc = MockDoc([sentence])
    # Should tag John as B-PERSON, Smith as I-PERSON (TITLE is ignored for tagging)
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_empty_sentence():
    # Sentence: ""
    sentence = MockSentence([])
    doc = MockDoc([sentence])
    codeflash_output = extract_sentences(doc); result = codeflash_output

def test_empty_doc():
    # Doc with no sentences
    doc = MockDoc([])
    codeflash_output = extract_sentences(doc); result = codeflash_output

# 3. LARGE SCALE TEST CASES

def test_large_doc_many_sentences():
    # 500 sentences, each with 2 words, alternating entity and non-entity
    sentences = []
    for i in range(500):
        if i % 2 == 0:
            words = [MockWord(f"Name{i}", "Entity=(PERSON)"), MockWord(f"Surname{i}", "Entity=)")]
        else:
            words = [MockWord(f"Hello{i}"), MockWord(f"World{i}")]
        sentences.append(MockSentence(words, sent_id=str(i)))
    doc = MockDoc(sentences)
    codeflash_output = extract_sentences(doc); result = codeflash_output
    # Check alternation
    for i, sent in enumerate(result):
        if i % 2 == 0:
            pass
        else:
            pass


def test_large_doc_all_entities():
    # 100 sentences, each with 10 words, all words are entities
    sentences = []
    for i in range(100):
        words = []
        for j in range(10):
            if j == 0:
                words.append(MockWord(f"Word{i}_{j}", "Entity=(TYPE)"))
            else:
                words.append(MockWord(f"Word{i}_{j}", "Entity=)"))
        sentences.append(MockSentence(words, sent_id=str(i)))
    doc = MockDoc(sentences)
    codeflash_output = extract_sentences(doc); result = codeflash_output
    for sent in result:
        for w in sent[1:]:
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-extract_sentences-mh9sqvn0`, then push.


codeflash-ai bot requested a review from mashraf-222 October 27, 2025 23:56

codeflash-ai bot added the labels `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: Medium` (Optimization Quality according to Codeflash) on Oct 27, 2025
