Conversation

codeflash-ai bot commented May 21, 2025

⚡️ This pull request contains optimizations for PR #231

If you approve this dependent PR, these changes will be merged into the original PR branch `remove-tiktoken`.

This PR will be automatically closed if the original PR is merged.


📄 70% (1.70x) speedup for `encoded_tokens_len` in `codeflash/code_utils/code_utils.py`

⏱️ Runtime : 40.1 microseconds → 23.6 microseconds (best of 237 runs)

⚡️ This change was evaluated against the following benchmarks:

| Benchmark File :: Function | Original Runtime | Expected New Runtime | Speedup |
|---|---|---|---|
| tests.benchmarks.test_benchmark_code_extract_code_context::test_benchmark_extract | 13.4 seconds | 13.4 seconds | 0.00% |

📝 Explanation and details

Here is an optimized version of your code. The bottleneck is minimal, since the computation is a single multiplication and a cast to `int`, which is already fast. However, a very minor optimization is possible: avoid the `int()` call by using integer division directly.

You can also remove the `from __future__ import annotations` line if nothing in the module relies on postponed evaluation of annotations; the import has been available since Python 3.7, though it is not the default behavior.

Here is an optimized version.
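A minimal sketch, assuming the original body was `return int(len(s) * 0.25)`:

def encoded_tokens_len(s: str) -> int:
    """Estimate the encoded token count as roughly one token per four characters."""
    # Integer floor division gives the same result as int(len(s) * 0.25)
    # for every non-negative length, without the float multiply and truncation.
    return len(s) // 4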

This avoids floating-point multiplication and conversion overhead, and gives the same result as `int(len(s) * 0.25)` for non-negative integer `len(s)`.
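That equivalence is easy to spot-check (a hypothetical snippet, not part of the PR):

# Truncation toward zero and floor division agree on non-negative values,
# so int(n * 0.25) == n // 4 for every non-negative integer n.
assert all(n // 4 == int(n * 0.25) for n in range(100_000))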

Correctness verification report:

| Test | Status |
|---|---|
| ⏪ Replay Tests | 3 Passed |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 🌀 Generated Regression Tests | 63 Passed |
📊 Tests Coverage
🌀 Generated Regression Tests Details
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from codeflash.code_utils.code_utils import encoded_tokens_len

# unit tests

# ----------------------
# Basic Test Cases
# ----------------------

def test_empty_string():
    # An empty string should yield 0 tokens
    codeflash_output = encoded_tokens_len('')

def test_single_character():
    # 1 character, 1*0.25 = 0.25 -> int() = 0
    codeflash_output = encoded_tokens_len('a')

def test_three_characters():
    # 3*0.25 = 0.75 -> int() = 0
    codeflash_output = encoded_tokens_len('abc')

def test_four_characters():
    # 4*0.25 = 1
    codeflash_output = encoded_tokens_len('abcd')

def test_eight_characters():
    # 8*0.25 = 2
    codeflash_output = encoded_tokens_len('abcdefgh')

def test_typical_sentence():
    # "Hello world!" is 12 chars, 12*0.25=3
    codeflash_output = encoded_tokens_len('Hello world!')

def test_typical_sentence_with_spaces():
    # "The quick brown fox" is 19 chars, 19*0.25=4.75 -> 4
    codeflash_output = encoded_tokens_len('The quick brown fox')

# ----------------------
# Edge Test Cases
# ----------------------

def test_non_ascii_characters():
    # Unicode emoji: each emoji is a single codepoint, but may be more bytes
    # 4 emojis: 4*0.25=1
    codeflash_output = encoded_tokens_len('😀😁😂🤣')

def test_mixed_ascii_and_unicode():
    # 3 ascii + 2 emoji = 5 chars, 5*0.25=1.25 -> 1
    codeflash_output = encoded_tokens_len('ab😀😁')

def test_long_word_no_spaces():
    # 'supercalifragilistic' is 20 chars, 20*0.25 = 5
    codeflash_output = encoded_tokens_len('supercalifragilistic')

def test_only_spaces():
    # 10 spaces, 10*0.25=2.5 -> 2
    codeflash_output = encoded_tokens_len(' ' * 10)

def test_newlines_and_tabs():
    # '\n\t\n\t' is 4 chars, 4*0.25=1
    codeflash_output = encoded_tokens_len('\n\t\n\t')

def test_highest_value_before_next_token():
    # 7 chars, 7*0.25=1.75 -> 1
    codeflash_output = encoded_tokens_len('1234567')
    # 8 chars, 8*0.25=2
    codeflash_output = encoded_tokens_len('12345678')

def test_large_unicode_string():
    # 100 unicode chars
    s = 'ü' * 100
    codeflash_output = encoded_tokens_len(s)

def test_surrogate_pairs():
    # Some emojis are surrogate pairs in UTF-16, but Python counts codepoints
    s = '👩‍👩‍👧‍👦'  # Family emoji, actually 7 codepoints
    codeflash_output = encoded_tokens_len(s)  # 7*0.25=1.75->1

def test_non_string_input_raises():
    # Should raise TypeError if input is not a string
    with pytest.raises(TypeError):
        encoded_tokens_len(123)
    with pytest.raises(TypeError):
        encoded_tokens_len(None)
    with pytest.raises(TypeError):
        encoded_tokens_len(['a', 'b'])

# ----------------------
# Large Scale Test Cases
# ----------------------

def test_long_string_exact_1000():
    # 1000 chars, 1000*0.25=250
    s = 'a' * 1000
    codeflash_output = encoded_tokens_len(s)

def test_long_string_non_multiple_of_four():
    # 999 chars, 999*0.25=249.75 -> 249
    s = 'b' * 999
    codeflash_output = encoded_tokens_len(s)

def test_large_mixed_string():
    # 500 ascii + 500 unicode = 1000 chars, should be 250
    s = 'a' * 500 + '😀' * 500
    codeflash_output = encoded_tokens_len(s)

def test_performance_large_string():
    # Should not take excessive time for 999 chars
    s = 'x' * 999
    codeflash_output = encoded_tokens_len(s)

def test_all_ascii_printable():
    # string.printable is 100 chars, 100*0.25 = 25
    import string
    s = string.printable
    expected = int(len(s)*0.25)
    codeflash_output = encoded_tokens_len(s)

def test_all_unicode_plane_1_subset():
    # 100 different unicode chars from U+1000 to U+1063
    s = ''.join(chr(0x1000 + i) for i in range(100))
    codeflash_output = encoded_tokens_len(s)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import string  # used for generating test strings

# imports
import pytest  # used for our unit tests
from codeflash.code_utils.code_utils import encoded_tokens_len

# unit tests

# ----------------------
# Basic Test Cases
# ----------------------

def test_empty_string():
    # Test with empty string should return 0
    codeflash_output = encoded_tokens_len("")

def test_single_ascii_character():
    # Test with a single ASCII character should return 0 (since int(1*0.25) == 0)
    codeflash_output = encoded_tokens_len("a")

def test_four_ascii_characters():
    # Test with 4 ASCII characters should return 1 (since int(4*0.25) == 1)
    codeflash_output = encoded_tokens_len("abcd")

def test_regular_sentence():
    # Test with a regular English sentence
    s = "The quick brown fox jumps over the lazy dog."
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_sentence_with_spaces_and_punctuation():
    # Test with sentence containing spaces and punctuation
    s = "Hello, world! How are you?"
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_sentence_with_numbers():
    # Test with sentence containing numbers
    s = "1234567890"
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

# ----------------------
# Edge Test Cases
# ----------------------

def test_string_length_just_below_token_boundary():
    # Length 3: int(3*0.25) == 0
    codeflash_output = encoded_tokens_len("abc")

def test_string_length_exact_token_boundary():
    # Length 4: int(4*0.25) == 1
    codeflash_output = encoded_tokens_len("abcd")

def test_string_length_just_above_token_boundary():
    # Length 5: int(5*0.25) == 1
    codeflash_output = encoded_tokens_len("abcde")

def test_non_ascii_characters():
    # Test with non-ASCII (e.g., emoji, accented characters)
    s = "ñáü😊"
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_mixed_unicode_and_ascii():
    # Test with a mix of ASCII and Unicode
    s = "hello世界😊"
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_whitespace_only():
    # Test with string of only spaces
    s = "    "
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_newlines_and_tabs():
    # Test with newlines and tabs
    s = "\n\t\n\t"
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_long_repeating_character():
    # Test with a long string of a single character
    s = "a" * 100
    expected = int(100 * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_surrogate_pairs():
    # Test with characters outside BMP (e.g., emoji that are surrogate pairs)
    s = "😀" * 8  # Each emoji is one codepoint in Python 3
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_highest_unicode_codepoint():
    # Test with the highest valid Unicode codepoint
    s = chr(0x10FFFF)
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_null_character():
    # Test with null character in string
    s = "\x00" * 8
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

# ----------------------
# Large Scale Test Cases
# ----------------------

def test_very_long_ascii_string():
    # Test with a long ASCII string (length 1000)
    s = "a" * 1000
    expected = int(1000 * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_very_long_unicode_string():
    # Test with a long Unicode string (length 1000)
    s = "😊" * 1000
    expected = int(1000 * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_mixed_long_string():
    # Test with a long string mixing ASCII, digits, punctuation, and Unicode
    chars = string.ascii_letters + string.digits + string.punctuation + "😊ñ"
    s = "".join(chars[i % len(chars)] for i in range(999))
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_large_string_with_newlines():
    # Test with a large string containing lots of newlines
    s = ("abc\n" * 250)  # 4*250 = 1000 chars
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

def test_large_string_with_varied_whitespace():
    # Test with a large string containing spaces, tabs, and newlines
    s = (" \t\n" * 333) + " "  # 3*333 + 1 = 1000
    expected = int(len(s) * 0.25)
    codeflash_output = encoded_tokens_len(s)

# ----------------------
# Specificity/Mutation-Resistant Cases
# ----------------------

@pytest.mark.parametrize("input_str,expected", [
    ("", 0),                  # empty string
    ("a", 0),                 # 1 char
    ("ab", 0),                # 2 chars
    ("abc", 0),               # 3 chars
    ("abcd", 1),              # 4 chars
    ("abcde", 1),             # 5 chars
    ("abcdef", 1),            # 6 chars
    ("abcdefgh", 2),          # 8 chars
    ("abcdefghij", 2),        # 10 chars
    ("abcdefghijklmno", 3),   # 15 chars
    ("abcdefghijklmnop", 4),  # 16 chars
    ("abcdefghijklmnopqrst", 5), # 20 chars
    ("a" * 999, int(999 * 0.25)), # boundary large
])
def test_parametrized_cases(input_str, expected):
    # Parametrized test for various lengths and boundaries
    codeflash_output = encoded_tokens_len(input_str)

def test_non_string_input_raises():
    # Test that non-string input raises a TypeError
    with pytest.raises(TypeError):
        encoded_tokens_len(123)  # type: ignore

    with pytest.raises(TypeError):
        encoded_tokens_len(None)  # type: ignore

    with pytest.raises(TypeError):
        encoded_tokens_len([1, 2, 3])  # type: ignore
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from codeflash.code_utils.code_utils import encoded_tokens_len

def test_encoded_tokens_len():
    encoded_tokens_len('')

To edit these changes, run `git checkout codeflash/optimize-pr231-2025-05-21T01.49.04` and push.

Codeflash
