

@dukebody

Trying to make the transliteration process faster without C compilation, I've come up with something that seems good enough.

Following the existing section caching, I've set up another dict cache to store per-character equivalences.
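
As a rough illustration of the idea, here is a minimal sketch (not the exact code in this PR); _lookup_replacement is a hypothetical stand-in for unidecode's real section-table lookup:

import unicodedata

# Character-level cache: maps a character to its replacement so repeated
# characters skip the lookup entirely.
_char_cache = {}

def _lookup_replacement(char):
    # Hypothetical stand-in for the real section lookup: strip combining
    # marks via NFKD decomposition.
    decomposed = unicodedata.normalize('NFKD', char)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def transliterate(string):
    retval = []
    for char in string:
        if ord(char) < 0x80:
            # Plain ASCII passes through untouched.
            retval.append(char)
            continue
        try:
            # Cache hit: reuse the replacement computed on an earlier call.
            repl = _char_cache[char]
        except KeyError:
            # Cache miss: do the full lookup once and remember the result.
            repl = _lookup_replacement(char)
            _char_cache[char] = repl
        retval.append(repl)
    return ''.join(retval)

print(transliterate("Hello world! ñ"))  # -> Hello world! n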

Speed tests

Basic case

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 4.09 usec per loop

New code with char caching:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 2.41 usec per loop

This case represents roughly a 40% reduction in time per call.

Best-case scenario (all identical chars)

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 1.05 usec per loop

New caching code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 0.848 usec per loop

Bad-case scenario (random chars)

Old code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.9 usec per loop

New code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.2 usec per loop

Considerations

  1. I know the bad-case scenario I presented is not really the worst, since the worst would be transliterating different characters on every call. That's harder to test, but it is also extremely unlikely in practice. I believe most uses will transliterate a reduced set of characters, so the char cache will kick in and make things faster. The bigger speedups probably come from transliterating non-ASCII chars: for example, transliterating "áéíóúñ", the most common non-ASCII chars in Spanish, takes less than 1 usec on average with caching and almost 3 usecs without it (roughly a 3x speedup). A possible benchmark command for this is sketched after this list.
  2. The character cache takes memory, and I haven't run memory tests because I don't know how to measure this properly (a rough estimate with sys.getsizeof is also sketched below). However, I believe that (a) a dictionary {char → char} shouldn't take much memory and (b) most practical uses will transliterate a reduced set of characters (from a reduced set of languages), so the cache shouldn't grow too big.
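
For the Spanish-characters figure above, the benchmark would look like the other timeit runs; this is my reconstruction of the command, not necessarily the exact one used:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("áéíóúñ")'

And a rough way to estimate the cache's memory footprint from Python, assuming the cache is a plain dict mapping single characters to short replacement strings (sys.getsizeof only measures the objects it is given directly, so treat the result as an approximation):

import sys

# Build a dict shaped like the char cache: 1000 single-character keys,
# each mapped to a short replacement string.
cache = {chr(cp): str(cp) for cp in range(0x00C0, 0x00C0 + 1000)}

# Size of the dict itself plus its keys and values.
size = sys.getsizeof(cache)
size += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in cache.items())
print('~%.1f KiB for %d cached characters' % (size / 1024.0, len(cache)))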

