

@dukebody

Trying to make the transliteration process faster without C compilation, I've come up with something that seems good enough.

Following the existing section caching, I've set up another dict cache to store per-character equivalences.
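
As a rough illustration of the idea, here is a minimal sketch (not the exact code in this PR); _lookup_replacement is a hypothetical stand-in for unidecode's real section-table lookup:

import unicodedata

# Character-level cache: maps a character to its replacement so repeated
# characters skip the lookup entirely.
_char_cache = {}

def _lookup_replacement(char):
    # Hypothetical stand-in for the real section lookup: strip combining
    # marks via NFKD decomposition.
    decomposed = unicodedata.normalize('NFKD', char)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def transliterate(string):
    retval = []
    for char in string:
        if ord(char) < 0x80:
            # Plain ASCII passes through untouched.
            retval.append(char)
            continue
        try:
            # Cache hit: reuse the replacement computed on an earlier call.
            repl = _char_cache[char]
        except KeyError:
            # Cache miss: do the full lookup once and remember the result.
            repl = _lookup_replacement(char)
            _char_cache[char] = repl
        retval.append(repl)
    return ''.join(retval)

print(transliterate("Hello world! ñ"))  # -> Hello world! n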

Speed tests

Basic case

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 4.09 usec per loop

New code with char caching:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 2.41 usec per loop

This case represents roughly a 40% reduction in time per call.

Best-case scenario (all identical chars)

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 1.05 usec per loop

New caching code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 0.848 usec per loop

Bad-case scenario (random chars)

Old code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.9 usec per loop

New code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.2 usec per loop

Considerations

  1. I know the bad-case scenario I presented is not really the worst, since the worst would be transliterating different characters on every call. That's harder to test, but it is also extremely unlikely in practice. I believe most uses will transliterate a reduced set of characters, so the char cache will kick in and make things faster. The bigger speedups probably come from transliterating non-ASCII chars: for example, transliterating "áéíóúñ", the most common non-ASCII chars in Spanish, takes less than 1 usec on average with caching and almost 3 usecs without it (roughly a 3x speedup). A possible benchmark command for this is sketched after this list.
  2. The character cache takes memory, and I haven't run memory tests because I don't know how to measure this properly (a rough estimate with sys.getsizeof is also sketched below). However, I believe that (a) a dictionary {char → char} shouldn't take much memory and (b) most practical uses will transliterate a reduced set of characters (from a reduced set of languages), so the cache shouldn't grow too big.
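
For the Spanish-characters figure above, the benchmark would look like the other timeit runs; this is my reconstruction of the command, not necessarily the exact one used:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("áéíóúñ")'

And a rough way to estimate the cache's memory footprint from Python, assuming the cache is a plain dict mapping single characters to short replacement strings (sys.getsizeof only measures the objects it is given directly, so treat the result as an approximation):

import sys

# Build a dict shaped like the char cache: 1000 single-character keys,
# each mapped to a short replacement string.
cache = {chr(cp): str(cp) for cp in range(0x00C0, 0x00C0 + 1000)}

# Size of the dict itself plus its keys and values.
size = sys.getsizeof(cache)
size += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in cache.items())
print('~%.1f KiB for %d cached characters' % (size / 1024.0, len(cache)))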

