@jepler jepler commented Jul 9, 2021

Try to accurately measure the costs of including a word in the dictionary vs the gains from using it in messages.

This saves about 160 bytes on trinket_m0 ja, which is the fullest translation for that board both before and after this change. All other translations on the same board also see savings, ranging from 24 to 228 bytes. The list below is sorted with the fullest translations (before the change) at the top. The numbers shown are the "bytes free in flash firmware" reported during a local build.

Translation     Before  After   Savings
ja              1164    1324    160
de_DE           1260    1396    136
fr              1424    1652    228
zh_Latn_pinyin  1448    1520    72
pt_BR           1584    1736    152
pl              1592    1640    48
es              1724    1816    92
ko              1724    1816    92
fil             1764    1800    36
it_IT           1896    2040    144
nl              1956    2136    180
ID              2072    2180    108
cs              2124    2148    24
sv              2340    2448    108
en_x_pirate     2644    2740    96
en_GB           2652    2752    100
el              2656    2768    112
en_US           2656    2768    112
hi              2656    2768    112

Deltas for adafruit_proxlight_trinkey_m0 in CI:

ja              1156    1292    136
de_DE           1280    1388    108
zh_Latn_pinyin  1472    1560    ...
fr              1508    1680
pt_BR           1604    1724
pl              1608    1648
es              1740    1820
fil             1764    1800
it_IT           1912    2044
nl              1940    2132
ID              2080    2204
sv              2340    2248
en_GB           2640    2740
en_x_pirate     2652    2748
en_US           2664    2748

jepler added 3 commits July 9, 2021 12:45
Try to accurately measure the costs of including a word in the dictionary
vs the gains from using it in messages.

This saves about 160 bytes on trinket_m0 ja, the fullest translation
for that board.  Other translations on the same board all have savings,
ranging from 24 to 228 bytes.

```
Translation     Before  After   Savings
ja              1164    1324    160
de_DE           1260    1396    136
fr              1424    1652    228
zh_Latn_pinyin  1448    1520    72
pt_BR           1584    1736    152
pl              1592    1640    48
es              1724    1816    92
ko              1724    1816    92
fil             1764    1800    36
it_IT           1896    2040    144
nl              1956    2136    180
ID              2072    2180    108
cs              2124    2148    24
sv              2340    2448    108
en_x_pirate     2644    2740    96
en_GB           2652    2752    100
el              2656    2768    112
en_US           2656    2768    112
hi              2656    2768    112
```
I was puzzled by why the dictionary words were sorted by length.
It was because TextSplitter sorted its parameter, instead of a copy.

This doesn't affect encoding size, but does affect the encoding NUMBER
of the found words.  We'll deliberately restore sorting by length next,
for other reasons, but not by spooky action.
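
The aliasing problem described in the commit message above can be reproduced in a few lines. This is only a sketch: `split_in_place` and `split_on_copy` are hypothetical stand-ins, not the real `TextSplitter`, and the word list is made up. It shows how `list.sort()` reorders the caller's list, while `sorted()` works on a copy.

```python
def split_in_place(words):
    # Hypothetical helper: sorts the caller's list itself (the surprising behaviour).
    words.sort(key=len, reverse=True)
    return words[0]


def split_on_copy(words):
    # Hypothetical helper: sorts a copy, so the caller's order is preserved.
    ordered = sorted(words, key=len, reverse=True)
    return ordered[0]


words = ["pin", "invalid", "error"]

longest = split_on_copy(words)
print(longest, words)   # caller's list keeps its original order

longest = split_in_place(words)
print(longest, words)   # caller's list is now sorted by length
```

Either version returns the same longest word; only the side effect on the caller's list differs, which matches the observation that the encoding size is unaffected while the numbering of the words changes.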

jepler commented Jul 9, 2021

The above stats apply to the FIRST commit only. The second commit may affect things a little bit, but only for the better. The problem with potentially accepting "negative valued" dictionary words only came into play with another change I'm testing locally which allowed for more dictionary terms.
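
To make the cost-versus-gain idea and the "negative valued" words concrete, here is a minimal sketch. It is not the heuristic used in this PR: the function names, the fixed 5 bits per character, the 8-bit dictionary code, and the 8-bit per-entry overhead are all assumptions chosen for illustration.

```python
def word_value(word, uses, char_bits, code_bits, entry_overhead_bits=8):
    """Estimated net bits saved by putting `word` in the dictionary.

    All sizes are illustrative assumptions, not the PR's real numbers.
    """
    spelled_out_bits = sum(char_bits[ch] for ch in word)
    # Gain: every use is replaced by a (shorter) dictionary code.
    gain = uses * (spelled_out_bits - code_bits)
    # Cost: the word itself must be stored once, plus some per-entry overhead.
    cost = spelled_out_bits + entry_overhead_bits
    return gain - cost


def pick_words(candidates, char_bits, code_bits):
    """Keep only words with a positive net value, best first."""
    scored = {w: word_value(w, n, char_bits, code_bits) for w, n in candidates.items()}
    return sorted((w for w, v in scored.items() if v > 0), key=lambda w: -scored[w])


# Assumed: every letter costs 5 bits when spelled out in a message.
char_bits = {ch: 5 for ch in "abcdefghijklmnopqrstuvwxyz"}
# Hypothetical candidate words and how often each appears in the messages.
candidates = {"error": 3, "invalid": 1, "pin": 4}

print(pick_words(candidates, char_bits, code_bits=8))  # ['error', 'pin']
```

In this toy model "invalid" is used only once, so storing it costs more than it saves and it is rejected. That is the kind of negative-valued entry a more accurate cost measure is meant to filter out.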

jepler commented Jul 11, 2021

I grabbed the stats across all boards. Here are the savings for all the builds that were within 200 bytes of full before this PR:

Board                Translation  Before  After  Savings
sensebox_mcu         ja           4       160    156
sensebox_mcu         de_DE        100     232    132
arduino_nano_33_iot  ja           112     268    156
feather_m0_rfm69     ja           144     324    180
blm_badge            ja           168     280    112
blm_badge            de_DE        172     272    100
arduino_mkr1300      ja           192     352    160
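
A selection like the one above could be produced from per-build "bytes free" numbers roughly as follows. This is a sketch, not the script actually used: the `near_full` helper is hypothetical, only a few rows from this comment are included, and "within 200 bytes of full" is assumed to mean fewer than 200 bytes free before the change.

```python
# (board, translation, bytes free before, bytes free after); rows copied from this comment.
builds = [
    ("sensebox_mcu", "ja", 4, 160),
    ("blm_badge", "de_DE", 172, 272),
    ("arduino_mkr1300", "ja", 192, 352),
    ("pygamer", "pl", 14216, 14192),
]


def near_full(rows, threshold=200):
    """Return builds with fewer than `threshold` bytes free before the change.

    Each result row gains a fifth field: savings = after - before.
    """
    picked = [
        (board, lang, before, after, after - before)
        for board, lang, before, after in rows
        if before < threshold
    ]
    return sorted(picked, key=lambda row: row[2])  # fullest builds first


for row in near_full(builds):
    print(*row)
```

With these rows, pygamer pl is excluded because it has far more than 200 bytes free, even though its savings are slightly negative.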

Some boards were a slight loss (why?) but none of them are seriously full:

Board                           Translation  Before  After   Savings
makerdiary_nrf52840_m2_devkit   ID           300712  300672  -40
makerdiary_nrf52840_m2_devkit   pl           299408  299372  -36
pyboard_v11                     pl           450184  450148  -36
same54_xplained                 pl           553800  553764  -36
stm32f4_discovery               pl           458544  458512  -32
feather_stm32f405_express       pl           497164  497136  -28
pygamer                         pl           14216   14192   -24
pybadge                         pl           14628   14604   -24
winterbloom_big_honking_button  fil          49000   48976   -24
makerdiary_m60_keyboard         pl           301296  301280  -16
stm32f412zg_discovery           pl           477960  477948  -12
grandcentral_m4_express         pl           544380  544372  -8
feather_m4_can                  pl           22308   22304   -4
makerdiary_m60_keyboard         ID           302604  302600  -4

Overall graph of the net change by board, sorted by savings: [image]

@jepler jepler requested a review from tyomitch July 11, 2021 00:07

@tyomitch tyomitch left a comment

The improvement in the heuristic's accuracy is excellent; I just wish the comment describing the new heuristic were more accurate.

Thanks to tyomitch for suggesting the comment could be more accurate.
@jepler jepler requested a review from tyomitch July 11, 2021 13:58
@jepler jepler merged commit 22e8a50 into adafruit:main Jul 11, 2021
@jepler jepler deleted the dictionary-better-heuristic branch November 3, 2021 21:09