Commit 0b8b16f

increase comment on accuracy of the net savings estimate function
Thanks to tyomitch for suggesting the comment could be more accurate.
1 parent fd4a7fc commit 0b8b16f

1 file changed

+17
-2
lines changed

py/makeqstrdata.py

Lines changed: 17 additions & 2 deletions
@@ -361,11 +361,26 @@ def est_len(occ):
     return lengths[idx][1] + 1
 
 # The cost of adding a dictionary word is just its storage size
-# while its savings is the difference between the original
+# while its savings is close to the difference between the original
 # huffman bit-length of the string and the estimated bit-length
 # of the dictionary word, times the number of times the word appears.
 #
-# The difference between the two is the net savings, in bits.
+# The savings is not strictly accurate because including a word into
+# the Huffman tree bumps up the encoding lengths of all words in the
+# same subtree. In the extreme case when the new word is so frequent
+# that it gets a one-bit encoding, all other words will cost an extra
+# bit each.
+#
+# Another source of inaccuracy is that compressed strings end up
+# on byte boundaries, not bit boundaries, so saving 1 bit somewhere
+# might not save a byte.
+#
+# In fact, when this change was first made, some translations (luckily,
+# ones on boards not at all close to full) wasted up to 40 bytes,
+# while the most constrained boards typically gained 100 bytes or
+# more.
+#
+# The difference between the two is the estimated net savings, in bits.
 def est_net_savings(s, occ):
     savings = occ * (bit_length(s) - est_len(occ))
     cost = len(s) * bits_per_codepoint
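
The two sources of inaccuracy described in the new comment can be reproduced with a toy example. The sketch below is not code from makeqstrdata.py: `huffman_code_lengths`, `bytes_needed`, the symbol names and the frequencies are all hypothetical, chosen only so that the added word is frequent enough to receive a one-bit code.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def bytes_needed(bits):
    """Compressed strings are stored in whole bytes, so round up."""
    return (bits + 7) // 8


# Hypothetical symbol frequencies before any dictionary word is added.
before = {"a": 8, "b": 4, "c": 2, "d": 2}
lengths_before = huffman_code_lengths(before)

# Add a word that occurs more often than all other symbols combined.
after = dict(before, WORD=20)
lengths_after = huffman_code_lengths(after)

# Extra bits paid by the existing symbols, which the estimate ignores.
extra_bits = sum(
    before[s] * (lengths_after[s] - lengths_before[s]) for s in before
)

print("before:", lengths_before)
print("after: ", lengths_after)
print("hidden cost in bits:", extra_bits)

# Byte rounding: saving one bit only helps when it crosses a byte boundary.
print(bytes_needed(17), "->", bytes_needed(16))  # one byte saved
print(bytes_needed(18), "->", bytes_needed(17))  # nothing saved
```

In this toy input the new word takes the one-bit code and every existing symbol becomes one bit longer, which is the extreme case the comment mentions. The rounding helper shows why a one-bit saving on a string does not always shrink its stored size.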
