@@ -361,11 +361,26 @@ def est_len(occ):
         return lengths[idx][1] + 1
 
     # The cost of adding a dictionary word is just its storage size
-    # while its savings is the difference between the original
+    # while its savings is close to the difference between the original
     # huffman bit-length of the string and the estimated bit-length
     # of the dictionary word, times the number of times the word appears.
     #
-    # The difference between the two is the net savings, in bits.
+    # The savings is not strictly accurate because including a word into
+    # the Huffman tree bumps up the encoding lengths of all words in the
+    # same subtree. In the extreme case when the new word is so frequent
+    # that it gets a one-bit encoding, all other words will cost an extra
+    # bit each.
+    #
+    # Another source of inaccuracy is that compressed strings end up
+    # on byte boundaries, not bit boundaries, so saving 1 bit somewhere
+    # might not save a byte.
+    #
+    # In fact, when this change was first made, some translations (luckily,
+    # ones on boards not at all close to full) wasted up to 40 bytes,
+    # while the most constrained boards typically gained 100 bytes or
+    # more.
+    #
+    # The difference between the two is the estimated net savings, in bits.
     def est_net_savings(s, occ):
         savings = occ * (bit_length(s) - est_len(occ))
         cost = len(s) * bits_per_codepoint
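
The two sources of inaccuracy described in the new comment can be reproduced with a small standalone sketch. It is not taken from the patched file: `huffman_lengths`, `bytes_needed`, and the toy frequency table are hypothetical names and values chosen for illustration, and the real byte-padding details of the compressor may differ.

```python
import heapq
from itertools import count


def huffman_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table.

    Hypothetical helper for illustration; not the project's own code.
    """
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def bytes_needed(bits):
    """Round a bit count up to whole bytes (assumed padding model)."""
    return (bits + 7) // 8


# Toy frequency table (assumed values), before and after adding a very
# frequent dictionary word "W".
base = {"a": 4, "b": 2, "c": 1, "d": 1}
before = huffman_lengths(base)
after = huffman_lengths({**base, "W": 10})

# Extra bits paid by the existing symbols, which a per-word estimate
# such as est_net_savings does not see.
hidden_cost = sum(base[s] * (after[s] - before[s]) for s in base)
```

In this toy table `W` receives a one-bit code and every existing symbol becomes one bit longer, which is the extreme case the comment mentions. Separately, `bytes_needed(23)` and `bytes_needed(17)` are equal, so trimming a few bits from a string does not always shrink its stored size.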