
BlocksOutputBuffer causes a performance regression in bz2, lzma and zlib modules #101260

@rhpvorderman

Description


Bug report

The _BlocksOutputBuffer was introduced for all compression modules (bz2, lzma and zlib) in Python 3.10 (#21740).

It performs well at its advertised quality: speedy and memory-efficient creation of very large in-memory blocks.

However, it came at the cost of:

  • Lots of extra code
  • A performance regression in the common use case

The common use case is small data in memory. Holding large in-memory buffers (multiple gigabytes) is an anti-pattern; streaming interfaces, which in turn use small buffers, should be used instead.
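For illustration, the streaming pattern looks roughly like this (a minimal sketch using the real zlib.compressobj API; the 16K chunk size is an arbitrary choice):

import zlib

CHUNK_SIZE = 16 * 1024  # small buffer, as is typical for streaming use

def stream_compress(fin, fout):
    # Compress fin into fout chunk by chunk instead of building one
    # large in-memory result.
    compressor = zlib.compressobj(level=1)
    while chunk := fin.read(CHUNK_SIZE):
        fout.write(compressor.compress(chunk))
    fout.write(compressor.flush())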

For a small buffer, the BlocksOutputBuffer needs to create a list and a bytes object, while the 3.9 method only creates the bytes object. When the initial bytes object is too small, the BlocksOutputBuffer creates another bytes object and resizes the list; the 3.9 method simply resizes the bytes object.
The 3.9 method does not scale well to a large number of resizes, while the BlocksOutputBuffer does (due to cost amortization in the list resize and the fact that it never resizes bytes objects, saving on memcpy), but this only applies to very large buffers, and those should be rare.
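To make the difference concrete, here is a pure-Python model of the two growth strategies (illustrative only: the real implementations are in C, and the function names are mine):

def resize_strategy(total_size, init=16 * 1024):
    # 3.9-style: a single buffer, doubled in place; in C, resizing a
    # bytes object may memcpy the entire buffer to a new location.
    buf = bytearray(init)
    while len(buf) < total_size:
        buf += bytes(len(buf))  # double the one buffer
    del buf[total_size:]  # final downsize to the bytes actually written
    return bytes(buf)

def blocks_strategy(total_size, init=16 * 1024):
    # BlocksOutputBuffer-style: a list of blocks; old data is never
    # copied until the final join.
    blocks = [bytearray(init)]
    allocated = init
    while allocated < total_size:
        blocks.append(bytearray(allocated))  # new block, no memcpy of old data
        allocated *= 2
    return bytes(b"".join(blocks)[:total_size])

For a small output both strategies produce a single bytes object, but the blocks strategy has also created and grown a list; that constant overhead is what the microbenchmarks below measure.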

This can be shown by removing the BlocksOutputBuffer and reverting to the 3.9 method of arranging the output buffer; I have made a branch here: https://github.com/rhpvorderman/cpython/tree/noblocks

Microbenchmarks: taking the current README.rst (10,044 bytes) and compressing with compression level 1.
BlocksOutputBuffer:

$ ./python -m pyperf timeit -s 'data=open("README.rst", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 180 us +- 1 us

arrange_output_buffer:

$ ./python -m pyperf timeit -s 'data=open("README.rst", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 174 us +- 1 us

The same holds when taking a larger file, Lib/_pydecimal.py (229,220 bytes uncompressed, 60,974 bytes compressed), which certainly requires resizing the initial 16K buffer.
BlocksOutputBuffer:

$ ./python -m pyperf timeit -s 'data=open("Lib/_pydecimal.py", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 2.37 ms +- 0.01 ms

arrange_output_buffer:

$ ./python -m pyperf timeit -s 'data=open("Lib/_pydecimal.py", "rb").read(); from zlib import compress' 'compress(data, level=1)'
.....................
Mean +- std dev: 2.28 ms +- 0.01 ms

Q.E.D. The BlocksOutputBuffer always loses against simple bytes resizing for smaller buffers, and _pydecimal.py is more than 200K, so that is already quite a big input.

Additionally:

 Modules/zlibmodule.c | 417 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------------------------------------------------------------------------------------
 1 file changed, 117 insertions(+), 300 deletions(-)

And this does not count the removal of the extra header file that provides the BlocksOutputBuffer.

The BlocksOutputBuffer is a nice piece of work when taken in isolation, but it optimizes for the pathological case rather than the common case. It is detrimental to the common case and it requires a lot of extra code that needs to be maintained.

If arrange_output_buffer turns out to be slow on larger buffers, I think several optimizations can still be done which are not as invasive as the _BlocksOutputBuffer. For instance, when zlib.decompress is called on a 100MB object, it makes sense to start the output buffer at 100MB rather than at 16K (the default in 3.9). This severely limits the number of resizes required. The same goes for zlib.compress: if the input is 100MB, the output is going to be at most 100MB plus header and trailer size. Reserving 100MB first and then downscaling to the compressed size (say 20MB) is much quicker than growing a 16K buffer to 20MB by doublings alone.
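A Python-level sketch of that sizing heuristic (hypothetical: initial_output_size is an invented helper, and the real change would live in the C modules):

DEFAULT_INIT = 16 * 1024  # the 3.9 default initial buffer size

def initial_output_size(input_len, decompressing):
    # Hypothetical heuristic: derive the initial output buffer size
    # from the input size instead of always starting at 16K.
    if decompressing:
        # Decompressed output is rarely smaller than the input, so
        # starting at the input size skips many doublings.
        return max(input_len, DEFAULT_INIT)
    # For compression the output is bounded by roughly the input size
    # plus a small per-stream overhead (cf. zlib's compressBound), so
    # reserve that once and shrink to the actual size afterwards.
    return max(input_len + input_len // 1000 + 64, DEFAULT_INIT)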

Labels: extension-modules (C modules in the Modules dir), pending (the issue will be closed if no feedback is provided), performance (performance or resource usage), type-bug (an unexpected behavior, bug, or error)
