Allow Zstandard to decompress multiple concatenated frames #757

Open: wants to merge 9 commits into base: main

mkitti
Contributor

@mkitti mkitti commented Jun 28, 2025

Currently, numcodecs assumes that a buffer (a pointer and a size) containing Zstandard-compressed data consists of only a single frame. This is not necessarily the case: a single buffer may contain multiple frames. Such a buffer is easily constructed by concatenating two encoded buffers. When decoding it, numcodecs.Zstd currently reports that the destination buffer is too small.

In [1]: import numcodecs

In [2]: codec = numcodecs.Zstd()

In [3]: hello_world = codec.encode(b"Hello world!")

In [4]: codec.decode(hello_world*2)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 1
----> 1 codec.decode(hello_world*2)

File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:221, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: b'Destination buffer is too small'

It raises the same error even when given an output buffer of sufficient size.

In [5]: dest_buffer = bytearray(len(b"Hello world!")*2)

In [6]: codec.decode(hello_world*2, out=dest_buffer)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 codec.decode(hello_world*2, out=dest_buffer)

File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:221, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: b'Destination buffer is too small'

The expectation is that the decompressed output should simply be the concatenation of the original uncompressed buffers.

In [7]: import pyzstd

In [8]: pyzstd.decompress(hello_world*2)
Out[8]: b'Hello world!Hello world!'

The flaw is that numcodecs.Zstd currently runs ZSTD_getFrameContentSize only once to get the uncompressed size of the first frame. It does not consider whether there may be other frames in the source buffer.

This pull request introduces a Cython function findTotalContentSize which iterates through all the frames in the buffer, gets the sizes of the uncompressed data from each frame, and then returns the sum. This allows numcodecs.Zstd to decompress the entire buffer.

>>> import numcodecs
>>> codec = numcodecs.Zstd()
>>> hello_world = codec.encode(b"Hello World!")
>>> codec.decode(hello_world*2)
b'Hello World!Hello World!'
>>> out = bytearray(len("Hello World!")*2)
>>> codec.decode(hello_world*2, out=out)
bytearray(b'Hello World!Hello World!')
>>> out
bytearray(b'Hello World!Hello World!')
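
The frame walk described above can be sketched in pure Python. This is not the Cython `findTotalContentSize` from this pull request, which works against libzstd. It is an illustration that parses the frame layout from RFC 8878 directly so it runs without any zstd binding. The names `frame_sizes`, `total_content_size`, and `stored_frame` are hypothetical, and `stored_frame` only builds a tiny uncompressed (raw-block) frame so the example has input to work on.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528       # magic number of a regular Zstandard frame
SKIPPABLE_MAGIC = 0x184D2A50  # skippable frames use 0x184D2A50..0x184D2A5F


def frame_sizes(buf, pos=0):
    """Return (compressed_size, content_size) for the frame starting at pos.

    content_size is None when the frame header does not record it.
    """
    (magic,) = struct.unpack_from("<I", buf, pos)
    if magic & 0xFFFFFFF0 == SKIPPABLE_MAGIC:
        # Skippable frame: magic, 4-byte user data size, then user data.
        (user_size,) = struct.unpack_from("<I", buf, pos + 4)
        return 8 + user_size, 0
    if magic != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")

    descriptor = buf[pos + 4]
    fcs_flag = descriptor >> 6
    single_segment = (descriptor >> 5) & 1
    has_checksum = (descriptor >> 2) & 1
    dict_id_size = (0, 1, 2, 4)[descriptor & 3]

    # Skip the descriptor, the optional window descriptor, and the dictionary ID.
    cursor = pos + 5 + (0 if single_segment else 1) + dict_id_size
    fcs_size = (single_segment, 2, 4, 8)[fcs_flag]
    content_size = None
    if fcs_size:
        content_size = int.from_bytes(buf[cursor:cursor + fcs_size], "little")
        if fcs_size == 2:
            content_size += 256  # the 2-byte encoding is offset by 256
    cursor += fcs_size

    # Walk the block headers to find where this frame ends.
    while True:
        if cursor + 3 > len(buf):
            raise ValueError("truncated frame")
        header = int.from_bytes(buf[cursor:cursor + 3], "little")
        last, block_type, block_size = header & 1, (header >> 1) & 3, header >> 3
        if block_type == 3:
            raise ValueError("reserved block type")
        # An RLE block (type 1) stores a single byte regardless of Block_Size.
        cursor += 3 + (1 if block_type == 1 else block_size)
        if last:
            break
    return cursor - pos + 4 * has_checksum, content_size


def total_content_size(buf):
    """Sum the recorded content sizes of every frame in buf."""
    total, pos = 0, 0
    while pos < len(buf):
        compressed, content = frame_sizes(buf, pos)
        if content is None:
            raise ValueError("frame does not record its content size")
        total += content
        pos += compressed
    return total


def stored_frame(payload):
    """Hypothetical helper: wrap payload (< 256 bytes) in one raw-block frame."""
    assert len(payload) < 256
    descriptor = 0x20  # Single_Segment_flag set, so the content size is one byte
    block_header = (len(payload) << 3) | 1  # Block_Size, type 0 (raw), last block
    return (struct.pack("<IBB", ZSTD_MAGIC, descriptor, len(payload))
            + block_header.to_bytes(3, "little") + payload)


frame = stored_frame(b"Hello world!")
print(frame_sizes(frame))             # (21, 12)
print(total_content_size(frame * 2))  # 24
```

Reading the content size of only the first frame gives 12 here, which is why a destination buffer sized from that value is too small for the two-frame input. Summing over all frames gives the 24 bytes the decoder actually needs. A frame whose header omits the content size cannot be handled this way, so the sketch raises in that case.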

One application is that a buffer holding several contiguous chunks of a Zarr v3 shard could be decompressed in a single step.

  • Add support for multiple zstd frames in decompression
  • Add release notes

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

codecov bot commented Jun 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.99%. Comparing base (506c89b) to head (d67fba9).

❌ Your project check has failed because the head coverage (90.99%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (506c89b) and HEAD (d67fba9). Click for more details.

HEAD has 10 fewer uploads than BASE: BASE (506c89b) has 15, HEAD (d67fba9) has 5.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #757      +/-   ##
==========================================
- Coverage   99.96%   90.99%   -8.98%     
==========================================
  Files          64       64              
  Lines        2789     2809      +20     
==========================================
- Hits         2788     2556     -232     
- Misses          1      253     +252     
Files with missing lines Coverage Δ
numcodecs/tests/test_pyzstd.py 100.00% <ø> (ø)
numcodecs/tests/test_zstd.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

@mkitti
Contributor Author

mkitti commented Jun 28, 2025

After this pull request

>>> from numcodecs import Zstd
>>> codec = Zstd()
>>> encoded = codec.encode(b"Hello ")  # avoid shadowing the built-in bytes
>>> codec.decode(encoded)
b'Hello '
>>> codec.decode(encoded + encoded)
b'Hello Hello '

@mkitti mkitti force-pushed the mkitti-multi-frame-zstd-clean branch 2 times, most recently from 09cc46c to 42ec173 Compare July 17, 2025 03:16
@mkitti mkitti force-pushed the mkitti-multi-frame-zstd-clean branch from 42ec173 to d67fba9 Compare July 17, 2025 03:42