Allow Zstandard to decompress multiple concatenated frames #757

mkitti · 2025-06-28T11:28:44Z

Currently, numcodecs assumes that a buffer (a pointer and a size) containing Zstandard-compressed data consists of only a single frame. However, this is not necessarily the case: a single buffer may contain multiple frames. A buffer with multiple compressed frames is easily constructed by concatenating two encoded buffers. When decoding such a buffer, numcodecs.Zstd currently reports that the destination buffer is too small.

In [1]: import numcodecs

In [2]: codec = numcodecs.Zstd()

In [3]: hello_world = codec.encode(b"Hello world!")

In [4]: codec.decode(hello_world*2)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 1
----> 1 codec.decode(hello_world*2)

File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:221, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: b'Destination buffer is too small'

It does this even when given an output buffer of sufficient size.

In [5]: dest_buffer = bytearray(len(b"Hello world!")*2)

In [6]: codec.decode(hello_world*2, out=dest_buffer)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 codec.decode(hello_world*2, out=dest_buffer)

File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:221, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: b'Destination buffer is too small'

The expectation is that the decompressed output should simply be the concatenation of the original uncompressed buffers.

In [7]: import pyzstd

In [8]: pyzstd.decompress(hello_world*2)
Out[8]: b'Hello world!Hello world!'

The flaw is that numcodecs.Zstd currently runs ZSTD_getFrameContentSize only once to get the uncompressed size of the first frame. It does not consider whether there may be other frames in the source buffer.

This pull request introduces a Cython function findTotalContentSize which iterates through all the frames in the buffer, gets the sizes of the uncompressed data from each frame, and then returns the sum. This allows numcodecs.Zstd to decompress the entire buffer.

>>> import numcodecs
>>> codec = numcodecs.Zstd()
>>> hello_world = codec.encode(b"Hello World!")
>>> codec.decode(hello_world*2)
b'Hello World!Hello World!'
>>> out = bytearray(len("Hello World!")*2)
>>> codec.decode(hello_world*2, out=out)
bytearray(b'Hello World!Hello World!')
>>> out
bytearray(b'Hello World!Hello World!')
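
The frame walk described above can be sketched in pure Python. This is an illustration of the idea only, not the Cython code in this pull request: `frame_sizes`, `find_total_content_size`, and `raw_frame` are hypothetical names, and the header parsing follows the Zstandard frame format (RFC 8878) by hand rather than calling libzstd. `raw_frame` builds a frame holding a single raw (uncompressed) block so that the sketch runs without any Zstandard library installed.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528           # magic number of a regular Zstandard frame
SKIPPABLE_MAGIC = 0x184D2A50      # skippable frames use 0x184D2A50..0x184D2A5F
SKIPPABLE_MASK = 0xFFFFFFF0


def frame_sizes(buf, pos=0):
    """Return (compressed_size, content_size) of the frame starting at pos.

    content_size is None when the frame header does not record it.
    Hypothetical helper; assumes buf is a bytes object.
    """
    (magic,) = struct.unpack_from("<I", buf, pos)
    if magic & SKIPPABLE_MASK == SKIPPABLE_MAGIC:
        # Skippable frame: 4-byte magic, 4-byte length, then user data.
        (user_size,) = struct.unpack_from("<I", buf, pos + 4)
        return 8 + user_size, 0
    if magic != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")

    descriptor = buf[pos + 4]
    fcs_flag = descriptor >> 6
    single_segment = (descriptor >> 5) & 1
    has_checksum = (descriptor >> 2) & 1
    dict_id_size = (0, 1, 2, 4)[descriptor & 3]
    fcs_size = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]

    # Skip the descriptor, the optional window descriptor, and the dictionary ID.
    cursor = pos + 5 + (0 if single_segment else 1) + dict_id_size
    content_size = None
    if fcs_size:
        content_size = int.from_bytes(buf[cursor:cursor + fcs_size], "little")
        if fcs_size == 2:
            content_size += 256   # the 2-byte field is stored with an offset
        cursor += fcs_size

    # Walk the 3-byte block headers until the block marked as last.
    while True:
        if cursor + 3 > len(buf):
            raise ValueError("truncated frame")
        header = int.from_bytes(buf[cursor:cursor + 3], "little")
        last, block_type, block_size = header & 1, (header >> 1) & 3, header >> 3
        if block_type == 3:
            raise ValueError("reserved block type")
        # An RLE block (type 1) stores a single byte regardless of block_size.
        cursor += 3 + (1 if block_type == 1 else block_size)
        if last:
            break

    end = cursor + 4 * has_checksum
    if end > len(buf):
        raise ValueError("truncated frame")
    return end - pos, content_size


def find_total_content_size(buf):
    """Sum the content sizes of all concatenated frames, or None if any is unknown."""
    total, pos = 0, 0
    while pos < len(buf):
        compressed_size, content_size = frame_sizes(buf, pos)
        if content_size is None:
            return None
        total += content_size
        pos += compressed_size
    return total


def raw_frame(payload):
    """Hand-build a frame that stores payload (< 256 bytes) in one raw block."""
    assert len(payload) < 256
    # Descriptor 0x20: single segment, 1-byte content size, no checksum.
    header = struct.pack("<IBB", ZSTD_MAGIC, 0x20, len(payload))
    block_header = (1 | (len(payload) << 3)).to_bytes(3, "little")  # last, raw
    return header + block_header + payload


frame = raw_frame(b"Hello world!")
print(frame_sizes(frame))                  # (21, 12)
print(find_total_content_size(frame * 2))  # 24
```

In C, libzstd exposes `ZSTD_findFrameCompressedSize` and `ZSTD_getFrameContentSize` for the two quantities that `frame_sizes` computes here. Returning `None` in the sketch corresponds to the `ZSTD_CONTENTSIZE_UNKNOWN` case, where the total output size cannot be determined up front.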

One application is decompressing, in a single step, a buffer that holds multiple contiguous chunks of a Zarr v3 shard.

Add support for multiple zstd frames in decompression
Add release notes


TODO:

Unit tests and/or doctests in docstrings
Tests pass locally
Docstrings and API docs for any new/modified user-facing classes and functions
Changes documented in docs/release.rst
Docs build locally
GitHub Actions CI passes
Test coverage to 100% (Codecov passes)

codecov · 2025-06-28T11:30:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.99%. Comparing base (506c89b) to head (d67fba9).

❌ Your project check has failed because the head coverage (90.99%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (506c89b) and HEAD (d67fba9): HEAD has 10 fewer uploads than BASE (15 uploads on BASE, 5 on HEAD).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #757      +/-   ##
==========================================
- Coverage   99.96%   90.99%   -8.98%     
==========================================
  Files          64       64              
  Lines        2789     2809      +20     
==========================================
- Hits         2788     2556     -232     
- Misses          1      253     +252

Files with missing lines	Coverage Δ
numcodecs/tests/test_pyzstd.py	`100.00% <ø> (ø)`
numcodecs/tests/test_zstd.py	`100.00% <100.00%> (ø)`

... and 2 files with indirect coverage changes


mkitti · 2025-06-28T11:30:56Z

After this pull request:

>>> from numcodecs import Zstd
>>> codec = Zstd()
>>> encoded = codec.encode(b"Hello ")
>>> codec.decode(encoded)
b'Hello '
>>> codec.decode(encoded + encoded)
b'Hello Hello '

mkitti mentioned this pull request Jul 3, 2025

Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707

Merged

7 tasks

mkitti force-pushed the mkitti-multi-frame-zstd-clean branch 2 times, most recently from 09cc46c to 42ec173 Compare July 17, 2025 03:16

mkitti added 9 commits July 16, 2025 23:42

Add support for multiple zstd frames in decompression

8ebb7a2

Add release notes

28d92ed

Format with ruff

5c7fecb

Address MSVC type errors

7835f41

Explicitly declare return type of findTotalContentSize

2096a42

Mark multiframe pyzstd tests as now passing

c61ccce

Format with ruff

786b09c

Test concatenated frames of known and unknown sizes

f365747

Add docstring for findTotalContentSize

d67fba9

mkitti force-pushed the mkitti-multi-frame-zstd-clean branch from 42ec173 to d67fba9 Compare July 17, 2025 03:42
