Migrating to zarr v3 #4014

@JulienBrn

Description

Currently, SpikeInterface uses zarr v2, whose API is not compatible with zarr v3; v3 offers different features (mainly sharding).
My motivation for sharding is that I like being able to load a single channel quickly, so I usually set channel_chunk_size fairly low, but by doing so I end up with a huge number of files (384x2x3600 for a 2h recording...).

I've tried to migrate the code of zarrextractors.py, and here are my conclusions:

  • migrating to v3 is mostly easy: change group.create_dataset(key=name, ..., data=data) to group.create_array(key=name, ..., shape=data.shape, dtype=data.dtype) followed by group[name][:] = data
  • we have a problem with structured numpy arrays, which were handled in v2 but are not in v3. Support seems to be in the process of being added to zarr, see this PR. We can either rely on that future implementation or define a custom convention for handling them. One solution would be to create a group with a name like "_structuredarraygrp[name]" and put each individual field array in it.
  • However, reading the structured arrays back is less obvious: either we give up zarr's laziness and just build the numpy array from the group each time, or we need to find a way to provide a lazy array that handles structured dtypes. Perhaps by using dask?
  • adding sharding is extremely easy, as it's just a parameter of create_array. However, the functions that process arguments need to be modified.

Is there enough interest in such a migration for me to submit a PR?

Metadata


Labels: core (Changes to core module)
