Labels
core (Changes to core module)
Description
Currently, SpikeInterface uses zarr v2, whose API is not compatible with zarr v3, which offers different features (mainly sharding).
My motivation for sharding is that I like being able to load a single channel fairly fast, so I usually set channel_chunk_size quite low; but doing so produces a huge number of files (384 x 2 x 3600 for a 2 h recording...).
I've tried to migrate the code of zarrextractors.py, and here are my conclusions:
- Migrating to v3 is mostly easy: change
  group.create_dataset(key=name, ..., data=data)
  to
  group.create_array(key=name, ..., shape=data.shape, dtype=data.dtype)
  followed by
  group[name][:] = data
- We have a problem with structured numpy arrays, which were handled in v2 but are not in v3. These seem to be in the process of being handled by zarr, see this PR. We can either rely on that future implementation or adopt a custom convention for handling them. One solution would be to create a group with a name like "_structuredarraygrp[name]" and put each individual field array in it.
- However, reading the structured arrays back is less obvious: either we give up zarr's laziness and rebuild the numpy array from the group each time, or we need to find a way to provide a lazy array that handles structured dtypes. Perhaps by using dask?
- Adding sharding is extremely easy, as it's just a parameter of create_array. However, the functions that process arguments need to be modified.
Is there any interest in such a migration (enough for me to submit a PR)?
alejoe91