Hello @shoyer @pwolfram @mrocklin @rabernat,
I was trying to write a design/requirements doc with reference to the Columbia meetup,
and I had a few queries on which I wanted your input (basically to ask whether
they make sense or not!)
- If you serialize a labeled n-d data array using netCDF or HDF5, it gets written into a single file, which is not really a good option if you want to eventually do distributed processing of the data. Things like HDFS/Lustre can split files, but that is not really what we want. How do you think this issue could be solved within the xarray+dask framework? (A sketch of the manual splitting that is possible today follows after this list.)
  - Is it a matter of adding some code to the dataset.to_netcdf() method, or adding a new method that would split the DataArray (based on some user guidelines) into multiple files?
  - Or does it make more sense to add a new serialization format like Zarr?
- Continuing along similar lines, how does xarray+dask currently decide how to distribute the workload between dask workers? Are there any heuristics to handle data locality, or does experience say that network I/O is fast enough that this is not an issue? I'm asking this question because of this article by Matt: http://blaze.pydata.org/blog/2015/10/28/distributed-hdfs/ (the second sketch below shows the chunk-to-task mapping I have in mind).
- If such locality handling is desirable, how would one go about implementing it?
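
For the first question, here is a minimal sketch of the manual splitting that already works, assuming a recent xarray where `save_mfdataset()` is available; the dimension, piece size, and file names are purely illustrative:

```python
import numpy as np
import xarray as xr

# Toy dataset standing in for a labeled n-d array (names are illustrative).
ds = xr.Dataset(
    {"temperature": (("time", "x"), np.random.rand(10, 4))},
    coords={"time": np.arange(10), "x": np.arange(4)},
)

# Split along "time" and write each piece to its own netCDF file, so that
# downstream distributed readers can open the pieces in parallel.
pieces = [ds.isel(time=slice(i, i + 5)) for i in range(0, ds.sizes["time"], 5)]
paths = [f"piece-{i}.nc" for i in range(len(pieces))]
xr.save_mfdataset(pieces, paths)

# Reassemble lazily as a single dask-backed Dataset.
reopened = xr.open_mfdataset("piece-*.nc", combine="by_coords")
```

As I understand it, Zarr would make this chunk-per-object layout the on-disk format itself, rather than something the user assembles by hand.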
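
For the locality question, a minimal sketch of how I understand the current chunk-to-task mapping under a distributed scheduler; it reuses the `piece-*.nc` files from the sketch above, and the claim that scheduling does not consult file placement is my assumption, not verified behavior:

```python
import xarray as xr
from dask.distributed import Client

client = Client()  # start a local scheduler plus workers

# chunks= controls the split: each chunk becomes one task in the dask graph.
ds = xr.open_mfdataset("piece-*.nc", combine="by_coords", chunks={"time": 5})

# The reduction is expressed as a graph over those chunks; compute() hands
# the graph to the scheduler, which assigns each task to a worker.
result = ds["temperature"].mean(dim="time").compute()
print(result)
```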