
Some queries #1173

Description

@JoyMonteiro

Hello @shoyer, @pwolfram, @mrocklin, @rabernat,

I am trying to write a design/requirements document in reference to the Columbia meetup,
and I have a few queries on which I would like your input (basically, to ask whether
they make sense or not!):

  1. If you serialize a labeled n-d data array using netCDF or HDF5, it gets written into
    a single file, which is not really a good option if you eventually want to do distributed
    processing of the data. Filesystems like HDFS or Lustre can split files across nodes, but
    that is not really what we want. How do you think this issue could be solved within the
    xarray+dask framework?
    • Is it a matter of adding some code to the Dataset.to_netcdf() method, or of
      adding a new method that splits the DataArray (based on some user guidelines) into
      multiple files? (A sketch of this option follows the list.)
    • Or does it make more sense to add a new serialization format like Zarr?
  2. Continuing along similar lines, how does xarray+dask currently decide how to distribute
    the workload between dask workers? Are there any heuristics to handle data locality, or
    does experience say that network I/O is fast enough that this is not an issue? I'm asking
    because of this article by Matt: http://blaze.pydata.org/blog/2015/10/28/distributed-hdfs/
    • If data locality is desirable, how would one go about implementing it? (The second
      sketch below shows the setup in question.)
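For the first question, here is a minimal sketch of the "new method" option, under the assumption that splitting evenly along a single dimension is acceptable. The helper name `split_to_netcdf` and the `part-XXX.nc` file pattern are placeholders, not a proposed API; it leans on xarray's existing `save_mfdataset` for the actual writing:

```python
# A minimal sketch, not a settled design: split a Dataset evenly along one
# dimension and write each piece to its own netCDF file. The helper name
# and the "part-XXX.nc" pattern are hypothetical.
import numpy as np
import xarray as xr

def split_to_netcdf(ds, dim="time", n_files=4, pattern="part-{:03d}.nc"):
    # compute n_files contiguous index ranges along `dim`
    edges = np.linspace(0, ds.sizes[dim], n_files + 1, dtype=int)
    pieces = [ds.isel({dim: slice(int(lo), int(hi))})
              for lo, hi in zip(edges[:-1], edges[1:])]
    paths = [pattern.format(i) for i in range(n_files)]
    # save_mfdataset writes each Dataset in `pieces` to the matching path
    xr.save_mfdataset(pieces, paths)
    return paths

# the pieces can later be read back as one logical, chunked Dataset:
# ds = xr.open_mfdataset("part-*.nc")
```

For the Zarr bullet, no manual splitting would be needed at all, since a Zarr store already keeps each chunk as a separate object on disk.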
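For the second question, a minimal sketch of the setup being asked about, so the terms line up: with `chunks=`, each variable becomes a dask array, and the per-chunk tasks of that array are what the distributed scheduler places on workers. The scheduler address, file name, and variable name below are placeholders:

```python
# A minimal sketch of the xarray + dask.distributed setup in question;
# "tcp://scheduler:8786", "data.nc", and "temperature" are placeholders.
import xarray as xr
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # connect to a running scheduler

# chunks= backs each variable with a dask array, one task per chunk
ds = xr.open_dataset("data.nc", chunks={"time": 100})

# the graph is built lazily; on .compute() the distributed scheduler
# decides which worker runs each chunk-level task
mean = ds["temperature"].mean(dim="time").compute()
```

Whether the scheduler can or should also account for where the file's bytes physically live (as in the HDFS article above) is exactly the open part of this question.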
