
Some queries #1173

Description

@JoyMonteiro

Hello @shoyer, @pwolfram, @mrocklin, @rabernat,

I am trying to write a design/requirements document in reference to the Columbia meetup,
and I have a few queries on which I would like your input (basically, to ask whether
they make sense or not!):

  1. If you serialize a labeled n-d data array using netCDF or HDF5, it gets written into
    a single file, which is not really a good option if you eventually want to do distributed
    processing of the data. Filesystems like HDFS or Lustre can split files across nodes, but
    that is not really what we want. How do you think this issue could be solved within the
    xarray+dask framework?
    • Is it a matter of adding some code to the Dataset.to_netcdf() method, or of
      adding a new method that splits the DataArray (based on some user guidelines) into
      multiple files? (A sketch of this option follows the list.)
    • Or does it make more sense to add a new serialization format like Zarr?
  2. Continuing along similar lines, how does xarray+dask currently decide how to distribute
    the workload between dask workers? Are there any heuristics to handle data locality, or
    does experience say that network I/O is fast enough that this is not an issue? I'm asking
    because of this article by Matt: http://blaze.pydata.org/blog/2015/10/28/distributed-hdfs/
    • If data locality is desirable, how would one go about implementing it? (The second
      sketch below shows the setup in question.)
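For the first question, here is a minimal sketch of the "new method" option, under the assumption that splitting evenly along a single dimension is acceptable. The helper name `split_to_netcdf` and the `part-XXX.nc` file pattern are placeholders, not a proposed API; it leans on xarray's existing `save_mfdataset` for the actual writing:

```python
# A minimal sketch, not a settled design: split a Dataset evenly along one
# dimension and write each piece to its own netCDF file. The helper name
# and the "part-XXX.nc" pattern are hypothetical.
import numpy as np
import xarray as xr

def split_to_netcdf(ds, dim="time", n_files=4, pattern="part-{:03d}.nc"):
    # compute n_files contiguous index ranges along `dim`
    edges = np.linspace(0, ds.sizes[dim], n_files + 1, dtype=int)
    pieces = [ds.isel({dim: slice(int(lo), int(hi))})
              for lo, hi in zip(edges[:-1], edges[1:])]
    paths = [pattern.format(i) for i in range(n_files)]
    # save_mfdataset writes each Dataset in `pieces` to the matching path
    xr.save_mfdataset(pieces, paths)
    return paths

# the pieces can later be read back as one logical, chunked Dataset:
# ds = xr.open_mfdataset("part-*.nc")
```

For the Zarr bullet, no manual splitting would be needed at all, since a Zarr store already keeps each chunk as a separate object on disk.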
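For the second question, a minimal sketch of the setup being asked about, so the terms line up: with `chunks=`, each variable becomes a dask array, and the per-chunk tasks of that array are what the distributed scheduler places on workers. The scheduler address, file name, and variable name below are placeholders:

```python
# A minimal sketch of the xarray + dask.distributed setup in question;
# "tcp://scheduler:8786", "data.nc", and "temperature" are placeholders.
import xarray as xr
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # connect to a running scheduler

# chunks= backs each variable with a dask array, one task per chunk
ds = xr.open_dataset("data.nc", chunks={"time": 100})

# the graph is built lazily; on .compute() the distributed scheduler
# decides which worker runs each chunk-level task
mean = ds["temperature"].mean(dim="time").compute()
```

Whether the scheduler can or should also account for where the file's bytes physically live (as in the HDFS article above) is exactly the open part of this question.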
