
Conversation

jsignell
Contributor

@jsignell jsignell commented Jul 22, 2016

After the conversation #432

d['coords'].update({k: {'data': self[k].values.tolist(),
                        'dims': list(self[k].dims),
                        'attrs': dict(self[k].attrs)}})
if hasattr(self, 'data_vars'):
Member

These sorts of checks are best avoided, if possible. I would simply write two implementations of to_dict, one on Dataset and one on DataArray.

Contributor Author

That makes sense. I changed it below.

@shoyer
Member

shoyer commented Jul 26, 2016

This looks like a great start, but to be really useful it needs a couple other things:

  • corresponding Dataset.from_dict and DataArray.from_dict class methods for creating xarray datasets from the output of to_dict
  • unit tests verifying that these methods output the expected thing, and that a dataset can be faithfully round-tripped through to_dict/from_dict
  • a section in the "Serialization and IO" docs describing how to use these methods
  • what's new note

Please ask if you need guidance on where to start for any of these! This will be a very welcome addition to xarray :).

@jsignell
Contributor Author

jsignell commented Jul 27, 2016

Ok, I wrote a Dataset.from_dict class method. I imagine the DataArray one will look pretty similar, so I just wanted to see what you think. The main issue I see is that time doesn't round-trip. I wasn't sure whether the user should need to set a parse_date flag with the dim name, or whether the function should try to convert to time any dim with the string 'time' in it?

    @classmethod
    def from_dict(cls, d):
        """
        Convert a dictionary into an xarray.Dataset.
        """
        obj = cls()

        dims = OrderedDict([(k, d['coords'][k]) for k in d['dims']])
        for dim, dim_d in dims.items():
            obj[dim] = (dim_d['dims'], dim_d['data'], dim_d['attrs'])

        for var, var_d in d['data_vars'].items():
            obj[var] = (var_d['dims'], var_d['data'], var_d['attrs'])

        # what if coords aren't dims?
        coords = set(d['coords'].keys()) - set(d['dims'])
        for coord in coords:
            coord_d = d['coords'][coord]
            obj[coord] = (coord_d['dims'], coord_d['data'], coord_d['attrs'])
        obj = obj.set_coords(coords)

        obj.attrs.update(d['attrs'])

        return obj

Regarding unit tests, I haven't ever written one before but I would be happy to try (I know they are an important part of good development).

@jsignell
Contributor Author

Ok, I have committed a first pass at unit tests. I purposefully wrote a failing time test.

@shoyer
Member

shoyer commented Jul 27, 2016

I think the datetime issue is a numpy bug related to numpy/numpy#7619

We can work around this by casting to datetime64[us]:

(Pdb) ds['t'].values.tolist()
[1356998400000000000, 1357084800000000000, 1357171200000000000, 1357257600000000000, 1357344000000000000, 1357430400000000000, 1357516800000000000, 1357603200000000000, 1357689600000000000, 1357776000000000000]
(Pdb) ds['t'].values.astype('datetime64[us]').tolist()
[datetime.datetime(2013, 1, 1, 0, 0), datetime.datetime(2013, 1, 2, 0, 0), datetime.datetime(2013, 1, 3, 0, 0), datetime.datetime(2013, 1, 4, 0, 0), datetime.datetime(2013, 1, 5, 0, 0), datetime.datetime(2013, 1, 6, 0, 0), datetime.datetime(2013, 1, 7, 0, 0), datetime.datetime(2013, 1, 8, 0, 0), datetime.datetime(2013, 1, 9, 0, 0), datetime.datetime(2013, 1, 10, 0, 0)]

The same issue holds for timedelta:

(Pdb) (ds['t'] - ds['t'][0]).values.tolist()
[0, 86400000000000, 172800000000000, 259200000000000, 345600000000000, 432000000000000, 518400000000000, 604800000000000, 691200000000000, 777600000000000]
(Pdb) (ds['t'] - ds['t'][0]).values.astype('timedelta64[us]').tolist()
[datetime.timedelta(0), datetime.timedelta(1), datetime.timedelta(2), datetime.timedelta(3), datetime.timedelta(4), datetime.timedelta(5), datetime.timedelta(6), datetime.timedelta(7), datetime.timedelta(8), datetime.timedelta(9)]

We can work around these by checking for datetime64 or timedelta64 dtypes (use np.issubdtype(values.dtype, np.datetime64)) and using astype to convert to datetime64[us]/timedelta64[us].
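A minimal sketch of that workaround (the helper name is illustrative, not xarray's actual code):

```python
import numpy as np

def encode_values(values):
    # Cast datetime64/timedelta64 arrays to microsecond precision so
    # that tolist() yields datetime.datetime/datetime.timedelta
    # objects instead of raw nanosecond integers.
    if np.issubdtype(values.dtype, np.datetime64):
        return values.astype('datetime64[us]').tolist()
    if np.issubdtype(values.dtype, np.timedelta64):
        return values.astype('timedelta64[us]').tolist()
    return values.tolist()
```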

}
},
'attrs': {},
'dims': ['t'],
Member

Maybe save dims on a dataset as a dict instead?

Contributor Author

What is the value of that? For DataArray it will have to be a list or an OrderedDict anyways so that the shape of the data matches the shape of the dims.

Member

It's true -- dims is inconsistent between Dataset and DataArray. But short of switching dims on DataArray to an OrderedDict (maybe not a bad idea, but a separate discussion), I think the serialization format should be consistent with xarray's data model.

@shoyer
Member

shoyer commented Jul 27, 2016

One big thing that from_dict needs is validation. If the input dict does not match the expected format (e.g., there is a missing field), the user should get a sensible error message so they understand what went wrong.

"""
Convert a dictionary into an xarray.Dataset.
"""
obj = cls()
Member

Because of the way that Dataset works internally, it can be much slower (for a large number of variables) to add them incrementally to a Dataset rather than to call the Dataset constructor once. Also, the behavior could differ, because the variables are aligned incrementally rather than all at once.

So I would consider building up ordered dictionaries for data_vars, coords and attrs and calling cls(data_vars, coords, attrs) at the end.
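A sketch of that restructuring, with `constructor` standing in for the Dataset constructor (names are illustrative):

```python
from collections import OrderedDict

def from_dict_sketch(d, constructor):
    # Build the full data_vars/coords mappings first, then construct
    # once, so variables are aligned in a single pass rather than
    # incrementally as they are assigned to an empty object.
    def decode(mapping):
        return OrderedDict(
            (k, (v['dims'], v['data'], v.get('attrs')))
            for k, v in mapping.items())

    return constructor(decode(d.get('data_vars', {})),
                       coords=decode(d.get('coords', {})),
                       attrs=d.get('attrs'))
```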

@shoyer
Member

shoyer commented Jul 27, 2016

It would also be good to test round-tripping arrays with some NaN and NaT values as well.

@jsignell
Contributor Author

jsignell commented Jul 27, 2016

Ok, I made the changes that you suggested. I still need to work on the from_dict validation to prompt users to give more specific dicts.

@jsignell
Contributor Author

jsignell commented Jul 28, 2016

@shoyer I think I have done all the things that you mentioned and I added DataArray.from_dict().

try:
    coords = OrderedDict([(var[0], (var[1]['dims'],
                                    var[1]['data'],
                                    var[1].get('attrs')))
                          for var in d['coords'].items()])
Member

Tuple unpacking is more readable than using indexing: for k, v in d['coords'].items()
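Concretely, the suggested rewrite (same result, easier to read):

```python
d = {'coords': {'t': {'dims': 't', 'data': [0, 1, 2], 'attrs': {}}}}

# indexing into the (key, value) pairs:
coords = {var[0]: (var[1]['dims'], var[1]['data'], var[1].get('attrs'))
          for var in d['coords'].items()}

# tuple unpacking, as suggested above:
coords = {k: (v['dims'], v['data'], v.get('attrs'))
          for k, v in d['coords'].items()}
```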

@jsignell
Contributor Author

Ok, so I merged with the current master and added documentation and a what's-new note. I hope I did that right. @shoyer I am really new to contributing so thanks for all your help. Let me know if anything needs changing.

remains unchanged. Because the internal design of xarray is still being
refined, we make no guarantees (at this point) that objects pickled with
this version of xarray will work in future versions.

Member

add a more specific target here, e.g.,

.. _dictionary io:

@shoyer
Member

shoyer commented Aug 11, 2016

I have a few minor suggestions, but otherwise this looks very nice!

To ensure that the documentation subpages are built for these methods (which makes the sphinx link work), you need to add them to doc/api.rst. These should go in the IO / Conversion section.

If you haven't tested the docs locally with make html, take a look at the development version of the docs (http://xarray.pydata.org/en/latest/) about 5-10 minutes after I merge this to make sure everything looks right. RST can be tricky to get right.

'attrs': {'title': 'air temperature'},
'dims': 't',
'data': x,
d = {'coords': {'t': {'dims': 't', 'data': t, \
Contributor Author

Is this bad form? It renders more cleanly in the html docs.

Member

I would use :: and indentation, which will format this as code in the HTML docs: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#other-points-to-keep-in-mind

Member

The \n and the like are a little bad in the docstring because they aren't valid Python code.

Contributor Author

That makes sense

@jsignell
Contributor Author

I made the docs locally and they look how I expect.

@jsignell
Contributor Author

Done!

@shoyer shoyer merged commit b708f71 into pydata:master Aug 11, 2016
@shoyer
Member

shoyer commented Aug 11, 2016

Thank you @jsignell -- really nice work here!

@kwilcox

kwilcox commented Oct 17, 2016

@jsignell was the intention to have only Python scalars/lists in the dictionary (no numpy generics or arrays)? Right now attribute values are not being converted, which results in a mixed dict with the data as a list but the attributes still in numpy form. Thoughts?

@jsignell
Contributor Author

jsignell commented Oct 17, 2016

@kwilcox, I hadn't thought much about this. I guess the intention was to have only python scalars/lists, but I don't know if we want to get in the business of converting attribute values since there are so many options for what they could be. What do you think makes the most sense?

@kwilcox

kwilcox commented Oct 17, 2016

Before I looked at the code I assumed it was going to convert... but that's just me!

As it stands a custom encoder is needed to get a JSON dump from the output, which will be a fairly common use case for this function. See https://gist.github.com/kwilcox/c41834297b1a3b732cae3ee16621f6d0.

At the least, maybe add a little note in the documentation on how to dump the to_dict output to JSON (you also need to include decode_times=False because datetime objects are not JSON serializable either).
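A minimal sketch of such a custom encoder (illustrative, not the code from the linked gist):

```python
import json

class NumpyJSONEncoder(json.JSONEncoder):
    # Fall back to .tolist()/.item() so numpy arrays and scalars in a
    # to_dict result become JSON-serializable Python objects.
    def default(self, obj):
        if hasattr(obj, 'tolist'):
            return obj.tolist()
        if hasattr(obj, 'item'):
            return obj.item()
        return json.JSONEncoder.default(self, obj)
```

With that, something like `json.dumps(ds.to_dict(), cls=NumpyJSONEncoder)` works, provided the times were left undecoded.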

@shoyer
Member

shoyer commented Oct 17, 2016

I think we should try to convert numpy scalars/arrays, because otherwise the data won't convert directly to JSON. Possibly could take a duck typing approach of looking for .item() or .tolist() methods?
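That duck-typing approach might look something like this (hypothetical helper name):

```python
def attrs_to_native(attrs):
    # Convert numpy values in an attrs dict to plain Python objects,
    # duck-typing on .tolist()/.item() rather than importing numpy.
    out = {}
    for k, v in attrs.items():
        if hasattr(v, 'tolist'):      # ndarrays and numpy scalars
            out[k] = v.tolist()
        elif hasattr(v, 'item'):      # other 0-d array-likes
            out[k] = v.item()
        else:
            out[k] = v
    return out
```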

@jsignell
Contributor Author

jsignell commented Oct 17, 2016

@shoyer that makes sense. @kwilcox, so times should be strings if decode_times=False? Should I do a new pull request?

@shoyer
Member

shoyer commented Oct 17, 2016

We should recommend decode_times=False in the docs, but I would still produce datetime/timedelta objects by default -- this feels like something that should be left up to users.

@jsignell
Contributor Author

@shoyer is this something I should be working on? I am happy to, just don't know how this normally goes.

@shoyer
Member

shoyer commented Oct 17, 2016

Up to you, but yes it would be greatly appreciated if you could work on this.

