
Conversation

jsignell
Contributor

@jsignell jsignell commented Jul 22, 2016

After the conversation #432

d['coords'].update({k: {'data': self[k].values.tolist(),
                        'dims': list(self[k].dims),
                        'attrs': dict(self[k].attrs)}})
if hasattr(self, 'data_vars'):
Member

These sorts of checks are best avoided, if possible. I would simply write two implementations of to_dict, one on Dataset and one on DataArray.

Contributor Author

That makes sense. I changed it below.

@shoyer
Member

shoyer commented Jul 26, 2016

This looks like a great start, but to be really useful it needs a couple other things:

  • corresponding Dataset.from_dict and DataArray.from_dict class methods for creating xarray datasets from the output of to_dict
  • unit tests verifying that these methods output the expected thing, and that a dataset can be faithfully round-tripped through to_dict/from_dict
  • a section in the "Serialization and IO" docs describing how to use these methods
  • what's new note

Please ask if you need guidance on where to start for any of these! This will be a very welcome addition to xarray :).

@jsignell
Contributor Author

jsignell commented Jul 27, 2016

Ok, I wrote a Dataset.from_dict class method. I imagine the DataArray one will look pretty similar, so I just wanted to see what you think. The main issue I see is that time doesn't round-trip. I wasn't sure whether the user should need to set a parse_date flag with the dim name, or whether the function should try to convert to time any dim with the string 'time' in it?

    @classmethod
    def from_dict(cls, d):
        """
        Convert a dictionary into an xarray.Dataset.
        """
        obj = cls()

        dims = OrderedDict([(k, d['coords'][k]) for k in d['dims']])
        for dim, dim_d in dims.items():
            obj[dim] = (dim_d['dims'], dim_d['data'], dim_d['attrs'])

        for var, var_d in d['data_vars'].items():
            obj[var] = (var_d['dims'], var_d['data'], var_d['attrs'])

        # what if coords aren't dims?
        coords = set(d['coords'].keys()) - set(d['dims'])
        for coord in coords:
            coord_d = d['coords'][coord]
            obj[coord] = (coord_d['dims'], coord_d['data'], coord_d['attrs'])
        obj = obj.set_coords(coords)

        obj.attrs.update(d['attrs'])

        return obj

Regarding unit tests, I haven't ever written one before but I would be happy to try (I know they are an important part of good development).

@jsignell
Contributor Author

Ok, I have committed a first pass at unit tests. I purposefully wrote a failing time test.

@shoyer
Member

shoyer commented Jul 27, 2016

I think the datetime issue is a numpy bug related to numpy/numpy#7619

We can work around this by casting to datetime64[us]:

(Pdb) ds['t'].values.tolist()
[1356998400000000000, 1357084800000000000, 1357171200000000000, 1357257600000000000, 1357344000000000000, 1357430400000000000, 1357516800000000000, 1357603200000000000, 1357689600000000000, 1357776000000000000]
(Pdb) ds['t'].values.astype('datetime64[us]').tolist()
[datetime.datetime(2013, 1, 1, 0, 0), datetime.datetime(2013, 1, 2, 0, 0), datetime.datetime(2013, 1, 3, 0, 0), datetime.datetime(2013, 1, 4, 0, 0), datetime.datetime(2013, 1, 5, 0, 0), datetime.datetime(2013, 1, 6, 0, 0), datetime.datetime(2013, 1, 7, 0, 0), datetime.datetime(2013, 1, 8, 0, 0), datetime.datetime(2013, 1, 9, 0, 0), datetime.datetime(2013, 1, 10, 0, 0)]

The same issue holds for timedelta:

(Pdb) (ds['t'] - ds['t'][0]).values.tolist()
[0, 86400000000000, 172800000000000, 259200000000000, 345600000000000, 432000000000000, 518400000000000, 604800000000000, 691200000000000, 777600000000000]
(Pdb) (ds['t'] - ds['t'][0]).values.astype('timedelta64[us]').tolist()
[datetime.timedelta(0), datetime.timedelta(1), datetime.timedelta(2), datetime.timedelta(3), datetime.timedelta(4), datetime.timedelta(5), datetime.timedelta(6), datetime.timedelta(7), datetime.timedelta(8), datetime.timedelta(9)]

We can work around these by checking for datetime64 or timedelta64 dtypes (use np.issubdtype(values.dtype, np.datetime64)) and using astype to convert to datetime64[us]/timedelta64[us].
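A minimal sketch of that workaround (the helper name is illustrative, not xarray's actual code):

```python
import numpy as np

def encode_values(values):
    # Cast datetime64/timedelta64 arrays to microsecond precision so
    # that tolist() yields datetime.datetime/datetime.timedelta
    # objects instead of raw nanosecond integers.
    if np.issubdtype(values.dtype, np.datetime64):
        return values.astype('datetime64[us]').tolist()
    if np.issubdtype(values.dtype, np.timedelta64):
        return values.astype('timedelta64[us]').tolist()
    return values.tolist()
```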

}
},
'attrs': {},
'dims': ['t'],
Member

Maybe save dims on a dataset as a dict instead?

Contributor Author

What is the value of that? For DataArray it will have to be a list or an OrderedDict anyways so that the shape of the data matches the shape of the dims.

Member

It's true -- dims is inconsistent between Dataset and DataArray. But short of switching dims on DataArray to an OrderedDict (maybe not a bad idea, but a separate discussion), I think the serialization format should be consistent with xarray's data model.

@shoyer
Member

shoyer commented Jul 27, 2016

One big thing that from_dict needs is validation. If the input dict does not match the expected format (e.g., there is a missing field), the user should get a sensible error message so they understand what went wrong.

"""
Convert a dictionary into an xarray.Dataset.
"""
obj = cls()
Member

Because of the way that Dataset works internally, it can be much slower (for a large number of variables) to add them incrementally to a Dataset rather than to call the Dataset constructor once. Also, the behavior could differ, because the variables are aligned incrementally rather than all at once.

So I would consider building up ordered dictionaries for data_vars, coords and attrs and calling cls(data_vars, coords, attrs) at the end.
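A sketch of that restructuring, with `constructor` standing in for the Dataset constructor (names are illustrative):

```python
from collections import OrderedDict

def from_dict_sketch(d, constructor):
    # Build the full data_vars/coords mappings first, then construct
    # once, so variables are aligned in a single pass rather than
    # incrementally as they are assigned to an empty object.
    def decode(mapping):
        return OrderedDict(
            (k, (v['dims'], v['data'], v.get('attrs')))
            for k, v in mapping.items())

    return constructor(decode(d.get('data_vars', {})),
                       coords=decode(d.get('coords', {})),
                       attrs=d.get('attrs'))
```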

@shoyer
Member

shoyer commented Jul 27, 2016

It would also be good to test round-tripping arrays with some NaN and NaT values as well.

@jsignell
Contributor Author

jsignell commented Jul 27, 2016

Ok, I made the changes that you suggested. I still need to work on the from_dict validation to prompt users to give more specific dicts.

@jsignell
Contributor Author

jsignell commented Jul 28, 2016

@shoyer I think I have done all the things that you mentioned and I added DataArray.from_dict().

try:
    coords = OrderedDict([(var[0], (var[1]['dims'],
                                    var[1]['data'],
                                    var[1].get('attrs')))
                          for var in d['coords'].items()])
Member

Tuple unpacking is more readable than using indexing: for k, v in d['coords'].items()
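Concretely, the suggested rewrite (same result, easier to read):

```python
d = {'coords': {'t': {'dims': 't', 'data': [0, 1, 2], 'attrs': {}}}}

# indexing into the (key, value) pairs:
coords = {var[0]: (var[1]['dims'], var[1]['data'], var[1].get('attrs'))
          for var in d['coords'].items()}

# tuple unpacking, as suggested above:
coords = {k: (v['dims'], v['data'], v.get('attrs'))
          for k, v in d['coords'].items()}
```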

@jsignell
Contributor Author

Ok, so I merged with the current master and added documentation and a what's-new note. I hope I did that right. @shoyer I am really new to contributing so thanks for all your help. Let me know if anything needs changing.

remains unchanged. Because the internal design of xarray is still being
refined, we make no guarantees (at this point) that objects pickled with
this version of xarray will work in future versions.

Member

add a more specific target here, e.g.,

.. _dictionary io:

@shoyer
Member

shoyer commented Aug 11, 2016

I have a few minor suggestions, but otherwise this looks very nice!

To ensure that the documentation subpages are built for these methods (which makes the sphinx link work), you need to add them to doc/api.rst. These should go in the IO / Conversion section.

If you haven't tested the docs locally with make html, take a look at the development version of the docs (http://xarray.pydata.org/en/latest/) about 5-10 minutes after I merge this to make sure everything looks right. RST can be tricky to get right.

'attrs': {'title': 'air temperature'},
'dims': 't',
'data': x,
d = {'coords': {'t': {'dims': 't', 'data': t, \
Contributor Author

Is this bad form? It renders more cleanly in the html docs.

Member

I would use :: and indentation, which will format this as code in the HTML docs: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#other-points-to-keep-in-mind

Member

The \n and the like are a little bad in the docstring because they aren't valid Python code.

Contributor Author

That makes sense

@jsignell
Contributor Author

I made the docs locally and they look how I expect.

@jsignell
Contributor Author

Done!

@shoyer shoyer merged commit b708f71 into pydata:master Aug 11, 2016
@shoyer
Member

shoyer commented Aug 11, 2016

Thank you @jsignell -- really nice work here!

@kwilcox

kwilcox commented Oct 17, 2016

@jsignell was the intention to have only Python scalars/lists in the dictionary (no numpy generics or arrays)? Right now attribute values are not being converted, which results in a mixed dict with the data as a list but the attributes still in numpy form. Thoughts?

@jsignell
Contributor Author

jsignell commented Oct 17, 2016

@kwilcox, I hadn't thought much about this. I guess the intention was to have only python scalars/lists, but I don't know if we want to get in the business of converting attribute values since there are so many options for what they could be. What do you think makes the most sense?

@kwilcox

kwilcox commented Oct 17, 2016

Before I looked at the code I assumed it was going to convert... but that's just me!

As it stands a custom encoder is needed to get a JSON dump from the output, which will be a fairly common use case for this function. See https://gist.github.com/kwilcox/c41834297b1a3b732cae3ee16621f6d0.

At the least, maybe add a little note in the documentation on how to dump the to_dict output to JSON (you also need to include decode_times=False because datetime objects are not JSON serializable either).
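A minimal sketch of such a custom encoder (illustrative, not the code from the linked gist):

```python
import json

class NumpyJSONEncoder(json.JSONEncoder):
    # Fall back to .tolist()/.item() so numpy arrays and scalars in a
    # to_dict result become JSON-serializable Python objects.
    def default(self, obj):
        if hasattr(obj, 'tolist'):
            return obj.tolist()
        if hasattr(obj, 'item'):
            return obj.item()
        return json.JSONEncoder.default(self, obj)
```

With that, something like `json.dumps(ds.to_dict(), cls=NumpyJSONEncoder)` works, provided the times were left undecoded.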

@shoyer
Member

shoyer commented Oct 17, 2016

I think we should try to convert numpy scalars/arrays, because otherwise the data won't convert directly to JSON. Possibly could take a duck typing approach of looking for .item() or .tolist() methods?
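That duck-typing approach might look something like this (hypothetical helper name):

```python
def attrs_to_native(attrs):
    # Convert numpy values in an attrs dict to plain Python objects,
    # duck-typing on .tolist()/.item() rather than importing numpy.
    out = {}
    for k, v in attrs.items():
        if hasattr(v, 'tolist'):      # ndarrays and numpy scalars
            out[k] = v.tolist()
        elif hasattr(v, 'item'):      # other 0-d array-likes
            out[k] = v.item()
        else:
            out[k] = v
    return out
```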

@jsignell
Contributor Author

jsignell commented Oct 17, 2016

@shoyer that makes sense. @kwilcox, so times should be strings if decode_times=False? Should I do a new pull request?

@shoyer
Member

shoyer commented Oct 17, 2016

We should recommend decode_times=False in the docs, but I would still produce datetime/timedelta objects by default -- this feels like something that should be left up to users.

@jsignell
Contributor Author

@shoyer is this something I should be working on? I am happy to, just don't know how this normally goes.

@shoyer
Member

shoyer commented Oct 17, 2016

Up to you, but yes it would be greatly appreciated if you could work on this.

