Xarray already has unstack(sparse=True) which is quite awesome.
However, in many cases it is costly to convert a very dense array (existing values >> missing values) to a sparse representation. Also, many calculations require converting the sparse array back into a dense array and manually masking the missing values (e.g. Keras).
Logically, a sparse array is equal to a masked dense array.
They only differ in their internal data representation.
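To make that equivalence concrete, here is a minimal NumPy-only sketch that round-trips between a masked dense array and a hand-rolled coordinate list. The helper names `masked_to_coo` and `coo_to_masked` are hypothetical and are not part of xarray, NumPy, or the `sparse` package; they only illustrate that both forms carry the same information.

```python
import numpy as np


def masked_to_coo(m):
    """Return (coords, data, shape) for the unmasked entries of a masked array."""
    present = ~np.ma.getmaskarray(m)       # True where a value exists
    coords = np.argwhere(present)          # shape (n_present, ndim), row-major order
    data = np.asarray(m.data)[present]     # values in the same row-major order
    return coords, data, m.shape


def coo_to_masked(coords, data, shape, fill_value=0):
    """Rebuild a masked dense array from the coordinate list."""
    dense = np.full(shape, fill_value, dtype=data.dtype)
    mask = np.ones(shape, dtype=bool)      # everything missing until proven present
    idx = tuple(coords.T)
    dense[idx] = data
    mask[idx] = False
    return np.ma.masked_array(dense, mask=mask, fill_value=fill_value)


m = np.ma.masked_array(
    [[25, 35], [10, 24], [0, 0]],
    mask=[[False, False], [False, False], [True, True]],
)
coords, data, shape = masked_to_coo(m)
back = coo_to_masked(coords, data, shape)
```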
Therefore, I would propose to have a masked=True option for all operations that can create missing values. These cover (amongst others):
- .unstack([...], masked=True)
- .where(<multi-dimensional array>, masked=True)
- .align([...], masked=True)
This would solve a number of problems:
- No more conversion of int -> float
- Explicit value for missingness
- When stacking data with missing values, the missing values can be just dropped
- When converting data with missing values to DataFrame, the missing values can be just dropped
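The first and third points can be illustrated with plain NumPy, independent of any xarray support. This is only a sketch of the underlying behaviour: padding with NaN forces a float dtype, while `numpy.ma.masked_array` keeps the integer dtype and lets the missing entries be dropped with `compressed()`. The sample values are taken from the example below.

```python
import numpy as np

values = np.array([25, 35, 10, 24], dtype=np.int64)

# NaN-based missingness: the padded array has to become float to hold NaN.
padded = np.concatenate([values.astype(float), [np.nan, np.nan]])

# Mask-based missingness: the data keeps its integer dtype, and the mask
# records which entries are missing.
masked = np.ma.masked_array(
    data=np.concatenate([values, [0, 0]]),
    mask=[False, False, False, False, True, True],
    fill_value=0,
)

# Dropping the missing values is a single call and needs no NaN check.
present_only = masked.compressed()

print(padded.dtype, masked.dtype, present_only)
```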
MCVE Code Sample
An example would be outer joins with slightly different coordinates (taken from the documentation):
>>> x
<xarray.DataArray (lat: 2, lon: 2)>
array([[25, 35],
       [10, 24]])
Coordinates:
  * lat      (lat) float64 35.0 40.0
  * lon      (lon) float64 100.0 120.0
>>> y
<xarray.DataArray (lat: 2, lon: 2)>
array([[20,  5],
       [ 7, 13]])
Coordinates:
  * lat      (lat) float64 35.0 42.0
  * lon      (lon) float64 100.0 120.0
Non-masked outer join:
>>> a, b = xr.align(x, y, join="outer")
>>> a
<xarray.DataArray (lat: 3, lon: 2)>
array([[25., 35.],
       [10., 24.],
       [nan, nan]])
Coordinates:
  * lat      (lat) float64 35.0 40.0 42.0
  * lon      (lon) float64 100.0 120.0
>>> b
<xarray.DataArray (lat: 3, lon: 2)>
array([[20.,  5.],
       [nan, nan],
       [ 7., 13.]])
Coordinates:
  * lat      (lat) float64 35.0 40.0 42.0
  * lon      (lon) float64 100.0 120.0
The masked version:
>>> a, b = xr.align(x, y, join="outer", masked=True)
>>> a
<xarray.DataArray (lat: 3, lon: 2)>
masked_array(data=[[25, 35],
                   [10, 24],
                   [--, --]],
             mask=[[False, False],
                   [False, False],
                   [True, True]],
             fill_value=0)
Coordinates:
  * lat      (lat) float64 35.0 40.0 42.0
  * lon      (lon) float64 100.0 120.0
>>> b
<xarray.DataArray (lat: 3, lon: 2)>
masked_array(data=[[20, 5],
                   [--, --],
                   [7, 13]],
             mask=[[False, False],
                   [True, True],
                   [False, False]],
             fill_value=0)
Coordinates:
  * lat      (lat) float64 35.0 40.0 42.0
  * lon      (lon) float64 100.0 120.0
Related issue:
#3955
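For comparison, the masked outer join shown above can be approximated with the existing API. The sketch below is a workaround under stated assumptions, not the proposed `masked=True` implementation: it relies on the `fill_value` argument of `xr.align` to keep the integer dtype, and aligns a boolean "valid" array alongside the data to recover the mask. The helper `masked_outer_align` is a hypothetical name, and it returns plain `numpy.ma` arrays rather than DataArrays.

```python
import numpy as np
import xarray as xr

x = xr.DataArray(
    [[25, 35], [10, 24]],
    dims=("lat", "lon"),
    coords={"lat": [35.0, 40.0], "lon": [100.0, 120.0]},
)
y = xr.DataArray(
    [[20, 5], [7, 13]],
    dims=("lat", "lon"),
    coords={"lat": [35.0, 42.0], "lon": [100.0, 120.0]},
)


def masked_outer_align(x, y, fill_value=0):
    """Hypothetical helper: outer-align two arrays and return masked NumPy arrays."""
    # An explicit fill_value avoids the NaN fill, so integer data stays integer.
    a, b = xr.align(x, y, join="outer", fill_value=fill_value)
    # Align an all-True array the same way; new positions are filled with False.
    valid_a, valid_b = xr.align(
        xr.ones_like(x, dtype=bool),
        xr.ones_like(y, dtype=bool),
        join="outer",
        fill_value=False,
    )
    ma_a = np.ma.masked_array(
        a.values, mask=~valid_a.values.astype(bool), fill_value=fill_value
    )
    ma_b = np.ma.masked_array(
        b.values, mask=~valid_b.values.astype(bool), fill_value=fill_value
    )
    return ma_a, ma_b


ma_a, ma_b = masked_outer_align(x, y)
print(ma_a)
print(ma_b)
```

The cost of this workaround is a second alignment pass and a mask that is no longer attached to the DataArray, which is what a built-in masked=True option would avoid.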