[Feature request] Masked operations #4143

@Hoeze

Description

Xarray already has unstack(sparse=True), which is quite awesome.
However, in many cases it is costly to convert a very dense array (existing values >> missing values) to a sparse representation. Also, many calculations require converting the sparse array back into a dense array and manually masking the missing values (e.g. Keras).

Logically, a sparse array is equivalent to a masked dense array.
The two differ only in their internal data representation.
Therefore, I propose a masked=True option for all operations that can create missing values. These include (amongst others):

  • .unstack([...], masked=True)
  • .where(<multi-dimensional array>, masked=True)
  • .align([...], masked=True)

This would solve a number of problems:

  • No more conversion of int -> float
  • An explicit marker for missing values
  • When stacking data with missing values, the missing values can simply be dropped
  • When converting data with missing values to a DataFrame, the missing values can simply be dropped
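To illustrate the first three points outside of xarray, here is a minimal sketch using plain NumPy. It only shows how numpy.ma behaves today and is not a proposed implementation; the array values and the `keep` mask are made up for the example.

```python
import numpy as np

values = np.array([[25, 35], [10, 24]], dtype=np.int64)
keep = np.array([[True, False], [True, True]])  # illustrative mask

# NaN-based masking has to promote the integers to float64.
nan_filled = np.where(keep, values, np.nan)
print(nan_filled.dtype)  # float64

# numpy.ma stores the mask next to the data, so the integer dtype survives
# and missingness is explicit instead of being encoded as NaN.
masked = np.ma.masked_array(values, mask=~keep)
print(masked.dtype)  # int64

# Flattening ("stacking") while dropping the masked entries.
print(masked.compressed())  # [25 10 24]
```

A masked=True option could, in principle, hand back data in this form instead of a NaN-filled float array.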

MCVE Code Sample

An example would be outer joins with slightly different coordinates (taken from the documentation):

>>> x
<xarray.DataArray (lat: 2, lon: 2)>
array([[25, 35],
       [10, 24]])
Coordinates:
* lat      (lat) float64 35.0 40.0
* lon      (lon) float64 100.0 120.0

>>> y
<xarray.DataArray (lat: 2, lon: 2)>
array([[20,  5],
       [ 7, 13]])
Coordinates:
* lat      (lat) float64 35.0 42.0
* lon      (lon) float64 100.0 120.0

Non-masked outer join:

>>> a, b = xr.align(x, y, join="outer")
>>> a
<xarray.DataArray (lat: 3, lon: 2)>
array([[25., 35.],
       [10., 24.],
       [nan, nan]])
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0
>>> b
<xarray.DataArray (lat: 3, lon: 2)>
array([[20.,  5.],
       [nan, nan],
       [ 7., 13.]])
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0

The masked version:

>>> a, b = xr.align(x, y, join="outer", masked=True)
>>> a
<xarray.DataArray (lat: 3, lon: 2)>
masked_array(data=[[25, 35],
                   [10, 24],
                   [--, --]],
             mask=[[False, False],
                   [False, False],
                   [True, True]],
             fill_value=0)
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0
>>> b
<xarray.DataArray (lat: 3, lon: 2)>
masked_array(data=[[20, 5],
                   [--, --],
                   [7, 13]],
             mask=[[False, False],
                   [True, True],
                   [False, False]],
             fill_value=0)
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0
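Until something like masked=True exists, the result above can be approximated with the existing fill_value argument of xr.align. The sketch below is only a workaround under stated assumptions: `masked_outer_align` and `make` are hypothetical helpers (not part of xarray), the sentinel value is arbitrary, and the helper returns plain numpy.ma arrays, so the coordinates are not carried along.

```python
import numpy as np
import xarray as xr


def make(values, lats):
    """Build one of the example arrays with an explicit integer dtype."""
    return xr.DataArray(
        np.array(values, dtype=np.int64),
        dims=("lat", "lon"),
        coords={"lat": lats, "lon": [100.0, 120.0]},
    )


x = make([[25, 35], [10, 24]], [35.0, 40.0])
y = make([[20, 5], [7, 13]], [35.0, 42.0])


def masked_outer_align(*arrays, sentinel=0):
    """Hypothetical helper: outer-align and return numpy masked arrays."""
    # An integer fill value avoids the int -> float promotion.
    data = xr.align(*arrays, join="outer", fill_value=sentinel)
    # Align all-True arrays the same way; newly created cells become False.
    present = xr.align(
        *(xr.ones_like(arr, dtype=bool) for arr in arrays),
        join="outer",
        fill_value=False,
    )
    return [
        np.ma.masked_array(d.values, mask=~p.values.astype(bool))
        for d, p in zip(data, present)
    ]


a, b = masked_outer_align(x, y)
print(a.dtype)          # int64
print(a.mask.tolist())  # [[False, False], [False, False], [True, True]]
print(b.mask.tolist())  # [[False, False], [True, True], [False, False]]
```

This aligns everything twice, which is part of why a built-in masked=True would be preferable.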

Related issue:
#3955
