
groupby on dask objects doesn't handle chunks well #1832

Description

@rabernat

80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately, this is a failure case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me.

Code Sample

import xarray as xr
import dask.array as da
import pandas as pd
# construct an example dataset chunked in time
nt, ny, nx = 366, 180, 360
time = pd.date_range(start='1950-01-01', periods=nt, freq='10D')
ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)),
                   dims=('time', 'lat', 'lon'),
                   coords={'time': time}).to_dataset(name='field')
# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')
# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
print(ds_anom)
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 366)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ...
    month    (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ...
Dimensions without coordinates: lat, lon
Data variables:
    field    (time, lat, lon) float64 dask.array<shape=(366, 180, 360), chunksize=(366, 180, 360)>

Problem description

As the example above shows, the chunking has been lost: the result contains a single huge chunk. This happens with any non-reducing operation on the groupby, even a no-op like

ds.groupby('time.month').apply(lambda x: x)
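
A quick way to see this is to compare the underlying dask chunks of the input and the result (a minimal check against the example dataset defined above):

# inspect the dask chunks of the input field vs. the anomaly
print(ds.field.data.chunks)       # ((1, 1, ..., 1), (180,), (360,))  -- one chunk per time step
print(ds_anom.field.data.chunks)  # ((366,), (180,), (360,))          -- a single collapsed chunk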

Say we wanted to compute some statistics of the anomaly, like the variance:

(ds_anom.field**2).mean(dim='time').load()

This triggers the single big chunk (containing the whole timeseries) to be loaded into memory at some point. For out-of-core datasets, this will crash our system.
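
To put a rough number on it (assuming the toy dimensions above and float64 data):

# approximate size of the single collapsed chunk that must be materialized at once
nbytes = nt * ny * nx * 8           # 366 * 180 * 360 * 8 bytes
print('%.0f MB' % (nbytes / 1e6))   # ~190 MB for this toy example; real multi-decade datasets run to many GB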

Expected Output

It seems like we should be able to do this lazily, maintaining a chunk size of (1, 180, 360) for ds_anom.
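
One possible stopgap (not verified here, and it sidesteps rather than fixes the groupby behaviour) is to avoid groupby arithmetic entirely and instead broadcast the monthly climatology back onto the time axis with vectorized selection, then subtract directly:

# possible workaround sketch: select the climatology for each timestamp and subtract,
# avoiding the groupby binary-op code path; whether the original chunking survives
# may depend on the xarray / dask versions in use
clim = ds_mm.sel(month=ds['time.month'])
ds_anom_alt = ds['field'] - clim['field']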

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.0+dev27.g049cbdd
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.3.1
h5netcdf: 0.4.1
Nio: None
zarr: 2.2.0a2.dev91
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
distributed: 1.20.1
matplotlib: 2.1.0
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.3.0
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.1.0
sphinx: 1.6.5

Possibly related to #392.

cc @mrocklin
