
groupby on dask objects doesn't handle chunks well #1832

Description

@rabernat

80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately, this is a failure case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me.

Code Sample

import xarray as xr
import dask.array as da
import pandas as pd
# construct an example dataset chunked in time
nt, ny, nx = 366, 180, 360
time = pd.date_range(start='1950-01-01', periods=nt, freq='10D')
ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)),
                   dims=('time', 'lat', 'lon'),
                   coords={'time': time}).to_dataset(name='field')
# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')
# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
print(ds_anom)
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 366)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ...
    month    (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ...
Dimensions without coordinates: lat, lon
Data variables:
    field    (time, lat, lon) float64 dask.array<shape=(366, 180, 360), chunksize=(366, 180, 360)>

Problem description

As the example above shows, the chunking has been lost: the result contains a single huge chunk. This happens with any non-reducing operation on the groupby, even a no-op like

ds.groupby('time.month').apply(lambda x: x)
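
A quick way to see this is to compare the underlying dask chunks of the input and the result (a minimal check against the example dataset defined above):

# inspect the dask chunks of the input field vs. the anomaly
print(ds.field.data.chunks)       # ((1, 1, ..., 1), (180,), (360,))  -- one chunk per time step
print(ds_anom.field.data.chunks)  # ((366,), (180,), (360,))          -- a single collapsed chunk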

Say we wanted to compute some statistics of the anomaly, like the variance:

(ds_anom.field**2).mean(dim='time').load()

This triggers the single big chunk (containing the whole timeseries) to be loaded into memory at some point. For out-of-core datasets, this will crash our system.
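
To put a rough number on it (assuming the toy dimensions above and float64 data):

# approximate size of the single collapsed chunk that must be materialized at once
nbytes = nt * ny * nx * 8           # 366 * 180 * 360 * 8 bytes
print('%.0f MB' % (nbytes / 1e6))   # ~190 MB for this toy example; real multi-decade datasets run to many GB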

Expected Output

It seems like we should be able to do this lazily, maintaining a chunk size of (1, 180, 360) for ds_anom.
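
One possible stopgap (not verified here, and it sidesteps rather than fixes the groupby behaviour) is to avoid groupby arithmetic entirely and instead broadcast the monthly climatology back onto the time axis with vectorized selection, then subtract directly:

# possible workaround sketch: select the climatology for each timestamp and subtract,
# avoiding the groupby binary-op code path; whether the original chunking survives
# may depend on the xarray / dask versions in use
clim = ds_mm.sel(month=ds['time.month'])
ds_anom_alt = ds['field'] - clim['field']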

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.0+dev27.g049cbdd
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.3.1
h5netcdf: 0.4.1
Nio: None
zarr: 2.2.0a2.dev91
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
distributed: 1.20.1
matplotlib: 2.1.0
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.3.0
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.1.0
sphinx: 1.6.5

Possibly related to #392.

cc @mrocklin
