-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately this is a fail case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me.
Code Sample
# Your code here
import xarray as xr
import dask.array as da
import pandas as pd
# construct an example datatset chunked in time
nt, ny, nx = 366, 180, 360
time = pd.date_range(start='1950-01-01', periods=nt, freq='10D')
ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)),
dims=('time', 'lat', 'lon'),
coords={'time': time}).to_dataset(name='field')
# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')
# anomaly
ds_anom = ds.groupby('time.month')- ds_mm
print(ds_anom)
<xarray.Dataset>
Dimensions: (lat: 180, lon: 360, time: 366)
Coordinates:
* time (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ...
month (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ...
Dimensions without coordinates: lat, lon
Data variables:
field (time, lat, lon) float64 dask.array<shape=(366, 180, 360), chunksize=(366, 180, 360)>
Problem description
As we can see in the example above, the chunking has been lost. The dataset contains just one single huge chunk. This happens with any non-reducing operation on the groupby, even
ds.groupby('time.month').apply(lambda x: x)
Say we wanted to compute some statistics of the anomaly, like the variance:
(ds_anom.field**2).mean(dim='time').load()
This triggers the whole big chunk (with the whole timeseries) to be loaded into memory somewhere. For out-of-core datasets, this will crash our system.
Expected Output
It seems like we should be able to do this lazily, maintaining a chunk size of (1, 180, 360)
for ds_anom.
Output of xr.show_versions()
xarray: 0.10.0+dev27.g049cbdd
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.3.1
h5netcdf: 0.4.1
Nio: None
zarr: 2.2.0a2.dev91
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
distributed: 1.20.1
matplotlib: 2.1.0
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.3.0
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.1.0
sphinx: 1.6.5
Possibly related to #392.
cc @mrocklin