@huard commented Nov 21, 2018

No description provided.

@huard commented Nov 21, 2018

Nice work!
Could you add docstrings to the functions in resample_cftime?
Also, I don't see tests with other calendars. Is this something you think we can do without?
I guess you're planning on scrapping the print statements in temp/? Is this something we should convert into tests?
Do you think we can reuse the existing resampling tests already in xarray and run them with other calendars?
For the official PR, we'll want to squash all commits into a single one.
I don't feel qualified to do a more in-depth review beyond that. I should have some time tomorrow to look into it in more detail.

@jwenfai commented Nov 21, 2018

This version of resampling does not perfectly replicate the results of pandas resampling; you'll see the differences when you run the tests in test_cftimeindex_resample.py. Downsampling is mostly fine, though it has issues with extra or missing NaN bins. For upsampling, the bin edges and values do not match the ones generated by pandas in some cases. CFTimeIndex upsampling may need binning logic written explicitly for it to match pd.resample outputs, instead of relying on pandas reindex, groupby, etc., since pd.resample does not display consistent binning behavior (sometimes the values are clearly bounded by the original index dates, sometimes the values are taken from the nearest date). Look at what values get assigned to which dates by pandas in the upsampling test in test_cftimeindex_resample.py to see what I mean.
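
For reference, pandas' own upsampling behavior on a toy series (values and dates here are illustrative, not from the test suite) shows how the intermediate bins come back as NaN:

```python
import pandas as pd

# three daily points upsampled to 12-hourly bins
times = pd.date_range('2000-01-01', periods=3, freq='D')
s = pd.Series([0.0, 1.0, 2.0], index=times)

# asfreq() reindexes onto the new bin labels; the intermediate
# 12:00 labels have no matching source timestamp, so they are NaN
up = s.resample('12H').asfreq()
print(up)
```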

For non-standard calendars, I don't have a "gold standard" (e.g., pd.resample results) to compare the CFTimeIndex resampling results to.

The temp folders and the files within will be scrapped, and some content might be converted into tests. For now I'm keeping them to record instances of weird behavior from this implementation of resampling as well as from pandas' resampling. Try upsampling to freq='8H' with pandas with a DatetimeIndex of
times = pd.date_range('2000-01-01T13:02:03', '2000-02-01T00:00:00', freq='D', tz='UTC')
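
A shorter variant of that reproduction (I've shortened the date range for illustration) shows why it's awkward: the 8-hour bin labels are anchored to midnight, so none of them line up with the 13:02:03 timestamps, and a plain reindex carries over no values at all:

```python
import numpy as np
import pandas as pd

times = pd.date_range('2000-01-01T13:02:03', '2000-01-05T00:00:00',
                      freq='D', tz='UTC')
s = pd.Series(np.arange(len(times), dtype=float), index=times)

# bin labels snap to 00:00/08:00/16:00, never to 13:02:03,
# so asfreq() finds no exact matches and every value is NaN
up = s.resample('8H').asfreq()
print(up)
```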

I'll look into reusing the existing resampling tests in xarray. As I recall, they did not test certain edge cases (cases where closed, label, and/or base had arguments specified, cases of equal sampling, etc.).
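
Those edge cases matter because closed and label change both which points fall in a bin and which edge names the bin; a small pandas sketch (my own toy data) of the kind of case worth covering:

```python
import pandas as pd

idx = pd.date_range('2000-01-01', periods=4, freq='D')
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# left-closed / left-labelled: bins [01-01, 01-03) and [01-03, 01-05)
left = s.resample('2D', closed='left', label='left').sum()

# right-closed / right-labelled: bins (12-30, 01-01], (01-01, 01-03],
# and (01-03, 01-05] -- a different number of bins for the same data
right = s.resample('2D', closed='right', label='right').sum()
print(left.values, right.values)
```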

I'll add the docstrings.

Currently I'm on a break while @tlogan2000 and others look for bugs beyond those I'm aware of. I'll address these issues once I'm back at the start of December.

@huard commented Nov 21, 2018 via email

@tlogan2000 left a comment

I will approve the pull request, but @huard, I'll let you do the final merge? Are you still looking at the code?

For my part, I have looked at running @jwenfai's resample version on multiple netCDF files (various calendars and frequencies), comparing the output with what is achieved using the 'groupby' workaround in xarray as well as simply using the netCDF4 library. For downsampling I have always obtained identical results... I have not explored upsampling.

@huard commented Nov 21, 2018

OK, yes.

@huard commented Nov 27, 2018

@jwenfai It seems the resampling code should also be included in _resample_immediately, no?

@jwenfai commented Nov 27, 2018

@jwenfai It seems the resampling code should also be included in _resample_immediately, no?

Yes, it seems like it should be. I was working off the WIP here (pydata#2458) and didn't notice there was an older version of resampling.

@huard commented Nov 27, 2018

@biner @tlogan2000 I suggest we finish our internal review this week and let @jwenfai submit a PR to xarray early next week, when he's ready to take questions from the xarray team. My guess is that this work will surface some incompatibilities between pandas and xarray that we won't solve right now.

@tlogan2000 commented

I agree. The xarray maintainers will probably be able to provide more comments/feedback to Low with less effort than we can. Really digging into the code shows that I would probably need to spend a considerable amount of time understanding the inner workings of pandas/xarray resampling to provide Low with any substantial direction at this point.

@Zeitsperre commented

@jwenfai could you push a new commit (can be anything, really minor) so that it triggers a new build this week? I've enabled Travis CI on this repo and want to ensure your fork runs well against the upstream tests.

@jwenfai commented Nov 27, 2018

Just pushed a commit.

@Zeitsperre commented

Thanks, looks good. Many of the failing tests are assertion failures from tests dealing with odd CF calendars that are, in actuality, passing (which is good!). It might be a good idea to @pytest.mark.skip these tests, add new tests, or invert the True/False assertions for them only, and see if the builds all pass. The following are the failing tests:

xarray/tests/test_cftimeindex.py::test_resample_error[365_day]
xarray/tests/test_cftimeindex.py::test_resample_error[360_day]
xarray/tests/test_cftimeindex.py::test_resample_error[julian]
xarray/tests/test_cftimeindex.py::test_resample_error[all_leap]
xarray/tests/test_cftimeindex.py::test_resample_error[366_day]
xarray/tests/test_cftimeindex.py::test_resample_error[gregorian]
xarray/tests/test_cftimeindex.py::test_resample_error[proleptic_gregorian]
...
xarray/tests/test_cftimeindex_resample.py::test_downsampler
xarray/tests/test_cftimeindex_resample.py::test_downsampler_daily
xarray/tests/test_cftimeindex_resample.py::test_upsampler
xarray/tests/test_cftimeindex_resample.py::test_upsampler_base
...
xarray/tests/test_dataarray.py::TestDataArray::test_resample_cftimeindex
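
A sketch of the skip approach (the test name and reason string below are mine; the real tests live in test_cftimeindex_resample.py):

```python
import pytest

# hypothetical illustration: park a known-noisy calendar test until the
# hard-coded expected values are updated, rather than leaving the build red
@pytest.mark.skip(reason="non-standard CF calendars resample correctly "
                         "but the expected values need updating")
def test_resample_365_day_placeholder():
    raise AssertionError("stale expected values")
```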

@tlogan2000 commented Nov 30, 2018

I'm having trouble with upsampling of non-standard calendar netCDF files (using the xclim test files:
https://github.com/Ouranosinc/xclim/tree/master/tests/testdata)

Upsampling using interpolation returns an error indicating that the new x values are above the interpolation range. This occurs for both the 'noLeap' and '360day' calendars; the standard or 'leap' calendar works as expected.

Code:

import xarray as xr
import glob
import numpy as np
import matplotlib.pyplot as plt
import cftime
import warnings
warnings.filterwarnings("ignore")

xr.set_options(enable_cftimeindex=True)

# ncfile day calendars
ncFiles = {'360': './xclim/tests/testdata/HadGEM2-CC_360day/*.nc',
           'noLeap': './xclim/tests/testdata/CanESM2_365day/*.nc',
           'leap': './xclim/tests/testdata/NRCANdaily/*.nc'
           }

# Upsample tests
for cal in sorted(ncFiles.keys()):
    infiles = glob.glob(ncFiles[cal])
    ds = xr.open_mfdataset(infiles)
    # print(ds.time)
    # print(ds.time.encoding['calendar'])
    rsmp = ds['tasmax'][:, 15, 15].load().resample(time='1H')

    upsmpNear = rsmp.nearest()
    np.testing.assert_array_equal(upsmpNear[::24], ds['tasmax'][:, 15, 15].values)
    if not cal == 'leap':
        x1 = cftime.date2num(upsmpNear.time, upsmpNear.time.encoding['units'], upsmpNear.time.encoding['calendar'])
        xorig = cftime.date2num(ds.time, ds.time.encoding['units'], ds.time.encoding['calendar'])

        days = 31
        plt.plot(x1[0:days * 24], upsmpNear[0:days * 24].values, color='r')
        plt.scatter(xorig[0:days], ds['tasmax'][0:days, 15, 15])
    print('nearest neighbor upsampling results OK')

    # upsampling via interpolation gives "ValueError: A value in x_new is above the interpolation range"
    upsmpInterp = rsmp.interpolate('linear')

@tlogan2000 commented

A downsampling test for the same files against daily_downsampler in xclim (all successful):

from xclim.utils import daily_downsampler
for cal in sorted(ncFiles.keys()):
    infiles = glob.glob(ncFiles[cal])
    ds = xr.open_mfdataset(infiles)
    #print(ds.time)
    #print(ds.time.encoding['calendar'])
    rsmp = ds['tasmax'].resample(time='MS')  # gives error
    grouper = daily_downsampler(ds['tasmax'], freq='MS')
    # check mean values vs daily_downsampler
    test_rMean = rsmp.mean(dim='time')
    test_gMean = grouper.mean(dim='time')
    #print('360 calendar mean: unique diff values ',np.unique(test_rMean.values - test_gMean.values))
    np.testing.assert_array_equal(test_gMean, test_rMean)

    # check max values vs daily_downsampler
    test_rMax = rsmp.max(dim='time')
    test_gMax = grouper.max(dim='time')
    np.testing.assert_array_equal(test_gMax, test_rMax)

    # check min values vs daily_downsampler
    test_rMin = rsmp.min(dim='time')
    test_gMin = grouper.min(dim='time')
    np.testing.assert_array_equal(test_gMin, test_rMin)
    print(cal, ' calendar : Downsample Min, Max, Mean results identical')

    # check time coords vs daily_downsampler
    time1 = daily_downsampler(ds.time, freq='MS').first()
    test_gMean.coords['time'] = ('tags', time1.values)
    test_gMean = test_gMean.swap_dims({'tags': 'time'})
    test_gMean = test_gMean.sortby('time')
    np.testing.assert_array_equal(test_gMean.time, test_rMean.time)
    print(cal, ' calendar : time values identical')

@sbiner commented Nov 30, 2018

I took a look at the code, and I am not in a position to offer many comments.

I made a few tests using constructed daily data for different calendars. For downsampling from daily to monthly, all is fine. I then reconstructed the daily data from data sampled every 2 days using upsampling; it works fine. Inspired by @TBLogan, I tried to upsample my daily data to 1H data. This works for the standard calendar, but I get errors for the other calendars, even when I use the 'nearest' kind for interpolate...

# tests using resample to downsample and upsample with theoretical values for a 1D DataArray. All the results I get are as expected.

import xclim
import xarray as xr
import cftime
import numpy as np
import os

print(xclim.__file__)
print(xr.__file__)

# 365_day calendar
# generate test DataArray
units = 'days since 2001-01-01 00:00'
time_365 = cftime.num2date(np.arange(0, 1 * 365), units, '365_day')
da_365 = xr.DataArray(np.arange(time_365.size), coords=[time_365], dims='time', name='data')
units = 'days since 2001-01-01 00:00'
time_std = cftime.num2date(np.arange(0, 1 * 365), units, 'standard')
da_std_365 = xr.DataArray(np.arange(time_std.size), coords=[time_std], dims='time', name='data')
units = 'days since 2000-01-01 00:00'
time_std = cftime.num2date(np.arange(0, 1 * 366), units, 'standard')
da_std_366 = xr.DataArray(np.arange(time_std.size), coords=[time_std], dims='time', name='data')
time_360 = cftime.num2date(np.arange(0, 360), units, '360_day')
da_360 = xr.DataArray(np.arange(1, time_360.size + 1), coords=[time_360], dims='time', name='data')

f_nc = os.path.join('/Users/sbiner/PycharmProjects/test_xclim_xarray',
                    'github/xclim/tests/testdata',
                    'NRCANdaily/nrcan_canada_daily_pr_1990.nc')

da_nc = xr.open_dataset(f_nc)


# test downsampling
freq = 'MS'
da_365_m = da_365.resample(time=freq).mean()
da_std_365_m = da_std_365.resample(time=freq).mean(dim='time')
da_std_366_m = da_std_366.resample(time=freq).mean()
da_360_m = da_360.resample(time=freq).mean()

np.testing.assert_allclose(da_365.values, da_std_365.values)
np.testing.assert_allclose(da_365_m.values, da_std_365_m.values)
target_360_m = np.array(range(1,361)).reshape(-1,30).mean(axis=1)
np.testing.assert_allclose(da_360_m.values, target_360_m)

# test upsampling
# on saute un jour sur deux et on reconstruit
da_std_365_2d = da_std_365[::2]
da_std_365_d = da_std_365_2d.resample(time='D').interpolate('linear')
np.testing.assert_allclose(da_std_365.values, da_std_365_d.values)

da_std_366_2d = da_std_366[::2]
da_std_366_d = da_std_366_2d.resample(time='D').interpolate('linear')
np.testing.assert_allclose(da_std_366.values[:-1], da_std_366_d.values)

da_365_2d = da_365[::2]
da_365_d = da_365_2d.resample(time='D').interpolate('linear')
np.testing.assert_allclose(da_365.values, da_365_d.values)

da_360_2d = da_360[::2]
da_360_d = da_360_2d.resample(time='D').interpolate('linear')
np.testing.assert_allclose(da_360.values[:-1], da_360_d.values)

# test upsampling from daily to hourly
buffer = da_std_365.resample(time='H').interpolate('nearest')
buffer = da_std_366.resample(time='H').interpolate('nearest')
buffer = da_365.resample(time='H').interpolate('nearest')
buffer = da_360.resample(time='H').interpolate('nearest')

# the last two lines raise
# ValueError: A value in x_new is above the interpolation range.

@tlogan2000 commented

Info that could be helpful for the upsampling out-of-bounds errors:
It seems strange, as the indexes seem OK; the min and max in the resample object are the same as in the dataset.

rsmp = ds['tasmax'][:, 15, 15].load().resample(time='1H')
rsmp._full_index.max() == ds.time.values.max()  # True
rsmp._full_index.min() == ds.time.values.min()  # also True

The one thing I notice is that with standard-calendar files the resample index is a pandas Timestamp, e.g.

rsmp._full_index.max()  # gives:
Timestamp('1990-12-31 00:00:00', freq='H')

whereas with non-standard calendars it is a cftime datetime:

rsmp._full_index.max()  # gives:
cftime._cftime.Datetime360Day(2095, 12, 30, 0, 0, 0, 0, -1, 1)

@jwenfai commented Dec 5, 2018

  1. _resample_immediately appears to be a deprecated method, so I won't bother reimplementing CFTimeIndex resampling there. I can tell _resample_immediately to call resample, or change the warning message to tell users to use resample instead.

  2. The issue with upsampling interpolation stems from cftime returning timedelta values that are a few microseconds off:

import cftime
start = cftime._cftime.Datetime360Day(2095, 12, 30, 0, 0, 0, 0, -1, 360)
end = cftime._cftime.Datetime360Day(2095, 12, 30, 1, 0, 0, 0, -1, 360)
print((end-start).total_seconds())

This gives 3599.999985 when it should be 3600.0.
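
Until that's fixed upstream, one possible workaround (a sketch I'm suggesting, not code from this PR; snap_total_seconds is my own helper name) is to snap near-integer second counts back to whole seconds before interpolating:

```python
import datetime


def snap_total_seconds(delta, tol=1e-3):
    # hypothetical helper: cftime subtraction can be a few microseconds
    # off, so round to whole seconds when the drift is below `tol`
    secs = delta.total_seconds()
    rounded = round(secs)
    return float(rounded) if abs(secs - rounded) < tol else secs


# 3599.999985 s is within tolerance of a full hour
print(snap_total_seconds(datetime.timedelta(seconds=3599.999985)))  # 3600.0
```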

I'm also documenting a cftime/timedelta quirk that does not affect upsampling but may be a potential source of problems in the future:

import cftime
import datetime
import xarray as xr
start = cftime._cftime.Datetime360Day(2095, 12, 30, 0, 0, 0, 0, -1, 360)
print('dayofyr = ' + (start - datetime.timedelta(days=1)).strftime('%j'))
print('dayofyr = ' + xr.cftime_range(start='2000-01-01', 
                                     freq='D', 
                                     periods=360, 
                                     calendar='360_day')[-1].strftime('%j'))

Both give '001' when they should give '359' and '360' respectively.
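
For the 360_day calendar specifically, the day-of-year can be recovered arithmetically, since every month has exactly 30 days (a minimal sketch; dayofyr_360 is my own helper name, not part of cftime):

```python
def dayofyr_360(month, day):
    # in the CF 360_day calendar, every month has exactly 30 days
    return (month - 1) * 30 + day


print(dayofyr_360(12, 29))  # 359
print(dayofyr_360(12, 30))  # 360
```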

@jwenfai jwenfai merged commit e64fedb into Ouranosinc:master Dec 5, 2018
Zeitsperre pushed a commit that referenced this pull request Dec 18, 2018
Get PEP8 changes from Ouranosinc.