Resample v2 clean #1
more bugs fixed, cleaned.
Nice work! |
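This version of resampling does not perfectly replicate the results of pandas resampling; you'll see the differences when you run the tests in test_cftimeindex_resample.py. Downsampling is mostly fine but has issues with extra or missing NaN bins, while for upsampling the bin edges and values do not match the ones generated by pandas in some cases. CFTimeIndex upsampling might need binning logic written explicitly for it to match pd.resample outputs instead of relying on pandas reindex, groupby, etc., since pd.resample does not display consistent binning behavior (sometimes the values are clearly bounded by the original index dates, sometimes the values are taken from the nearest date). Look at what values get assigned to which dates by pandas for the upsampling test in test_cftimeindex_resample.py to see what I mean.
For non-standard calendars, I don't have a "gold standard" (e.g., pd.resample results) to compare the CFTimeIndex resampling results to.
The temp folders and files within will be scrapped and some content might be converted into tests. For now I'm keeping them to record instances of weird behavior from this implementation of resampling as well as pandas' resampling. Try upsampling to freq='8H' with pandas with a DatetimeIndex of times = pd.date_range('2000-01-01T13:02:03', '2000-02-01T00:00:00', freq='D', tz='UTC').
I'll look into reusing existing resampling tests in xarray. As I recall, they did not test certain edge cases (cases where closed, label, and/or base had arguments specified, cases of equal sampling, etc.).
I'll add the docstrings.
Currently I'm on a break while @tlogan2000 and others look for bugs beyond those I'm aware of. I'll address these issues once I'm back at the start of December. |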
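The pandas upsampling case suggested in this thread (a daily DatetimeIndex starting at 13:02:03, upsampled to 8-hourly) can be tried with the short sketch below. It is only an illustration under a few assumptions: the series values are made up, the variable names are hypothetical, and the frequency is spelled '8h' because recent pandas releases deprecate the upper-case '8H' alias used in the thread.

```python
import pandas as pd

# Daily timestamps that do not fall on midnight, as suggested in the thread.
times = pd.date_range("2000-01-01T13:02:03", "2000-02-01T00:00:00",
                      freq="D", tz="UTC")
# Made-up values: 0.0, 1.0, 2.0, ... so each original day is identifiable.
series = pd.Series(range(len(times)), index=times, dtype="float64")

# "8h" is the lower-case spelling of the "8H" alias used in the thread.
resampler = series.resample("8h")

exact = resampler.asfreq()     # keeps a value only where timestamps match exactly
nearest = resampler.nearest()  # takes the value of the closest original timestamp
forward = resampler.ffill()    # takes the most recent earlier original value

# Put the three fill strategies side by side on the upsampled index.
comparison = pd.DataFrame({"asfreq": exact, "nearest": nearest, "ffill": forward})
print(comparison.head(6))
```

Comparing the columns shows how the value assigned to a new timestamp depends on the fill method, which is the kind of behavior a CFTimeIndex implementation would have to reproduce.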
Got it, thanks for the explanations. Will dive into it.
On Tue, Nov 20, 2018 at 23:33, jwenfai <[email protected]> wrote:
… This version of resampling does not perfectly replicate the results of
pandas resampling, you'll see them when you run the tests in
test_cftimeindex_resample.py. Downsampling is mostly fine but has issues
with extra or missing NaN bins while for upsampling the bin edges and
values do not match the ones generated by pandas in some cases. CFTimeIndex
upsampling might have to have binning logic explicitly written for it to
match pd.resample outputs instead of relying on pandas reindex, groupby
etc. since pd.resample does not display consistent binning behavior
(sometimes the values are clearly bounded by the original index dates,
sometimes the values are taken from the nearest date). Look at what values
get assigned to which dates by pandas for the upsampling test in
test_cftimeindex_resample.py to get what I mean.
For non-standard calendars, I don't have a "gold standard" (e.g.,
pd.resample results) to compare the CFTimeIndex resampling results to.
The temp folders and files within will be scrapped and some content might
be converted into tests. For now I'm keeping them to record instances of
weird behavior from this implementation of resampling as well as pandas'
resampling. Try upsampling to freq='8H' with pandas with DatetimeIndex of
times = pd.date_range('2000-01-01T13:02:03', '2000-02-01T00:00:00',
freq='D', tz='UTC')
I'll look into reusing existing resampling tests in xarray. As I recall,
they did not test certain edge cases (cases where closed, label, and/or
base had arguments specified, cases of equal sampling etc).
I'll add the docstrings.
Currently I'm on a break while @tlogan2000 <https://github.com/tlogan2000>
and others look for bugs beyond those I'm aware of. I'll address these
issues once I'm back at the start of December.
|
I will approve the pull request, but @huard, I'll let you do the final merge. Are you still looking at the code?
For my part, I have run @jwenfai's resample version on multiple netCDF files (various calendars and frequencies), comparing the output with what is achieved using the 'groupby' workaround in xarray as well as with simply using the netCDF4 library. For downsampling I always get identical results. I have not explored upsampling.
Ok. yes. |
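The kind of check described above (downsampling with resample versus a groupby-based workaround) can be sketched on a standard calendar with pandas alone. This is an analogue rather than the reviewer's actual test: the data are synthetic, the variable names are hypothetical, and it uses a pandas DatetimeIndex instead of xarray with a CFTimeIndex.

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily data on a standard calendar.
times = pd.date_range("2000-01-01", "2001-12-31", freq="D")
daily = pd.Series(np.arange(len(times), dtype=float), index=times)

# Downsample with resample: one bin per month, labelled at the month start.
resampled = daily.resample("MS").mean()

# "groupby workaround": group on (year, month) and reduce each group.
grouped = daily.groupby([daily.index.year, daily.index.month]).mean()

# Both approaches should produce the same monthly means, in the same order.
np.testing.assert_allclose(resampled.to_numpy(), grouped.to_numpy())
print(len(resampled), "monthly means agree")
```

The groupby result carries (year, month) labels rather than timestamps, so only the values are compared here.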
@jwenfai It seems the resampling code should also be included in |
Yes, it seems like it should be. I was working off of the WIP here pydata#2458 and didn't notice there was an older version of resampling. |
@biner @tlogan2000 I suggest we finish our internal review this week and let @jwenfai submit a PR to xarray early next week when he's ready to take questions from the xarray team. My guess is that this work will raise some incompatibilities between pandas and xarray that we won't solve right now. |
I agree. The xarray maintainers will probably be able to provide more comments and feedback to Low with less effort than us. Really digging into the code shows that I would need to spend a considerable amount of time understanding the inner workings of pandas / xarray resampling to give Low any meaningful direction at this point. |
@jwenfai could you push a new commit (can be anything, really minor) so that it triggers a new build this week? I've enabled Travis CI on this repo and want to ensure your fork runs well on the upstream tests. |
Just pushed a commit. |
Thanks, looks good. Many of the failing tests are from assertion failures that in actuality are passing that deal with odd CF calendars (which is good!). It might be a good idea to
|
I'm having trouble upsampling non-standard-calendar netCDF files (using xclim test files). Upsampling using interpolation returns an error indicating that the new x values are above the interpolation range. This occurs for both the 'noLeap' and '360day' calendars. The standard or 'leap' calendar works as expected.
Code:
import glob
import warnings

import cftime
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

warnings.filterwarnings("ignore")
xr.set_options(enable_cftimeindex=True)

# netCDF test files, keyed by calendar type
ncFiles = {'360': './xclim/tests/testdata/HadGEM2-CC_360day/*.nc',
           'noLeap': './xclim/tests/testdata/CanESM2_365day/*.nc',
           'leap': './xclim/tests/testdata/NRCANdaily/*.nc'
           }

# Upsample tests
for cal in sorted(ncFiles.keys()):
    infiles = glob.glob(ncFiles[cal])
    ds = xr.open_mfdataset(infiles)
    # print(ds.time)
    # print(ds.time.encoding['calendar'])
    # upsample the daily series at one grid point to hourly
    rsmp = ds['tasmax'][:, 15, 15].load().resample(time='1H')
    upsmpNear = rsmp.nearest()
    # every 24th hourly value should equal the original daily value
    np.testing.assert_array_equal(upsmpNear[::24], ds['tasmax'][:, 15, 15].values)
    if cal != 'leap':
        # convert cftime dates to numbers so they can be plotted
        x1 = cftime.date2num(upsmpNear.time, upsmpNear.time.encoding['units'], upsmpNear.time.encoding['calendar'])
        xorig = cftime.date2num(ds.time, ds.time.encoding['units'], ds.time.encoding['calendar'])
        days = 31
        plt.plot(x1[0:days * 24], upsmpNear[0:days * 24].values, color='r')
        plt.scatter(xorig[0:days], ds['tasmax'][0:days, 15, 15])
    print('nearest neighbor upsampling results OK')
    # upsampling via interpolation gives error "ValueError: A value in x_new is above the interpolation range"
    upsmpInterp = rsmp.interpolate('linear')
|
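The error text quoted above matches the message scipy's interp1d raises when asked to evaluate outside the range of its input x values. The sketch below reproduces that message in isolation. It assumes, without confirmation from this thread, that the interpolation path ends up in scipy.interpolate.interp1d with bounds checking enabled; the arrays and the tiny overshoot of 1e-9 are made up for illustration and are not taken from the failing files.

```python
import numpy as np
from scipy.interpolate import interp1d

# Made-up sample positions and values standing in for a numeric time axis.
x = np.array([0.0, 1.0, 2.0])
y = np.array([10.0, 20.0, 30.0])

# Without fill_value="extrapolate", interp1d raises on out-of-range input.
f = interp1d(x, y, kind="linear")

# Points inside [x.min(), x.max()] interpolate normally.
inside = f(np.array([0.5, 2.0]))

# A new x value that exceeds x.max() by a tiny amount (hypothetical overshoot).
x_new = np.array([0.5, 2.0 + 1e-9])
try:
    f(x_new)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

If something like this is the cause, comparing the numeric form of the new index against the numeric form of the original time axis would show whether any new value falls outside the original range.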
Downsampling test for the same files against the 'daily_downsampler' in xclim (all successful):
from xclim.utils import daily_downsampler

# reuses ncFiles and the imports from the upsampling snippet above
for cal in sorted(ncFiles.keys()):
    infiles = glob.glob(ncFiles[cal])
    ds = xr.open_mfdataset(infiles)
    # print(ds.time)
    # print(ds.time.encoding['calendar'])
    rsmp = ds['tasmax'].resample(time='MS')  # gives error
    grouper = daily_downsampler(ds['tasmax'], freq='MS')

    # check mean values vs daily_downsampler
    test_rMean = rsmp.mean(dim='time')
    test_gMean = grouper.mean(dim='time')
    # print('360 calendar mean: unique diff values ', np.unique(test_rMean.values - test_gMean.values))
    np.testing.assert_array_equal(test_gMean, test_rMean)

    # check max values vs daily_downsampler
    test_rMax = rsmp.max(dim='time')
    test_gMax = grouper.max(dim='time')
    np.testing.assert_array_equal(test_gMax, test_rMax)

    # check min values vs daily_downsampler
    # (the original paste repeated the max check here; min matches the message printed below)
    test_rMin = rsmp.min(dim='time')
    test_gMin = grouper.min(dim='time')
    np.testing.assert_array_equal(test_gMin, test_rMin)
    print(cal, ' calendar : Downsample Min, Max, Mean results identical')

    # check time coords vs daily_downsampler
    time1 = daily_downsampler(ds.time, freq='MS').first()
    test_gMean.coords['time'] = ('tags', time1.values)
    test_gMean = test_gMean.swap_dims({'tags': 'time'})
    test_gMean = test_gMean.sortby('time')
    np.testing.assert_array_equal(test_gMean.time, test_rMean.time)
    print(cal, ' calendar : time values identical') |
I took a look at the code and I am not in a position to offer many comments. I made a few tests using constructed daily data for different calendars. Downsampling from daily to monthly works fine. I then reconstructed the daily data from data sampled every 2 days using upsampling, and that works fine too. Inspired by @TBLogan, I tried to upsample my daily data to 1H data. This works for the standard calendar, but I get errors for the other calendars, even when I use the 'nearest' kind for interpolate...
|
Info that could be helpful for the upsampling out-of-bounds errors:
rsmp = ds['tasmax'][:, 15, 15].load().resample(time='1H')
rsmp._full_index.max() == ds.time.values.max()  # True
rsmp._full_index.min() == ds.time.values.min()  # Also True
The one thing I notice is that with normal calendar files the resample index is a 'Timestamp', e.g.
rsmp._full_index.max()  # gives:
Timestamp('1990-12-31 00:00:00', freq='H')
whereas for non-standard calendars the values are cftime datetimes:
rsmp._full_index.max()  # gives:
cftime._cftime.Datetime360Day(2095, 12, 30, 0, 0, 0, 0, -1, 1)
|
import cftime
start = cftime._cftime.Datetime360Day(2095, 12, 30, 0, 0, 0, 0, -1, 360)
end = cftime._cftime.Datetime360Day(2095, 12, 30, 1, 0, 0, 0, -1, 360)
print((end - start).total_seconds())
Will give you
I'm also documenting a cftime/timedelta quirk that does not affect upsampling but may be a potential source of problems in the future.
import cftime
import datetime
import xarray as xr
start = cftime._cftime.Datetime360Day(2095, 12, 30, 0, 0, 0, 0, -1, 360)
print('dayofyr = ' + (start - datetime.timedelta(days=1)).strftime('%j'))
print('dayofyr = ' + xr.cftime_range(start='2000-01-01',
                                     freq='D',
                                     periods=360,
                                     calendar='360_day')[-1].strftime('%j'))
Both will give you |
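As a reference point when reading the output of the two snippets above, the day of year in a 360-day calendar (twelve 30-day months) can be worked out by hand. The sketch below is plain arithmetic using only the standard library. The function names are hypothetical, it is not cftime's implementation, and it makes no claim about what cftime actually prints.

```python
from typing import Tuple

DAYS_PER_MONTH = 30    # every month has 30 days in a 360-day calendar
MONTHS_PER_YEAR = 12
DAYS_PER_YEAR = DAYS_PER_MONTH * MONTHS_PER_YEAR


def day_of_year_360(month: int, day: int) -> int:
    """Return the 1-based day of year for a date in a 360-day calendar."""
    return (month - 1) * DAYS_PER_MONTH + day


def minus_one_day_360(year: int, month: int, day: int) -> Tuple[int, int, int]:
    """Return the date one day earlier in a 360-day calendar."""
    # Convert to a 0-based day count, step back one day, convert back.
    ordinal = year * DAYS_PER_YEAR + (month - 1) * DAYS_PER_MONTH + (day - 1) - 1
    new_year, remainder = divmod(ordinal, DAYS_PER_YEAR)
    month0, day0 = divmod(remainder, DAYS_PER_MONTH)
    return new_year, month0 + 1, day0 + 1


# The last day of a 360-day year, and the day before it.
last_day = (2095, 12, 30)
previous = minus_one_day_360(*last_day)
print(previous, day_of_year_360(previous[1], previous[2]))
print(last_day, day_of_year_360(last_day[1], last_day[2]))
```

Any difference between these hand-computed values and the strftime('%j') output would point at the quirk being documented.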
Get PEP8 changes from Ouranosinc.