Implement BaseOffset in tslibs.offsets #18016
    'hours', 'minutes', 'seconds', 'milliseconds', 'microseconds'
])


def _determine_offset(kwds):
At the moment this is a method of DateOffset that only gets called in __init__.
    _cacheable = True


class BeginMixin(object):
BeginMixin and EndMixin are new; each has only the one method. At the moment these methods are in DateOffset, but they are only used by a handful of FooBegin and BarEnd subclasses.
def __neg__(self):
    # Note: we are deferring directly to __mul__ instead of __rmul__, as
    # that allows us to use methods that can go in a `cdef class`
    return self * -1
In the status quo __neg__ is defined as return self.__class__(-self.n, normalize=self.normalize, **self.kwds). By deferring to __mul__, we move away from the self.kwds pattern. Ditto for copy.
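To illustrate the point above, here is a minimal sketch of __neg__ deferring to __mul__. The Offset class is a hypothetical stand-in written for this comment, not the pandas implementation; the only place that still rebuilds the object from self.kwds is __mul__.

```python
class Offset:
    """Hypothetical stand-in for an offset class; not the pandas code."""

    def __init__(self, n=1, normalize=False, **kwds):
        self.n = n
        self.normalize = normalize
        self.kwds = kwds

    def __mul__(self, other):
        # The reconstruction pattern lives in exactly one place.
        if not isinstance(other, int):
            return NotImplemented
        return type(self)(n=other * self.n, normalize=self.normalize,
                          **self.kwds)

    def __rmul__(self, other):
        # Reflected multiplication reuses the forward implementation.
        return self.__mul__(other)

    def __neg__(self):
        # Defer to __mul__ rather than rebuilding from self.kwds here.
        return self * -1


offset = Offset(3, normalize=True, weekday=2)
negated = -offset
print(negated.n, negated.normalize, negated.kwds)
```

With this layout, replacing the kwds-based reconstruction later only requires touching __mul__.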
pandas/tests/tseries/test_offsets.py
Outdated
from pandas.tseries.holiday import USFederalHolidayCalendar


data_dir = tm.get_data_path()
Moving this call up here ensures that we get the same data_dir whether running the tests via pytest or interactively. Under the status quo, copy/pasting the pertinent test below will fail because get_data_path will not behave as expected.
huh? we use this pattern everywhere, why are you changing this?
Because when I try to run these tests interactively and copy/paste the contents of a test function, tm.get_data_path returns unexpected results depending on os.getcwd(). AFAICT when run non-interactively it behaves as if cwd is pandas/tests/tseries.
what do you mean 'interactively'? you should simply be running
pytest pandas/tests/...... -k ... or whatever that is the idiomatic way to run tests.
When a test fails and I want to figure out why, I run the contents of the test manually in the REPL.
Happy to revert this change; not that big a deal.
yes pls revert.
standard way to run tests is
pytest path/to/test -k optional_regex
lots of options, including --pdb to drop into the debugger
pls revert this is non-standard
Codecov Report
@@            Coverage Diff             @@
##           master   #18016      +/-   ##
==========================================
- Coverage   91.23%   91.21%   -0.03%
==========================================
  Files         163      163
  Lines       50091    50032      -59
==========================================
- Hits        45703    45636      -67
- Misses       4388     4396       +8
Codecov Report
@@            Coverage Diff             @@
##           master   #18016      +/-   ##
==========================================
+ Coverage   91.28%   91.41%   +0.12%
==========================================
  Files         163      163
  Lines       50130    50073      -57
==========================================
+ Hits        45761    45772      +11
+ Misses       4369     4301      -68
# ---------------------------------------------------------------------
# Base Classes


class _BaseOffset(object):
why are you creating a base class here? what is the purpose?
IOW why not simply have 1 Base class (and not a _BaseOffset and a BaseOffset)
See comments about remaining cython/pickle issues.
You're absolutely right that in its current form having two separate classes accomplishes nothing. The idea is that _BaseOffset should be a cdef class, while BaseOffset should be a Python class (__rfoo__ methods do not play nicely with Cython classes).
ok that is fine.
I would probably leave this as a class for the moment. I am not convinced this actually needs to be a full c-extension class (e.g. it's not like we are inheriting from a python c-class here). I don't see the benefit and it has added complexity.
The main reason is to achieve immutability. That's the big roadblock between us and making __eq__, __ne__, __hash__ performance not-awful. (There's an issue somewhere about "scalar types immutable" or something like that)
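As a sketch of why immutability matters here: once an object cannot change after construction, its hash can be computed once and stored, and equality can compare a few fixed fields. FrozenOffset below is a hypothetical pure-Python illustration, not the cdef-class approach this PR is aiming for.

```python
class FrozenOffset:
    """Hypothetical immutable offset; illustration only."""

    __slots__ = ("_n", "_normalize", "_hash")

    def __init__(self, n=1, normalize=False):
        # Bypass our own __setattr__ to populate the slots once.
        object.__setattr__(self, "_n", n)
        object.__setattr__(self, "_normalize", normalize)
        # Safe to cache: the fields can never change afterwards.
        object.__setattr__(
            self, "_hash", hash((type(self).__name__, n, normalize))
        )

    def __setattr__(self, name, value):
        raise AttributeError("FrozenOffset is immutable")

    @property
    def n(self):
        return self._n

    @property
    def normalize(self):
        return self._normalize

    def __eq__(self, other):
        if type(other) is not type(self):
            return NotImplemented
        return self._n == other._n and self._normalize == other._normalize

    def __hash__(self):
        # No per-call recomputation of parameters.
        return self._hash


a = FrozenOffset(2)
b = FrozenOffset(2)
print(a == b, hash(a) == hash(b))
```

In Python 3, __ne__ is derived from __eq__ automatically, so only __eq__ and __hash__ need to be written.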
ok
pandas/tseries/offsets.py
Outdated
# default _from_name calls cls with no args
if suffix:
    raise ValueError("Bad freq suffix {suffix}".format(suffix=suffix))
    raise ValueError("Bad freq suffix %s" % suffix)
revert, we are moving towards new style string formatting
Woops, copy/paste from an older version. Will revert.
def _should_cache(self):
    return self.isAnchored() and self._cacheable

def __repr__(self):
side note, the repr is currently used for hashing, but we should instead simply define __hash__, I think.
__hash__ is defined using _params() which is the god-awful slow thing we need to get rid of.
small comments, and rebase
@jreback For triaging purposes, this is the only one of my PRs that is blocking non-refactoring work.
from conversion cimport tz_convert_single
from pandas._libs.tslib import pydt_to_i8

from frequencies cimport get_freq_code
update setup.py for this
lgtm ping on green
TestClipboard.test_round_trip_valid_encodings, otherwise green. Will push a dummy commit anyway.
Ping
thanks!
This moves a handful of methods of DateOffset up into tslibs.offsets.BaseOffset. The focus for now is on arithmetic methods that do not get overridden by subclasses. These use the self.__class__(..., **self.kwds) pattern that we eventually need to get rid of. Isolating this pattern before suggesting alternatives.

The _BaseOffset class was intended to be a cdef class, but that leads to errors in test_pickle_v0_15_2 that I haven't figured out yet. Once that gets sorted out, we can make DateOffset immutable and see some real speedups via caching.

See other comments in-line.