[SPARK-49693][PYTHON][CONNECT] Refine the string representation of timedelta
#48159
Conversation
delta = DayTimeIntervalType().fromInternal(self._value)
if delta is not None and isinstance(delta, datetime.timedelta):
    try:
        import pandas as pd
Spark Connect requires pyarrow/pandas, IIRC, so you won't need to handle the import exception here, if that's what the try block is covering.
Yeah, pandas is a mandatory dependency for Connect; let me remove this try/except.
s = str(sf.lit(delta))

# Parse the ISO string representation and compare
self.assertTrue(pd.Timedelta(s[8:-2]).to_pytimedelta() == delta)
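For readers puzzling over the `s[8:-2]` slice in the test above: it appears to strip the `Column<'...'>` wrapper that a PySpark column's string form carries, leaving only the ISO text. A minimal sketch, assuming that wrapper shape; the sample value is made up for illustration and is not output captured from Spark.

```python
# Assumption: str(column) has the shape Column<'...'>.
# The ISO duration inside is an illustrative value, not real Spark output.
s = "Column<'P1DT2H3M4S'>"

prefix, suffix = "Column<'", "'>"

# The magic numbers in s[8:-2] are just the lengths of the wrapper parts.
assert len(prefix) == 8 and len(suffix) == 2

iso_by_slice = s[8:-2]

# An equivalent spelling that names the wrapper (Python 3.9+).
iso_by_name = s.removeprefix(prefix).removesuffix(suffix)

print(iso_by_slice)
```

The named form is a little longer, but it fails more visibly if the wrapper ever changes shape.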
Shouldn't this be a Connect-specific test?
It also works for PySpark Classic.
Classic also uses an ISO-8601 string, but the JVM side and pandas apply different units.
A string representation from the JVM side can also be parsed by pandas.
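The "different units" remark can be made concrete with a small sketch. It assumes the JVM side prints durations with hours as the largest unit (as `java.time.Duration` does), while pandas splits out days. `parse_iso_duration` is a hypothetical stdlib-only helper written for this illustration. It is not the pandas or PySpark implementation, and it handles only non-negative durations of the form `P[nD]T[nH][nM][nS]`.

```python
import datetime
import re

# Hypothetical helper for illustration only; not part of pandas or PySpark.
_ISO_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_iso_duration(text: str) -> datetime.timedelta:
    """Parse a non-negative ISO-8601 duration limited to D/H/M/S fields."""
    match = _ISO_DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    fields = match.groupdict()
    whole, _, frac = (fields["seconds"] or "0").partition(".")
    # Pad or truncate the fractional part to microsecond precision.
    micros = int((frac + "000000")[:6])
    return datetime.timedelta(
        days=int(fields["days"] or 0),
        hours=int(fields["hours"] or 0),
        minutes=int(fields["minutes"] or 0),
        seconds=int(whole),
        microseconds=micros,
    )


delta = datetime.timedelta(days=1, hours=2, minutes=3, seconds=4, microseconds=5)

# Illustrative strings for the same interval in the two styles.
jvm_style = "PT26H3M4.000005S"       # hours is the largest unit
pandas_style = "P1DT2H3M4.000005S"   # days are split out

from_jvm = parse_iso_duration(jvm_style)
from_pandas = parse_iso_duration(pandas_style)
```

Both strings differ textually but decode to the same `timedelta`, which is why the test compares parsed values rather than raw strings.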
This test will be run in both Classic and Connect.
thanks, merged to master
What changes were proposed in this pull request?
Refine the string representation of timedelta by following the ISO format.
Note that the units used on the JVM side (Duration) and in pandas are different.
Why are the changes needed?
We should not leak the raw internal value into the string representation.
Does this PR introduce any user-facing change?
yes
PySpark Classic:
PySpark Connect (before):
PySpark Connect (after):
How was this patch tested?
added test
Was this patch authored or co-authored using generative AI tooling?
no