
Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Sep 19, 2024

What changes were proposed in this pull request?

Refine the string representation of timedelta by following the ISO-8601 format.
Note that the units used on the JVM side (java.time.Duration) and in Pandas differ.
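
For illustration, a minimal sketch of where the two spellings come from. It assumes the Connect side formats via pandas.Timedelta.isoformat(), which the pandas import in the change under review suggests:

import datetime
import pandas as pd

delta = datetime.timedelta(days=1, seconds=1)

# pandas spells the interval with day-based units:
print(pd.Timedelta(delta).isoformat())   # P1DT0H0M1S

# java.time.Duration on the JVM side spells the same interval with
# time-based units only: PT24H1S (24 hours rather than 1 day).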

Why are the changes needed?

The string representation should not leak the raw internal value (the interval in microseconds); it should be a readable ISO-8601 duration instead.

Does this PR introduce any user-facing change?

Yes. The string representation of timedelta literals in PySpark Connect changes, as shown below.

PySpark Classic:

In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'PT24H1S'>

PySpark Connect (before):

In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'86401000000'>

PySpark Connect (after):

In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'P1DT0H0M1S'>

How was this patch tested?

Added a unit test; it runs under both PySpark Classic and PySpark Connect.

Was this patch authored or co-authored using generative AI tooling?

No.

delta = DayTimeIntervalType().fromInternal(self._value)
if delta is not None and isinstance(delta, datetime.timedelta):
    try:
        import pandas as pd
Member

Spark Connect requires pyarrow/pandas IIRC, so you won't need to handle exceptions here, if that's what you're covering.

Contributor Author

Yeah, pandas is a mandatory dependency for Connect; let me remove this try-catch.

s = str(sf.lit(delta))

# Strip the "Column<'" prefix and "'>" suffix, parse the remaining
# ISO-8601 string with pandas, and compare against the original value.
self.assertTrue(pd.Timedelta(s[8:-2]).to_pytimedelta() == delta)
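
For reference, a standalone sketch of what the s[8:-2] slice extracts (the string literal is copied from the Connect example above):

s = "Column<'P1DT0H0M1S'>"  # what str(sf.lit(delta)) renders after this PR
iso = s[8:-2]               # drop the 8-char "Column<'" prefix and the 2-char "'>" suffix
assert iso == "P1DT0H0M1S"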
Member

I think it should be a Connect-specific test?

Contributor Author

It also works for PySpark Classic.

Classic also uses an ISO-8601 string, but the JVM side and Pandas apply different units.

A string representation from the JVM side can also be parsed by Pandas, as sketched below.
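
A minimal sketch of that cross-parsing claim, with values taken from the examples above (assumes pandas is installed):

import datetime
import pandas as pd

delta = datetime.timedelta(days=1, seconds=1)

# pandas accepts both spellings of the same interval, so one assertion
# covers both Classic (JVM Duration) and Connect (pandas) output.
assert pd.Timedelta("PT24H1S").to_pytimedelta() == delta      # Classic / JVM form
assert pd.Timedelta("P1DT0H0M1S").to_pytimedelta() == delta   # Connect / pandas form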

Contributor Author

This test will be run in both Classic and Connect.

@zhengruifeng
Contributor Author

Thanks, merged to master.

@zhengruifeng zhengruifeng deleted the pc_lit_delta branch September 19, 2024 13:11