[WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware timestamp type when creating Pandas DataFrame. #18933
Conversation
Test build #80608 has finished for PR 18933 at commit
Test build #80622 has finished for PR 18933 at commit
    self.assertTrue(pdf_pst_naive.equals(pdf))
    ...
    self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
    self.spark.conf.unset('spark.sql.session.timeZone')
(Not a big deal, but we could use finally here, just in case this test fails, so that other tests don't get affected by this test failure in the future.)
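A minimal sketch of the suggested pattern, assuming the assertions quoted above are the body of the test:

    try:
        ...
        self.assertTrue(pdf_pst_naive.equals(pdf))
    finally:
        # Unset the confs even if an assertion fails, so later tests are
        # not affected by leftover session state.
        self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
        self.spark.conf.unset('spark.sql.session.timeZone')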
      .intConf
      .createWithDefault(10000)

    val PANDAS_TIMEZONE_AWARE =
There are other parts of PySpark that don't use the session local timezone, for instance df.collect() and (maybe) Python UDF execution.
I am worried about having those be inconsistent (some use the session local timezone, some don't) and complex (one configuration for each of these functionalities?).
While it will be harder to fix, how about we use one configuration to control the behavior of df.toPandas(), df.collect(), and Python UDFs with regard to the session local timezone?
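To illustrate the inconsistency being described, a hedged sketch (the timezone and timestamp values are made up for the example):

    from datetime import datetime

    spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
    df = spark.createDataFrame([(datetime(2017, 8, 15, 12, 0, 0),)], ['ts'])

    # df.collect() converts timestamps using the Python process's local
    # timezone, ignoring spark.sql.session.timeZone ...
    row_ts = df.collect()[0]['ts']

    # ... while a timezone-aware toPandas() (as proposed here) would use the
    # session local timezone, so the two can disagree whenever they differ.
    pdf = df.toPandas()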
Yes, I agree with this. There is also inconsistent behavior when bringing data into Spark, because TimestampType.toInternal does a conversion with local time and not with the session local timezone.
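For reference, a simplified paraphrase of what TimestampType.toInternal did at the time of this discussion (not the exact source):

    import calendar
    import time

    def to_internal(dt):
        # A naive datetime goes through time.mktime, which interprets it in
        # the *process-local* timezone, not spark.sql.session.timeZone.
        if dt.tzinfo is None:
            seconds = time.mktime(dt.timetuple())
        else:
            seconds = calendar.timegm(dt.utctimetuple())
        return int(seconds) * 1000000 + dt.microsecond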
BryanCutler left a comment
I think this change is just putting a band-aid over the real issues and creates an obscure config that still leaves inconsistencies in Spark's time zone handling. It is also going to be more confusing to the user - how do we recommend they use this conf? "Well, if you have a session local time zone set, and if you are going to export to Pandas, and if you want that Pandas DataFrame to have a timezone the same as your session."
If we are going to make changes here, they need to be more complete, so as to eliminate any inconsistencies. At that point, if a session local timezone is set, you could make the Pandas DataFrame timezone-aware without the need for another conf.
    pdf_naive = df.toPandas()
    self.assertEqual(pdf_naive['ts'][0].tzinfo, None)
    self.assertTrue(pdf_naive.equals(pdf))
This is not really a test that df.toPandas() is time zone naive. If that were true, then you should be able to do
    import os
    import time

    df = self.spark.createDataFrame([(ts,)], schema)
    os.environ["TZ"] = "America/New_York"
    time.tzset()
    pdf_naive = df.toPandas()
    self.assertTrue(pdf_naive.equals(pdf))
but this will fail, because toPandas() does a conversion to local time, which is the timezone the original data happens to be in.
I was wondering what your thoughts were on what this conf should do for the case that Arrow was enabled and ...
Hi @ueshin, I've been following SPARK-12297 PR #19250, which deals with some of the same issues as here. I think they are proposing a conf that the user could set to make timestamps tz-naive. Do you think that might apply here, or is it specific to SQL/Hive tables?
    for f, t in dtype.items():
        pdf[f] = pdf[f].astype(t, copy=False)

    if self.sql_ctx.getConf("spark.sql.execution.pandas.timeZoneAware", "false").lower() \
I'd like to treat it as a bug and always respect the session local timezone.
We still need a conf, even if it is a bug. This is just to avoid breaking any existing app. We can remove the conf in Spark 3.x.
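For concreteness, a rough sketch of what the conf-guarded branch in toPandas could do with pandas (localize_timestamps is a hypothetical helper, not the PR's actual code; whether the naive values should be read as UTC or as local time is exactly the inconsistency discussed above):

    from pyspark.sql.types import TimestampType

    def localize_timestamps(pdf, schema, session_tz):
        # Treat the collected naive timestamps as UTC, then convert them to
        # the session local timezone so the resulting columns are tz-aware.
        for field in schema:
            if isinstance(field.dataType, TimestampType):
                pdf[field.name] = (pdf[field.name]
                                   .dt.tz_localize('UTC')
                                   .dt.tz_convert(session_tz))
        return pdf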
      .doc("When true, make Pandas DataFrame with timezone-aware timestamp type when converting " +
        "by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.")
      .booleanConf
      .createWithDefault(false)
We can change the default to true, since we agree that this is a bug.
Ping. I ran into this exact issue with pandas_udf on a simple data set with a timestamp type column. @BryanCutler @ueshin @icexelloss @HyukjinKwon, any thoughts on how to fix this?
I thought this was resolved. @felixcheung, can you give an example of the issue you ran into?
I'd close this for now. We can open another PR if needed; we need another implementation anyway.
What changes were proposed in this pull request?
Make a Pandas DataFrame with a timezone-aware timestamp type when converting a DataFrame to a Pandas DataFrame by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.
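A usage sketch of the proposed behavior (the conf name is taken from the diff above; the dtype shown in the comment is illustrative):

    from datetime import datetime

    spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
    spark.conf.set('spark.sql.execution.pandas.timeZoneAware', 'true')

    df = spark.createDataFrame([(datetime(2017, 8, 15, 12, 0, 0),)], ['ts'])
    pdf = df.toPandas()

    # With the conf enabled, the column would be timezone-aware, e.g.
    # datetime64[ns, America/Los_Angeles], instead of naive datetime64[ns].
    print(pdf['ts'].dtype)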
How was this patch tested?
Added a test and existing tests.