
Conversation

@ueshin ueshin commented Aug 14, 2017

What changes were proposed in this pull request?

Make the Pandas DataFrame use a timezone-aware timestamp type when converting a DataFrame to a Pandas DataFrame via pyspark.sql.DataFrame.toPandas.
The session local timezone is used as the timezone.
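
For context, a minimal sketch of how the proposed behavior would look from the user's side (the conf name is the one added in this PR; the printed dtype and values are illustrative, not taken from an actual run):

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.conf.set("spark.sql.execution.pandas.timeZoneAware", "true")

df = spark.createDataFrame([(datetime.datetime(2017, 8, 14, 12, 0, 0),)], ["ts"])
pdf = df.toPandas()

# With the conf enabled, the 'ts' column should carry the session local
# timezone instead of being timezone-naive.
print(pdf["ts"].dtype)      # e.g. datetime64[ns, America/Los_Angeles]
print(pdf["ts"][0].tzinfo)  # not None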

How was this patch tested?

Added a test and ran the existing tests.

@ueshin ueshin changed the title [SPARK-21722][SQL][PYTHON] Enable timezone-aware timestamp type when creating Pandas DataFrame. [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware timestamp type when creating Pandas DataFrame. Aug 14, 2017
@SparkQA

SparkQA commented Aug 14, 2017

Test build #80608 has finished for PR 18933 at commit 0f182d0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80622 has finished for PR 18933 at commit 7df7ac9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

self.assertTrue(pdf_pst_naive.equals(pdf))

self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
self.spark.conf.unset('spark.sql.session.timeZone')
Member

(Not a big deal, but we could use finally here so that if this test fails in the future, the other tests are not affected by the leftover confs.)
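
For illustration, a minimal sketch of the suggested pattern (the test body and the self.df fixture are hypothetical; the conf names are the ones used in this PR):

def test_to_pandas_timezone_aware(self):
    # self.df is assumed to be a DataFrame with a timestamp column 'ts'.
    self.spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
    self.spark.conf.set('spark.sql.execution.pandas.timeZoneAware', 'true')
    try:
        pdf = self.df.toPandas()
        self.assertIsNotNone(pdf['ts'][0].tzinfo)
    finally:
        # Unset in finally so a failure above does not leak the confs
        # into the tests that run afterwards.
        self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
        self.spark.conf.unset('spark.sql.session.timeZone')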

.intConf
.createWithDefault(10000)

val PANDAS_TIMEZONE_AWARE =
Contributor

There are other parts of PySpark that don't use the session local timezone, for instance df.collect() and (maybe) Python UDF execution.

I am worried about having those be inconsistent (some use the session local timezone, some don't) and complex (one configuration for each of these features?).

It will be harder to fix, but how about using one configuration to control the behavior of df.toPandas(), df.collect(), and Python UDFs with respect to the session local timezone?

Member

Yes, I agree with this. There is also inconsistent behavior when bringing data into Spark, because TimestampType.toInternal converts using the local time rather than the session local timezone.
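
To make that concrete, a rough sketch of the kind of conversion being described, approximated with the standard library rather than the actual Spark source:

import calendar
import datetime
import time

def to_internal_approx(dt):
    # Approximation of how a datetime becomes Spark's internal microseconds:
    # naive datetimes are interpreted in the process-local timezone,
    # not the session local timezone.
    if dt.tzinfo is not None:
        seconds = calendar.timegm(dt.utctimetuple())  # aware: use its own offset
    else:
        seconds = time.mktime(dt.timetuple())         # naive: system local timezone
    return int(seconds) * 1000000 + dt.microsecond

# The same wall-clock value maps to different internal values depending on
# the machine's TZ setting, which is the inconsistency noted above.
print(to_internal_approx(datetime.datetime(2017, 8, 14, 12, 0, 0)))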


@BryanCutler BryanCutler left a comment


I think this change just puts a band-aid over the real issues and creates an obscure config that still leaves inconsistencies in Spark's time zone handling. It is also going to be more confusing to the user: how do we recommend they use this conf? "Well, if you have a session local time zone set, and if you are going to export to Pandas, and if you want that Pandas DataFrame to have the same timezone as your session."

If we are going to make changes here, it needs to be more complete, eliminating any inconsistencies. At that point, if a session local timezone is set, you could make the Pandas DataFrame timezone-aware without the need for another conf.


pdf_naive = df.toPandas()
self.assertEqual(pdf_naive['ts'][0].tzinfo, None)
self.assertTrue(pdf_naive.equals(pdf))
Member

This is not really a test that df.toPandas() is timezone-naive. If that were true, you should be able to do

df = self.spark.createDataFrame([(ts,)], schema)
os.environ["TZ"] = "America/New_York"
time.tzset()
pdf_naive = df.toPandas()
self.assertTrue(pdf_naive.equals(pdf))

but this will fail because toPandas() converts to local time, which happens to be the timezone the original data is in

@BryanCutler
Member

I was wondering what your thoughts were on what this conf should do in the case where Arrow is enabled and spark.sql.execution.pandas.timeZoneAware is false? Would the timezone be converted to local time and then removed, sort of the opposite of what is done here when it is true?
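
For reference, the "opposite" direction could be expressed in pandas roughly like this (purely illustrative; not what the PR currently does):

import pandas as pd

# A timezone-aware column, e.g. as the Arrow path might produce.
s = pd.Series(pd.to_datetime(["2017-08-14 12:00:00"])).dt.tz_localize("UTC")

# Convert to the local (or session) timezone, then drop the timezone so the
# resulting column is naive again.
naive = s.dt.tz_convert("America/New_York").dt.tz_localize(None)
print(naive.dtype)  # datetime64[ns]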

@BryanCutler
Member

Hi @ueshin, I've been following SPARK-12297 PR #19250, which deals with some of the same issues as here. I think they are proposing a conf that the user could set to make timestamps tz-naive. Do you think that might apply here, or is it specific to SQL/Hive tables?

for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)

if self.sql_ctx.getConf("spark.sql.execution.pandas.timeZoneAware", "false").lower() \
Contributor

I'd like to treat it as a bug and always respect the session local timezone.
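
As an illustration only, "respecting the session local timezone" in toPandas might look roughly like the following, assuming (hypothetically) that the naive values already represent that zone; this is not the PR's actual code:

import pandas as pd

def attach_session_tz(pdf, timestamp_cols, session_tz):
    # Hypothetical helper: mark naive timestamp columns with the session
    # local timezone instead of leaving them timezone-naive.
    for col in timestamp_cols:
        pdf[col] = pdf[col].dt.tz_localize(session_tz)
    return pdf

# pdf = attach_session_tz(df.toPandas(), ['ts'], spark.conf.get('spark.sql.session.timeZone'))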

Member

We still need a conf, even if it is a bug. This is just to avoid breaking any existing app. We can remove the conf in Spark 3.x.

.doc("When true, make Pandas DataFrame with timezone-aware timestamp type when converting " +
"by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.")
.booleanConf
.createWithDefault(false)
Member

We can change the default to true, since we agree that this is a bug.

@felixcheung
Member

Ping. I ran into this exact issue with pandas_udf on a simple data set with a timestamp-type column.
As far as I can tell, there is no way around this since the pandas code runs deep inside PySpark, and the only workaround is to make the column a string?

@BryanCutler @ueshin @icexelloss @HyukjinKwon any thought on how to fix this?
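
Along the lines of the workaround mentioned above, a minimal sketch (hypothetical names; not a fix) of casting the timestamp column to a string before it reaches the pandas_udf:

from pyspark.sql import functions as F

# Hypothetical stopgap: ship the timestamp as a string so no timestamp
# conversion happens on the pandas side; parse it inside the UDF if needed.
df_workaround = df.withColumn("ts", F.col("ts").cast("string"))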

@icexelloss
Contributor

I thought this was resolved. @felixcheung, can you give an example of the issue you ran into?

@felixcheung
Member

@ueshin
Member Author

ueshin commented Feb 12, 2018

I'll close this for now. We can open another PR if needed. We need a different implementation anyway.

@ueshin ueshin closed this Feb 12, 2018