
Conversation

@ueshin ueshin commented Aug 14, 2017

What changes were proposed in this pull request?

Make the Pandas DataFrame use a timezone-aware timestamp type when converting a DataFrame to a Pandas DataFrame via pyspark.sql.DataFrame.toPandas.
The session local timezone is used as the timezone.
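
For context, a minimal sketch of how the proposed behavior would look from the user's side (the conf name is the one added in this PR; the printed dtype and values are illustrative, not taken from an actual run):

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.conf.set("spark.sql.execution.pandas.timeZoneAware", "true")

df = spark.createDataFrame([(datetime.datetime(2017, 8, 14, 12, 0, 0),)], ["ts"])
pdf = df.toPandas()

# With the conf enabled, the 'ts' column should carry the session local
# timezone instead of being timezone-naive.
print(pdf["ts"].dtype)      # e.g. datetime64[ns, America/Los_Angeles]
print(pdf["ts"][0].tzinfo)  # not None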

How was this patch tested?

Added a test and ran the existing tests.

@ueshin ueshin changed the title [SPARK-21722][SQL][PYTHON] Enable timezone-aware timestamp type when creating Pandas DataFrame. [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware timestamp type when creating Pandas DataFrame. Aug 14, 2017
@SparkQA

SparkQA commented Aug 14, 2017

Test build #80608 has finished for PR 18933 at commit 0f182d0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80622 has finished for PR 18933 at commit 7df7ac9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

self.assertTrue(pdf_pst_naive.equals(pdf))

self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
self.spark.conf.unset('spark.sql.session.timeZone')
Member

(Not a big deal, but we could use finally here so that if this test fails in the future, the other tests are not affected by the leftover confs.)
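
For illustration, a minimal sketch of the suggested pattern (the test body and the self.df fixture are hypothetical; the conf names are the ones used in this PR):

def test_to_pandas_timezone_aware(self):
    # self.df is assumed to be a DataFrame with a timestamp column 'ts'.
    self.spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
    self.spark.conf.set('spark.sql.execution.pandas.timeZoneAware', 'true')
    try:
        pdf = self.df.toPandas()
        self.assertIsNotNone(pdf['ts'][0].tzinfo)
    finally:
        # Unset in finally so a failure above does not leak the confs
        # into the tests that run afterwards.
        self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
        self.spark.conf.unset('spark.sql.session.timeZone')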

.intConf
.createWithDefault(10000)

val PANDAS_TIMEZONE_AWARE =
Contributor

There are other parts of PySpark that don't use the session local timezone, for instance df.collect() and (maybe) Python UDF execution.

I am worried about having those be inconsistent (some use the session local timezone, some don't) and complex (one configuration for each of these features?).

It will be harder to fix, but how about using one configuration to control the behavior of df.toPandas(), df.collect(), and Python UDFs with respect to the session local timezone?

Member

Yes, I agree with this. There is also inconsistent behavior when bringing data into Spark, because TimestampType.toInternal converts using the local time rather than the session local timezone.
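
To make that concrete, a rough sketch of the kind of conversion being described, approximated with the standard library rather than the actual Spark source:

import calendar
import datetime
import time

def to_internal_approx(dt):
    # Approximation of how a datetime becomes Spark's internal microseconds:
    # naive datetimes are interpreted in the process-local timezone,
    # not the session local timezone.
    if dt.tzinfo is not None:
        seconds = calendar.timegm(dt.utctimetuple())  # aware: use its own offset
    else:
        seconds = time.mktime(dt.timetuple())         # naive: system local timezone
    return int(seconds) * 1000000 + dt.microsecond

# The same wall-clock value maps to different internal values depending on
# the machine's TZ setting, which is the inconsistency noted above.
print(to_internal_approx(datetime.datetime(2017, 8, 14, 12, 0, 0)))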


@BryanCutler BryanCutler left a comment


I think this change just puts a band-aid over the real issues and creates an obscure config that still leaves inconsistencies in Spark's time zone handling. It is also going to be more confusing to the user: how do we recommend they use this conf? "Well, if you have a session local time zone set, and if you are going to export to Pandas, and if you want that Pandas DataFrame to have the same timezone as your session."

If we are going to make changes here, it needs to be more complete, eliminating any inconsistencies. At that point, if a session local timezone is set, you could make the Pandas DataFrame timezone-aware without the need for another conf.


pdf_naive = df.toPandas()
self.assertEqual(pdf_naive['ts'][0].tzinfo, None)
self.assertTrue(pdf_naive.equals(pdf))
Member

This is not really a test that df.toPandas() is timezone-naive. If that were true, you should be able to do

df = self.spark.createDataFrame([(ts,)], schema)
os.environ["TZ"] = "America/New_York"
time.tzset()
pdf_naive = df.toPandas()
self.assertTrue(pdf_naive.equals(pdf))

but this will fail because toPandas() converts to local time, which happens to be the timezone the original data is in

@BryanCutler
Member

I was wondering what your thoughts were on what this conf should do in the case where Arrow is enabled and spark.sql.execution.pandas.timeZoneAware is false? Would the timezone be converted to local time and then removed, sort of the opposite of what is done here when it is true?
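
For reference, the "opposite" direction could be expressed in pandas roughly like this (purely illustrative; not what the PR currently does):

import pandas as pd

# A timezone-aware column, e.g. as the Arrow path might produce.
s = pd.Series(pd.to_datetime(["2017-08-14 12:00:00"])).dt.tz_localize("UTC")

# Convert to the local (or session) timezone, then drop the timezone so the
# resulting column is naive again.
naive = s.dt.tz_convert("America/New_York").dt.tz_localize(None)
print(naive.dtype)  # datetime64[ns]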

@BryanCutler
Member

Hi @ueshin, I've been following SPARK-12297 PR #19250, which deals with some of the same issues as here. I think they are proposing a conf that the user could set to make timestamps tz-naive. Do you think that might apply here, or is it specific to SQL/Hive tables?

for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)

if self.sql_ctx.getConf("spark.sql.execution.pandas.timeZoneAware", "false").lower() \
Contributor

I'd like to treat it as a bug and always respect the session local timezone.
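
As an illustration only, "respecting the session local timezone" in toPandas might look roughly like the following, assuming (hypothetically) that the naive values already represent that zone; this is not the PR's actual code:

import pandas as pd

def attach_session_tz(pdf, timestamp_cols, session_tz):
    # Hypothetical helper: mark naive timestamp columns with the session
    # local timezone instead of leaving them timezone-naive.
    for col in timestamp_cols:
        pdf[col] = pdf[col].dt.tz_localize(session_tz)
    return pdf

# pdf = attach_session_tz(df.toPandas(), ['ts'], spark.conf.get('spark.sql.session.timeZone'))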

Member

We still need a conf, even if it is a bug. This is just to avoid breaking any existing app. We can remove the conf in Spark 3.x.

.doc("When true, make Pandas DataFrame with timezone-aware timestamp type when converting " +
"by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.")
.booleanConf
.createWithDefault(false)
Member

We can change the default to true, since we agree that this is a bug.

@felixcheung
Member

Ping. I ran into this exact issue with pandas_udf on a simple data set with a timestamp-type column.
As far as I can tell, there is no way around this since the pandas code runs deep inside PySpark, and the only workaround is to make the column a string?

@BryanCutler @ueshin @icexelloss @HyukjinKwon any thought on how to fix this?
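
Along the lines of the workaround mentioned above, a minimal sketch (hypothetical names; not a fix) of casting the timestamp column to a string before it reaches the pandas_udf:

from pyspark.sql import functions as F

# Hypothetical stopgap: ship the timestamp as a string so no timestamp
# conversion happens on the pandas side; parse it inside the UDF if needed.
df_workaround = df.withColumn("ts", F.col("ts").cast("string"))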

@icexelloss
Contributor

I thought this was resolved. @felixcheung, can you give an example of the issue you ran into?

@felixcheung
Member

@ueshin
Member Author

ueshin commented Feb 12, 2018

I'll close this for now. We can open another PR if needed. We need a different implementation anyway.

@ueshin ueshin closed this Feb 12, 2018