[WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware timestamp type when creating Pandas DataFrame. #18933
Changes from all commits
First diff (the PySpark SQL test suite), hunk `@@ -2507,6 +2507,37 @@ def test_to_pandas(self):`

```python
        self.assertEquals(types[2], np.bool)
        self.assertEquals(types[3], np.float32)

    @unittest.skipIf(not _have_pandas, "Pandas not installed")
    def test_to_pandas_timezone_aware(self):
        import pandas as pd
        from dateutil import tz
        tzlocal = tz.tzlocal()
        ts = datetime.datetime(1970, 1, 1)
        pdf = pd.DataFrame.from_records([[ts]], columns=['ts'])

        self.spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')

        schema = StructType().add("ts", TimestampType())
        df = self.spark.createDataFrame([(ts,)], schema)

        pdf_naive = df.toPandas()
        self.assertEqual(pdf_naive['ts'][0].tzinfo, None)
        self.assertTrue(pdf_naive.equals(pdf))
```
Member (review comment on the assertions above): This is not really a test that …, but this will fail because …
```python
        self.spark.conf.set('spark.sql.execution.pandas.timeZoneAware', 'true')

        pdf_pst = df.toPandas()
        self.assertEqual(pdf_pst['ts'][0].tzinfo.zone, 'America/Los_Angeles')
        self.assertFalse(pdf_pst.equals(pdf))

        pdf_pst_naive = pdf_pst.copy()
        pdf_pst_naive['ts'] = pdf_pst_naive['ts'].apply(
            lambda ts: ts.tz_convert(tzlocal).tz_localize(None))
        self.assertTrue(pdf_pst_naive.equals(pdf))

        self.spark.conf.unset('spark.sql.execution.pandas.timeZoneAware')
        self.spark.conf.unset('spark.sql.session.timeZone')
```
Member (on the `conf.unset` calls): (Not a big deal, but we could use …)
```python
    def test_create_dataframe_from_array_of_long(self):
        import array
        data = [Row(longarray=array.array('l', [-9223372036854775808, 0, 9223372036854775807]))]
```
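Taken together, the new test pins down the following user-visible behavior. A minimal standalone sketch (the two conf keys are the ones used in this PR; `spark` is assumed to be an existing SparkSession):

```python
# Sketch of the behavior the new test exercises.
import datetime

from pyspark.sql.types import StructType, TimestampType

spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')

schema = StructType().add("ts", TimestampType())
df = spark.createDataFrame([(datetime.datetime(1970, 1, 1),)], schema)

# Default: naive timestamps, as before this change.
print(df.toPandas()['ts'][0].tzinfo)        # None

# With the new flag: timezone-aware timestamps in the session time zone.
spark.conf.set('spark.sql.execution.pandas.timeZoneAware', 'true')
print(df.toPandas()['ts'][0].tzinfo.zone)   # 'America/Los_Angeles'
```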
Second diff (SQLConf), hunk `@@ -912,6 +912,14 @@ object SQLConf {`

```scala
    .intConf
    .createWithDefault(10000)

  val PANDAS_TIMEZONE_AWARE =
```
Contributor: There are other parts of PySpark that don't use the session-local time zone, for instance `df.collect()` and (maybe) Python UDF execution. I am worried about having those be inconsistent (some use the local time zone, some don't) and complex (one configuration for each of these functionalities?). While it will be harder to fix, how about we use one configuration to control the behavior of …

Member: Yes, I agree with this. There is also inconsistent behavior when bringing data into Spark, because …
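A sketch of the inconsistency being described, assuming Spark 2.x's `collect()` behavior (naive datetimes rendered in the Python process's local time zone) and the flag from this PR; `df` is the timestamp DataFrame from the test above:

```python
spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
spark.conf.set('spark.sql.execution.pandas.timeZoneAware', 'true')

row_ts = df.collect()[0].ts      # naive datetime.datetime in the process-local zone
pdf_ts = df.toPandas()['ts'][0]  # tz-aware pandas Timestamp in the session zone

# If the process-local zone differs from the session time zone, the two code
# paths now present the same instant differently.
print(row_ts.tzinfo)       # None
print(pdf_ts.tzinfo.zone)  # 'America/Los_Angeles'
```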
```scala
    buildConf("spark.sql.execution.pandas.timeZoneAware")
      .internal()
      .doc("When true, make Pandas DataFrame with timezone-aware timestamp type when converting " +
        "by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.")
      .booleanConf
      .createWithDefault(false)
```
Member (on `createWithDefault(false)`): We can change the default to …
```scala
  object Deprecated {
    val MAPRED_REDUCE_TASKS = "mapred.reduce.tasks"
  }
```
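The round-trip at the end of the test (`tz_convert(tzlocal).tz_localize(None)`) is plain pandas. A self-contained sketch of the naive/aware mechanics the conf toggles, independent of Spark:

```python
# Plain-pandas illustration of naive vs. timezone-aware timestamps.
import pandas as pd
from dateutil import tz

naive = pd.Series(pd.to_datetime(['1970-01-01 00:00:00']))
print(naive.dtype)   # datetime64[ns] (naive)

# Interpret the naive values in the local zone, then convert to a target zone.
aware = naive.dt.tz_localize(tz.tzlocal()).dt.tz_convert('America/Los_Angeles')
print(aware.dtype)   # datetime64[ns, America/Los_Angeles]

# The reverse round-trip used by the test recovers the original naive values.
back = aware.dt.tz_convert(tz.tzlocal()).dt.tz_localize(None)
assert back.equals(naive)
```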
Follow-up discussion:

I'd like to treat it as a bug and always respect the session local timezone.

We still need a conf, even if it is a bug. This is just to avoid breaking any existing app. We can remove the conf in Spark 3.x.
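Given that resolution (fix the behavior, but keep a conf as an escape hatch until Spark 3.x), the Python side would branch on the conf. A hypothetical sketch of such a guard; the helper name, the use of `collect()`, and the localize/convert choice are illustrative assumptions, not the PR's actual implementation:

```python
# Hypothetical conf-guarded conversion for DataFrame.toPandas; only the two
# conf keys come from this PR, everything else is an illustrative assumption.
import pandas as pd
from dateutil import tz
from pyspark.sql.types import TimestampType


def to_pandas(df):
    pdf = pd.DataFrame.from_records(df.collect(), columns=df.columns)
    aware = df.sql_ctx.getConf('spark.sql.execution.pandas.timeZoneAware',
                               'false') == 'true'
    if aware:
        session_tz = df.sql_ctx.getConf('spark.sql.session.timeZone')
        for field in df.schema:
            if isinstance(field.dataType, TimestampType):
                # collect() yields naive datetimes in the process-local zone;
                # reinterpret them there, then convert to the session zone.
                col = pdf[field.name].dt.tz_localize(tz.tzlocal())
                pdf[field.name] = col.dt.tz_convert(session_tz)
    return pdf
```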