[SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function #8536
Conversation
Can you add a regression test for this, perhaps by taking the example that you gave and adding it to …
…h timestamp containing timezone
Unit test is added. Changed …
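For readers following along, a regression test for this behavior could look roughly like the sketch below. It is only an illustration, not the exact test added by the PR; the function name, the `sqlContext` argument, and the hand-rolled `FixedOffset` class (used here to avoid a `pytz` dependency) are all assumptions:

```python
from datetime import datetime, timedelta, tzinfo


class FixedOffset(tzinfo):
    """Fixed-offset timezone, so the test does not depend on pytz."""

    def __init__(self, hours):
        self._offset = timedelta(hours=hours)

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)


def test_filter_with_tz_aware_datetime(sqlContext):
    # Same wall-clock time with two different offsets: dt_plus1 is one hour
    # earlier as an absolute instant, so it must not compare equal to dt_utc.
    dt_utc = datetime(2015, 4, 17, 23, 1, 2, 3000, tzinfo=FixedOffset(0))
    dt_plus1 = datetime(2015, 4, 17, 23, 1, 2, 3000, tzinfo=FixedOffset(1))
    df = sqlContext.createDataFrame([(dt_utc,)], ['time'])
    # Before the fix the tzinfo was ignored, so both literals collapsed to the
    # same value and the equality filter wrongly matched the row.
    assert df.filter(df.time == dt_plus1).count() == 0
    assert df.filter(df.time > dt_plus1).count() == 1
```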
Jenkins, this is ok to test.
Test build #41836 has finished for PR 8536 at commit
Test build #41837 has finished for PR 8536 at commit
[SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function

This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162).

The issue is with the DataFrame `filter()` function: if a `datetime.datetime` is passed to it, then

* the timezone information of this datetime is ignored, and
* the datetime is assumed to be in the local timezone, which depends on the OS timezone setting.

The fix includes both the code change and a regression test.

Problem reproduction code on master:

```python
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *

sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))

m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')

df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
```

It gives the same timestamp, ignoring the time zone:

```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]
```

After the fix:

```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
 Scan PhysicalRDD[dt#0]
```

PR [8536](#8536) was accidentally closed by me when I dropped the repo.

Author: 0x0FFF <[email protected]>

Closes #8555 from 0x0FFF/SPARK-10162.
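For reference on the literals above: 946684800000000 microseconds is exactly 2000-01-01 00:00:00 UTC, which is why it is the post-fix value for the UTC-tagged datetime; 946695600000000 is 3 hours later, matching `Etc/GMT+3` (a UTC-3 zone, so its local midnight falls at 03:00 UTC); and the pre-fix 946713600000000 is 8 hours later, consistent with the datetime being interpreted in an OS-local UTC-8 timezone regardless of the tzinfo supplied.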
This PR addresses SPARK-10162
The change applied:
Convert the `datetime.datetime` object to a Unix timestamp (respecting its timezone, if set) before passing it to Java.
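As a rough illustration of that conversion, the sketch below shows one way to map a `datetime.datetime` to epoch microseconds while honoring an attached timezone. The helper name `_datetime_to_epoch_us` is illustrative and this is not the literal code of the PR; it only demonstrates the tz-aware versus naive distinction the fix is about:

```python
import calendar
import time


def _datetime_to_epoch_us(dt):
    """Convert a datetime.datetime to epoch microseconds for the JVM side.

    Timezone-aware datetimes are converted via their UTC time tuple, so the
    attached offset is honored; naive datetimes keep the previous behavior
    of being interpreted in the OS-local timezone.
    """
    if dt.tzinfo is not None:
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond
```

With a conversion like this, `datetime(2000, 1, 1, tzinfo=pytz.timezone('UTC'))` maps to 946684800000000 and the `Etc/GMT+3` variant to 946695600000000, matching the post-fix plans shown in the description above.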