Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Mar 19, 2020

What changes were proposed in this pull request?

In the PR, I propose to correct and re-use functions from DateTimeUtils for rebasing days before the cutover day 1582-10-15 in org.apache.spark.sql.hive.DaysWritable.

Why are the changes needed?

  1. Existing rebasing of days in DaysWritable is not correct.
  2. To deduplicate code in DaysWritable
  3. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5

Does this PR introduce any user-facing change?

This PR can introduce behavior change because the replaced code is different from the re-used code from DateTimeUtils.

How was this patch tested?

By existing test suite, for instance HiveOrcHadoopFsRelationSuite.

@SparkQA
Copy link

SparkQA commented Mar 19, 2020

Test build #120072 has finished for PR 27962 at commit c12cd47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Mar 20, 2020

@cloud-fan This should be considered as a bug fix. I will update the PR description.

// Julian calendar: 1582-01-01 -> -141704
// The code below converts -141714 to -141704.
def rebaseGregorianToJulianDays(daysSinceEpoch: Int): Int = {
if (daysSinceEpoch < DateTimeUtils.GREGORIAN_CUTOVER_DAY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we always rebase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like parquet/avro

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this follows the java Date/Timestamp conversion.

@HyukjinKwon HyukjinKwon changed the title [SPARK-31195][SQL] Reuse days rebase functions of DateTimeUtils in DaysWritable [SPARK-31195][SQL] Correct and reuse days rebase functions of DateTimeUtils in DaysWritable Mar 20, 2020
@HyukjinKwon
Copy link
Member

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request Mar 20, 2020
…meUtils` in `DaysWritable`

### What changes were proposed in this pull request?
In the PR, I propose to correct and re-use functions from `DateTimeUtils` for rebasing days before the cutover day `1582-10-15` in `org.apache.spark.sql.hive.DaysWritable`.

### Why are the changes needed?
0. Existing rebasing of days in `DaysWritable` is not correct.
1. To deduplicate code in `DaysWritable`
2. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5

### Does this PR introduce any user-facing change?
This PR can introduce behavior change because the replaced code is different from the re-used code from `DateTimeUtils`.

### How was this patch tested?
By existing test suite, for instance `HiveOrcHadoopFsRelationSuite`.

Closes #27962 from MaxGekk/reuse-rebase-funcs.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 6a66876)
Signed-off-by: HyukjinKwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Mar 26, 2020
… for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims the followings.
1. Modified ORC Vectorized Reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils. rebaseJulianToGregorianDays()` added by #27915 . The method performs rebasing days from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulted local date.
2. Introduced rebasing dates while saving ORC files, in particular, I modified `OrcShimUtils. getDateWritable` v1.2 and v2.3, and returned `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR #27890 (and fixed by #27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in ORC datasource.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Mar 26, 2020
… for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims the followings.
1. Modified ORC Vectorized Reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils. rebaseJulianToGregorianDays()` added by #27915 . The method performs rebasing days from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulted local date.
2. Introduced rebasing dates while saving ORC files, in particular, I modified `OrcShimUtils. getDateWritable` v1.2 and v2.3, and returned `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR #27890 (and fixed by #27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in ORC datasource.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d72ec85)
Signed-off-by: Dongjoon Hyun <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…meUtils` in `DaysWritable`

### What changes were proposed in this pull request?
In the PR, I propose to correct and re-use functions from `DateTimeUtils` for rebasing days before the cutover day `1582-10-15` in `org.apache.spark.sql.hive.DaysWritable`.

### Why are the changes needed?
0. Existing rebasing of days in `DaysWritable` is not correct.
1. To deduplicate code in `DaysWritable`
2. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5

### Does this PR introduce any user-facing change?
This PR can introduce behavior change because the replaced code is different from the re-used code from `DateTimeUtils`.

### How was this patch tested?
By existing test suite, for instance `HiveOrcHadoopFsRelationSuite`.

Closes apache#27962 from MaxGekk/reuse-rebase-funcs.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims the followings.
1. Modified ORC Vectorized Reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils. rebaseJulianToGregorianDays()` added by apache#27915 . The method performs rebasing days from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulted local date.
2. Introduced rebasing dates while saving ORC files, in particular, I modified `OrcShimUtils. getDateWritable` v1.2 and v2.3, and returned `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR apache#27890 (and fixed by apache#27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in ORC datasource.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.

Closes apache#28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@MaxGekk MaxGekk deleted the reuse-rebase-funcs branch June 5, 2020 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants