Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Mar 12, 2020

What changes were proposed in this pull request?

Move the code related to days rebasing from/to Julian calendar from HiveInspectors to new class DaysWritable.

Why are the changes needed?

To improve maintainability of the HiveInspectors trait which is already pretty complex.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By HiveOrcHadoopFsRelationSuite.

@SparkQA
Copy link

SparkQA commented Mar 12, 2020

Test build #119713 has finished for PR 27890 at commit 2000387.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DaysWritable(var gregorianDays: Int = 0) extends DateWritable

@SparkQA
Copy link

SparkQA commented Mar 12, 2020

Test build #119720 has finished for PR 27890 at commit af4549e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Mar 12, 2020

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Mar 12, 2020

Test build #119725 has finished for PR 27890 at commit af4549e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Mar 13, 2020

@cloud-fan @HyukjinKwon Please, take a look at this.

} else {
daysSinceEpoch
}
x.getPrimitiveWritableObject(data).getDays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the returned DateWritable here is not our custom version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, DateWritable can come here. I have fixed that by adding new constructor to DaysWritable to construct its instances from DateWritable.

@SparkQA
Copy link

SparkQA commented Mar 13, 2020

Test build #119749 has finished for PR 27890 at commit a017fbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

})
}

override def getDays: Int = gregorianDays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we return julianDays here to conform to the hive API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Proleptic Gregorian calendar: 1582-01-01 -> -141714
// Julian calendar: 1582-01-01 -> -141704
// The code below converts -141714 to -141704.
val writeDays = DaysWritable.rebaseGregorianToJulianDays(gregorianDays)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we cache the julianDays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@SparkQA
Copy link

SparkQA commented Mar 13, 2020

Test build #119761 has finished for PR 27890 at commit f606899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 57854c7 Mar 16, 2020
@HyukjinKwon
Copy link
Member

+1 Looks good to me too

cloud-fan pushed a commit that referenced this pull request Mar 16, 2020
### What changes were proposed in this pull request?
Move the code related to days rebasing from/to Julian calendar from `HiveInspectors` to new class `DaysWritable`.

### Why are the changes needed?
To improve maintainability of the `HiveInspectors` trait which is already pretty complex.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `HiveOrcHadoopFsRelationSuite`.

Closes #27890 from MaxGekk/replace-DateWritable-by-DaysWritable.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 57854c7)
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Mar 26, 2020
… for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims the followings.
1. Modified ORC Vectorized Reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils. rebaseJulianToGregorianDays()` added by #27915 . The method performs rebasing days from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulted local date.
2. Introduced rebasing dates while saving ORC files, in particular, I modified `OrcShimUtils. getDateWritable` v1.2 and v2.3, and returned `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR #27890 (and fixed by #27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in ORC datasource.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Mar 26, 2020
… for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims the followings.
1. Modified ORC Vectorized Reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils. rebaseJulianToGregorianDays()` added by #27915 . The method performs rebasing days from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulted local date.
2. Introduced rebasing dates while saving ORC files, in particular, I modified `OrcShimUtils. getDateWritable` v1.2 and v2.3, and returned `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR #27890 (and fixed by #27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in ORC datasource.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d72ec85)
Signed-off-by: Dongjoon Hyun <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?
Move the code related to days rebasing from/to Julian calendar from `HiveInspectors` to new class `DaysWritable`.

### Why are the changes needed?
To improve maintainability of the `HiveInspectors` trait which is already pretty complex.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `HiveOrcHadoopFsRelationSuite`.

Closes apache#27890 from MaxGekk/replace-DateWritable-by-DaysWritable.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims the followings.
1. Modified ORC Vectorized Reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils. rebaseJulianToGregorianDays()` added by apache#27915 . The method performs rebasing days from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulted local date.
2. Introduced rebasing dates while saving ORC files, in particular, I modified `OrcShimUtils. getDateWritable` v1.2 and v2.3, and returned `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR apache#27890 (and fixed by apache#27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in ORC datasource.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.

Closes apache#28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@MaxGekk MaxGekk deleted the replace-DateWritable-by-DaysWritable branch June 5, 2020 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants