[SPARK-23426][SQL] Use `hive` ORC impl and disable PPD for Spark 2.3.0 #20610

dongjoon-hyun · 2018-02-14T18:26:28Z

What changes were proposed in this pull request?

To prevent regressions, this PR changes the default ORC implementation back to `hive`, as in Spark 2.2.X.
Users can still enable the `native` ORC implementation. ORC predicate pushdown (PPD) is also restored to `false`, as in Spark 2.2.X.
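
As a rough illustration of the opt-in described above, the sketch below builds a session that switches back to the `native` implementation and turns pushdown on again. The two configuration keys come from this thread. The local master, the app name, and the scratch path are assumptions made only for the example, and `spark-sql` is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object NativeOrcOptIn {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session; master and app name are illustrative only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("native-orc-opt-in")
      // Opt back in to the native reader (this PR makes "hive" the default).
      .config("spark.sql.orc.impl", "native")
      // Opt back in to ORC predicate pushdown (this PR restores "false").
      .config("spark.sql.orc.filterPushdown", "true")
      .getOrCreate()

    // Hypothetical scratch location: a fresh, not-yet-existing subdirectory.
    val path = java.nio.file.Files
      .createTempDirectory("orc-demo")
      .resolve("t")
      .toString

    // Round-trip a small dataset through ORC with the settings above.
    spark.range(10).write.orc(path)
    spark.read.orc(path).filter("id > 5").show()

    spark.stop()
  }
}
```

The same keys can also be changed on a running session with `spark.conf.set`, which is what the test suites later in this thread do through `SQLConf.ORC_IMPLEMENTATION`.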

How was this patch tested?

Pass all test cases.

gatorsmile · 2018-02-14T18:29:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

    .stringConf
    .checkValues(Set("hive", "native"))
-    .createWithDefault("native")
+    .createWithDefault("hive")


We also need to disable the ORC pushdown, because the ORC reader of Hive 1.2.1 has a few bugs.

BTW, we don't have a test case for that, do we? Actually, I want to have a test case for that.
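
To make the effect of the one-line default change concrete, here is a small stand-alone model of a string setting with a checked value set and a default. It is not Spark's `SQLConf` builder: `CheckedStringConf` and `resolve` are hypothetical names, and returning `Either` instead of throwing is a choice made for this sketch.

```scala
object OrcImplEntry {
  // Hypothetical stand-in for a config entry with an allowed-value check.
  final case class CheckedStringConf(key: String, allowed: Set[String], default: String) {
    require(allowed.contains(default), s"default '$default' is not an allowed value for $key")

    // Resolve the effective value: fall back to the default when the user set nothing,
    // and reject anything outside the allowed set.
    def resolve(userValue: Option[String]): Either[String, String] = userValue match {
      case None                           => Right(default)
      case Some(v) if allowed.contains(v) => Right(v)
      case Some(v)                        => Left(s"$key must be one of ${allowed.mkString(", ")}, got '$v'")
    }
  }

  // The entry as it looked before this PR, and after flipping only the default.
  val before: CheckedStringConf =
    CheckedStringConf("spark.sql.orc.impl", Set("hive", "native"), "native")
  val after: CheckedStringConf = before.copy(default = "hive")
}
```

In this model, users who never set the key move from `native` to `hive`, while anyone who sets `native` explicitly keeps the native reader.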

gatorsmile · 2018-02-14T18:30:35Z

docs/sql-programming-guide.md

      <tr>
        <td><code>spark.sql.orc.impl</code></td>
-        <td><code>native</code></td>
+        <td><code>hive</code></td>


We do not need this in the migration guide. Please create a new section for ORC

Is there a reason the impl was changed back to the old implementation? This breaks `spark.read.orc`.

dongjoon-hyun · 2018-02-14T18:43:41Z

docs/sql-programming-guide.md

+    <td><code>true</code></td>
+    <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
+  </tr>
+</table>


@gatorsmile . Now, this becomes a section.

SparkQA · 2018-02-14T20:48:40Z

Test build #87450 has finished for PR 20610 at commit 2d74b20.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-02-14T20:55:23Z

Ur, can I make another PR to fix the test failures?

Error Message
org.apache.spark.sql.AnalysisException: Hive built-in ORC data source must be used with Hive support enabled. Please use the native ORC data source by setting 'spark.sql.orc.impl' to 'native';

SparkQA · 2018-02-14T21:00:30Z

Test build #87451 has finished for PR 20610 at commit 7ff4ccf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-02-14T21:10:50Z

I think it makes sense to fix the test cases in the same PR, as long as they are not bug fixes.

dongjoon-hyun · 2018-02-14T21:29:02Z

No problem.

dongjoon-hyun · 2018-02-14T21:56:12Z

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala

+  override def afterAll(): Unit = {
+    spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
+    super.afterAll()
+  }


The test coverage is the same.

gatorsmile · 2018-02-14T23:48:44Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala

+
+  override def afterAll(): Unit = {
+    spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
+    super.afterAll()


try { spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION) } finally { super.afterAll() }

Thanks. Yep. It's done.
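
The point of the `try`/`finally` suggestion can be shown without Spark. In the sketch below, `failingUnset` and `superAfterAll` are hypothetical stand-ins for `unsetConf(...)` and `super.afterAll()`. The first teardown skips the parent cleanup when the unset step throws, and the second one always runs it.

```scala
import scala.collection.mutable.ArrayBuffer

object TeardownOrder {
  // Records which cleanup steps actually ran.
  val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Stand-in for the parent suite's afterAll().
  def superAfterAll(): Unit = {
    log += "super.afterAll"
    ()
  }

  // Stand-in for an unsetConf(...) call that fails.
  def failingUnset(): Unit = throw new IllegalStateException("unset failed")

  // Sequential form: if the first step throws, the parent cleanup is never reached.
  def sequentialTeardown(): Unit = {
    failingUnset()
    superAfterAll()
  }

  // Guarded form: the parent cleanup runs even when the first step throws,
  // and the original exception still propagates to the caller.
  def guardedTeardown(): Unit =
    try failingUnset() finally superAfterAll()
}
```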

gatorsmile · 2018-02-14T23:50:50Z

docs/sql-programming-guide.md


+## ORC Files
+
+Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC serde table (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is set to `true`.


table -> tables

is set to true -> is also set to true?

Ur, there are multiple occurrences of "is set to true". Which part do you mean?

viirya · 2018-02-15T00:37:12Z

docs/sql-programming-guide.md

+    <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
+  </tr>
+</table>
+


Has the description of `spark.sql.orc.filterPushdown` disappeared?

Yes. Pushdown is disabled again, so the description was removed. @viirya

SparkQA · 2018-02-15T01:04:35Z

Test build #87453 has finished for PR 20610 at commit 46c8697.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with BeforeAndAfterAll

dongjoon-hyun · 2018-02-15T01:16:09Z

docs/sql-programming-guide.md

+native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
+is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
+serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
+the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is set to `true`.


@viirya . I split into multiple lines. Could you point out once more?

when spark.sql.hive.convertMetastoreOrc is (also) set to true?

Thank you. I see.
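
One way to read the wording settled on here ("is also set to true") is as a small decision rule. The sketch below models that reading only. It is not taken from Spark's code, and `OrcTable`, `OrcSettings`, and `usesVectorizedReader` are hypothetical names introduced for the example.

```scala
object VectorizedOrcRule {
  sealed trait OrcTable
  // e.g. a table created with USING ORC
  case object NativeOrcTable extends OrcTable
  // e.g. a table created with USING HIVE OPTIONS (fileFormat 'ORC')
  case object HiveSerdeOrcTable extends OrcTable

  // The three settings named in the documentation paragraph under review.
  final case class OrcSettings(
      impl: String,                    // spark.sql.orc.impl
      enableVectorizedReader: Boolean, // spark.sql.orc.enableVectorizedReader
      convertMetastoreOrc: Boolean)    // spark.sql.hive.convertMetastoreOrc

  def usesVectorizedReader(table: OrcTable, s: OrcSettings): Boolean = {
    // Condition shared by both table kinds.
    val nativeVectorized = s.impl == "native" && s.enableVectorizedReader
    table match {
      case NativeOrcTable    => nativeVectorized
      // "also": Hive serde tables additionally need convertMetastoreOrc.
      case HiveSerdeOrcTable => nativeVectorized && s.convertMetastoreOrc
    }
  }
}
```

Under this reading, a session that keeps `impl` at `hive` never uses the vectorized reader, regardless of the other two settings.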

SparkQA · 2018-02-15T03:39:04Z

Test build #87460 has finished for PR 20610 at commit 2769633.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-15T05:53:38Z

Test build #87468 has finished for PR 20610 at commit 183ec21.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-02-15T06:05:19Z

Retest this please.

gatorsmile · 2018-02-15T07:24:31Z

docs/sql-programming-guide.md

+  <tr>
+    <td><code>spark.sql.orc.impl</code></td>
+    <td><code>hive</code></td>
+    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in Hive 1.2.1 which is used prior to Spark 2.3.</td>


Remove "which is used prior to Spark 2.3"?

SparkQA · 2018-02-15T08:05:02Z

Test build #87471 has finished for PR 20610 at commit 183ec21.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-15T09:19:40Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala

 class FileStreamSinkSuite extends StreamTest {
  import testImplicits._

+  override def beforeAll(): Unit = {


nit: a simpler way to fix this

override val conf = super.conf.copy(SQLConf.ORC_IMPLEMENTATION -> "native")

Hi, @cloud-fan .
I tested it, but that doesn't work in this FileStreamSinkSuite.

SparkQA · 2018-02-15T11:21:04Z

Test build #87475 has finished for PR 20610 at commit 19b50b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-02-15T16:55:13Z

LGTM

Thanks! Merged to master/2.3

## What changes were proposed in this pull request?

To prevent any regressions, this PR changes ORC implementation to `hive` by default like Spark 2.2.X. Users can enable `native` ORC. Also, ORC PPD is also restored to `false` like Spark 2.2.X.

![orc_section](https://user-images.githubusercontent.com/9700541/36221575-57a1d702-1173-11e8-89fe-dca5842f4ca7.png)

## How was this patch tested?

Pass all test cases.

Author: Dongjoon Hyun <[email protected]>

Closes #20610 from dongjoon-hyun/SPARK-ORC-DISABLE.

(cherry picked from commit 2f0498d)
Signed-off-by: gatorsmile <[email protected]>

dongjoon-hyun · 2018-02-15T17:00:02Z

Thank you, @gatorsmile , @cloud-fan , and @viirya .

Use 'hive' for ORC

2d74b20

dongjoon-hyun changed the title ~~Use 'hive' for ORC~~ [SPARK-23426][SQL] Use hive ORC implementation for Spark 2.3.0 Feb 14, 2018

gatorsmile reviewed Feb 14, 2018

dongjoon-hyun changed the title ~~[SPARK-23426][SQL] Use hive ORC implementation for Spark 2.3.0~~ [SPARK-23426][SQL] Use hive ORC impl and disable PPD for Spark 2.3.0 Feb 14, 2018

dongjoon-hyun added 2 commits February 14, 2018 10:33

Disable PPD back.

93e6c7d

Add ORC section.

7ff4ccf

dongjoon-hyun commented Feb 14, 2018

fix test case.

46c8697

dongjoon-hyun commented Feb 14, 2018

gatorsmile reviewed Feb 14, 2018

Address comment.

2769633

viirya reviewed Feb 15, 2018

dongjoon-hyun commented Feb 15, 2018

Add also.

183ec21

gatorsmile reviewed Feb 15, 2018

Remove.

19b50b1

cloud-fan reviewed Feb 15, 2018

asfgit closed this in 2f0498d Feb 15, 2018

dongjoon-hyun deleted the SPARK-ORC-DISABLE branch February 15, 2018 17:00

