Conversation

@dongjoon-hyun (Member) commented Feb 2, 2018

What changes were proposed in this pull request?

This PR adds a migration guide documentation for ORC.

![orc-guide](https://user-images.githubusercontent.com/9700541/36123859-ec165cae-1002-11e8-90b7-7313be7a81a5.png)

How was this patch tested?

N/A.

@dongjoon-hyun (Member Author):

cc @gatorsmile and @cloud-fan .

native
</td>
<td>
The name of ORC implementation: 'native' means the native version of ORC support instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.

Member:

the native version of ORC support -> the native ORC support that is built on Apache ORC 1.4.1

Member Author:

Yep.

</tr>
<tr>
<td>
spark.sql.orc.columnarReaderBatchSize

Member:

This is not available in 2.3, right?

Member Author:

Oops. My bad.

true
</td>
<td>
Enables the built-in ORC reader and writer to process Hive ORC tables, instead of Hive serde. It is 'false' by default prior to Spark 2.3.

Member:

End users might ask: what is the built-in ORC reader/writer?

Member Author:

What about the following?

Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive metastore ORC tables

Member:

Hive metastore ORC tables are still not straightforward. : )

Member Author:

I borrowed it from the following in the same doc.

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.

Member Author:

Then, Hive ORC tables?

Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables

@SparkQA commented Feb 2, 2018

Test build #86969 has finished for PR 20484 at commit 20f99c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86970 has finished for PR 20484 at commit 1bb23ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86971 has finished for PR 20484 at commit df08899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86973 has finished for PR 20484 at commit 239714a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@felixcheung (Member) left a comment:

TBH, it feels a bit verbose for the migration guide section; perhaps it should have its own ORC section? Anyway, just my 2c.


## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.

Member:

Re: "the following configurations are newly added or change their default values". These are all new, right?

Member Author:

The last two are existing ones~

true
</td>
<td>
Enables vectorized orc decoding in 'native' implementation. If 'false', a new non-vectorized ORC reader is used in 'native' implementation.

Member:

should say it doesn't affect the hive implementation perhaps?

Member Author:

Done.

native
</td>
<td>
The name of ORC implementation: 'native' means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.

Member:

use backticks around values?

Member Author:

Yep.

true
</td>
<td>
Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is `false` by default prior to Spark 2.3.

Member:

How about?

Enable the Spark's ORC support, which can be configured by spark.sql.orc.impl, instead of ...

Member Author:

Sounds good. I'll update like this.


@SparkQA commented Feb 2, 2018

Test build #86978 has finished for PR 20484 at commit 7b3b0a4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86977 has finished for PR 20484 at commit fc5b395.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

<th>
<b>Meaning</b>
</th>
</tr>

Member:

Can we lay out the above HTML tags similarly to the other tables in this doc? E.g.,

<table class="table">
  <tr><th>Property Name</th><th>Meaning</th></tr>
  <tr>

Member Author:

No problem.

## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.

Member:

Shall we separate newly added configurations and changed ones?

Member Author:

Yep. Now, we have two tables.

<code>native</code>
</td>
<td>
The name of ORC implementation: <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is <code>hive</code> by default prior to Spark 2.3.

Member:

I think this is a newly added config, so "It is <code>hive</code> by default prior to Spark 2.3." sounds like it is an existing config before 2.3.

Member Author:

It's also updated.

@SparkQA commented Feb 2, 2018

Test build #86984 has finished for PR 20484 at commit cb149f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

@dongjoon-hyun Do our native readers respect Hive confs? If not, we need to clearly document it. I think this is a common question from existing users of the ORC readers, since those confs were respected in prior versions.

@dongjoon-hyun (Member Author) commented Feb 2, 2018

Actually, I handed all confs over to the ORC library. The following are the supported ORC configurations.

Please note that the confs are interpreted under both the Hive conf names and the ORC conf names. Yesterday, @tgravescs asked about hive.exec.orc.split.strategy; for now, it's not registered (recognized) by the ORC library, according to OrcConf.java.

cc @tgravescs .

</tr>
</table>

- Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, you can use ORC configuration name and Hive configuration name. To see a full list of supported ORC configurations, see <a href="https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/OrcConf.java">OrcConf.java</a>.

Member Author:

@gatorsmile . I added this note.

Member:

We might need to explicitly mention they need to specify the corresponding ORC configuration names when they explicitly or implicitly use the native readers.

Member Author:

For supported confs, OrcConf provides a pair of ORC/Hive key names. The ORC keys are recommended but not required.

  STRIPE_SIZE("orc.stripe.size", "hive.exec.orc.default.stripe.size",
      64L * 1024 * 1024, "Define the default ORC stripe size, in bytes."),
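
For illustration only (not part of this PR): since each OrcConf pair maps two names to one setting, either key can be set on the Hadoop configuration that Spark hands down to the ORC library. A minimal Scala sketch, using the 64 MB value from the snippet above:

// ORC key name
spark.sparkContext.hadoopConfiguration
  .set("orc.stripe.size", (64L * 1024 * 1024).toString)
// ...or the paired Hive key name from the same OrcConf entry
spark.sparkContext.hadoopConfiguration
  .set("hive.exec.orc.default.stripe.size", (64L * 1024 * 1024).toString)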

Member:

You mean these Hive confs work for our native readers? Could you add test cases for them?

Member Author:

It's possible in another PR. BTW, about the test coverage,

  • Do you want to see specifically only orc.stripe.size and hive.exec.orc.default.stripe.size?
  • Did we have test coverage for the old Hive ORC code path before?

Member:

You can do a search. We need to improve our ORC test coverage for sure.

If possible, please add test cases to see whether both orc.stripe.size and hive.exec.orc.default.stripe.size work for both of Spark's ORC readers. We also need the same tests to check whether hive.exec.orc.default.stripe.size works for Hive serde tables.

To ensure the correctness of the documentation, I hope we can at least submit a PR testing them before merging this PR.

@dongjoon-hyun (Member Author) Feb 3, 2018:

Yep. +1. I'll make another PR for that today, @gatorsmile.
(I was wondering if I need to do this for all the other Hive/ORC configurations.)

@gatorsmile (Member) Feb 3, 2018:

Yes. We can check whether some important confs work.

For example,

create table if not exists vectororc (s1 string, s2 string)
stored as ORC tblproperties(
  "orc.row.index.stride"="1000",
  "hive.exec.orc.default.stripe.size"="100000",
  "orc.compress.size"="10000");

After auto conversion, are these confs in tblproperties still used by our native readers?

We also need to check whether the confs set in the configuration file are recognized by our native readers.

Member:

Any update on this? @dongjoon-hyun

Member Author:

Sorry for the late response, @gatorsmile.

Your example here is a mixed scenario. First of all, I made a PR, #20517, "Add ORC configuration tests for ORC data source". It adds test coverage for the ORC and Hive configuration names for both the native and hive OrcFileFormat. That PR focuses on name compatibility for those important confs.

For convertMetastoreOrc, the table properties are retained when we check via spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName)). However, they seem to be ignored in some cases. I guess the same happens for Parquet. I'm working on it separately.
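
As a quick way to inspect the retained properties (an illustrative sketch, not from this PR; "vectororc" refers to the example table created earlier in this thread):

// The tblproperties from the CREATE TABLE example should show up here
// if they are retained in the catalog metadata.
spark.sql("SHOW TBLPROPERTIES vectororc").show(false)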

@SparkQA commented Feb 2, 2018

Test build #87011 has finished for PR 20484 at commit 436c0f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

Are they still effective in Hive?

I just want to confirm whether all the Hive readers work fine. Could you add a test case like what we did in CliSuite?

@gatorsmile (Member):

@dongjoon-hyun This is still a regression for the existing Hive ORC users. cc @cloud-fan @sameeragarwal Maybe we should fix it before the release?

<tr>
<td><code>spark.sql.hive.convertMetastoreOrc</code></td>
<td><code>true</code></td>
<td>Enable the Spark's ORC support, which can be configured by <code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and writing to Hive ORC tables. It is <code>false</code> by default prior to Spark 2.3.</td>

Contributor:

This isn't entirely clear to me. I assume this has to be true for spark.sql.orc.impl to work? If so, perhaps we should mention it above under spark.sql.orc.impl. If this is false, what happens? Can it not read the ORC format, or does it just fall back to spark.sql.orc.impl=hive?

@dongjoon-hyun (Member Author) Feb 7, 2018:

Yes. This has to be true, but only for Hive ORC tables. For the other Spark tables created by 'USING ORC', it is irrelevant.

spark.sql.orc.impl and spark.sql.hive.convertMetastoreOrc are orthogonal: spark.sql.orc.impl=hive with spark.sql.hive.convertMetastoreOrc=true converts Hive ORC tables into the legacy OrcFileFormat based on Hive 1.2.1.
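
A minimal sketch of the distinction (table names are placeholders; a Hive-enabled session is assumed):

// Spark data source ORC table: read by the implementation chosen via
// spark.sql.orc.impl; spark.sql.hive.convertMetastoreOrc is irrelevant here.
spark.sql("CREATE TABLE t_src (id INT) USING ORC")

// Hive ORC serde table: read through Spark's ORC readers only when
// spark.sql.hive.convertMetastoreOrc=true; otherwise Hive SerDe is used.
spark.sql("CREATE TABLE t_hive (id INT) STORED AS ORC")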

</tr>
</table>

- Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, see <a href="https://orc.apache.org/docs/hive-config.html">Hive Configuration</a> of Apache ORC project for a full list of supported ORC configurations.

Contributor:

How does one specify these configurations? Is it simply --conf hive.stats.gather.num.threads=8, or do you still have to specify spark.hadoop.hive.stats.gather.num.threads?

Contributor:

It might be clearer here if we say something like: The native ORC implementation uses the Apache ORC 1.4.1 standalone library. This means only a subset of the Hive ORC related configurations are supported. See ...

Member Author:

For user configurations, both the spark.hadoop. prefix and hive-site.xml work, as sketched below.
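
For illustration, a minimal sketch of the prefix approach (the app name, key, and value are examples only, not specific to this PR):

import org.apache.spark.sql.SparkSession

// Settings prefixed with `spark.hadoop.` are copied into the Hadoop
// configuration that Spark hands to the ORC library; a hive-site.xml on
// the classpath works as well.
val spark = SparkSession.builder()
  .appName("orc-conf-example")
  .config("spark.hadoop.hive.exec.orc.default.stripe.size", "134217728")
  .getOrCreate()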

@gatorsmile (Member):

Just FYI, #20536 is reverting the conf convertMetastoreOrc back to false. However, we can still turn it on by default in 2.3 after we fix the regression.

Thanks!

@dongjoon-hyun (Member Author):

I see. I removed spark.sql.hive.convertMetastoreOrc and the Hive ORC table material from this PR accordingly. We can add it back later once we fix the regression of convertMetastoreOrc/Parquet.

@tgravescs (Contributor):

Why did you remove the bit about the ORC configs?

@SparkQA commented Feb 7, 2018

Test build #87177 has finished for PR 20484 at commit 40c8e02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Oh, I thought of it in this way, @tgravescs.

@tgravescs (Contributor):

ok

@SparkQA commented Feb 12, 2018

Test build #87342 has finished for PR 20484 at commit 59e957a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For creating ORC tables, `USING ORC` or `USING HIVE` syntaxes are recommended.

Member:

When users create tables by USING HIVE, we are using the ORC library in Hive 1.2.1 to read/write ORC tables unless they manually change spark.sql.hive.convertMetastoreOrc to true.

That last sentence is confusing to me.

Member Author:

Hm. Right. What about mentioning that convertMetastoreOrc is safe with USING HIVE, then?

Member:

Just describe the scenario in which the new vectorized ORC reader will be used. I think that will be enough.

Member Author:

Okay. I see. Thanks!


## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`. With `spark.sql.hive.convertMetastoreOrc=true`, it will for the tables created by `USING HIVE OPTIONS (fileFormat 'ORC')`, too.

Member:

The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader to true. For the Hive ORC serde table (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is set to true.
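
A sketch of the conditions described above as session-level settings (illustrative only; note that after the revert in #20536, spark.sql.hive.convertMetastoreOrc is not on by default):

// Native ORC tables (e.g., created with USING ORC):
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

// Additionally, for Hive ORC serde tables
// (e.g., created with USING HIVE OPTIONS (fileFormat 'ORC')):
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")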

Member Author:

Thanks!

@gatorsmile (Member):

LGTM except the above comment.

@SparkQA commented Feb 12, 2018

Test build #87351 has finished for PR 20484 at commit 6136d25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 12, 2018

Test build #87353 has finished for PR 20484 at commit 8ae87fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 12, 2018

Test build #87354 has finished for PR 20484 at commit 6887d19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Feb 12, 2018
## What changes were proposed in this pull request?

This PR adds a migration guide documentation for ORC.

![orc-guide](https://user-images.githubusercontent.com/9700541/36123859-ec165cae-1002-11e8-90b7-7313be7a81a5.png)

## How was this patch tested?

N/A.

Author: Dongjoon Hyun <[email protected]>

Closes #20484 from dongjoon-hyun/SPARK-23313.

(cherry picked from commit 6cb5970)
Signed-off-by: gatorsmile <[email protected]>

@gatorsmile (Member):

Thanks! Merged to master and 2.3.

@asfgit closed this in 6cb5970 on Feb 12, 2018

@dongjoon-hyun (Member Author):

Thank you, @gatorsmile , @tgravescs , @felixcheung .

@dongjoon-hyun deleted the SPARK-23313 branch on February 12, 2018 23:36