Conversation

@dongjoon-hyun (Member) commented Feb 2, 2018

What changes were proposed in this pull request?

This PR adds a migration guide documentation for ORC.

![orc-guide](https://user-images.githubusercontent.com/9700541/36123859-ec165cae-1002-11e8-90b7-7313be7a81a5.png)

How was this patch tested?

N/A.

@dongjoon-hyun (Member Author):

cc @gatorsmile and @cloud-fan .

native
</td>
<td>
The name of ORC implementation: 'native' means the native version of ORC support instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.

Member:

the native version of ORC support -> the native ORC support that is built on Apache ORC 1.4.1

Member Author:

Yep.

</tr>
<tr>
<td>
spark.sql.orc.columnarReaderBatchSize

Member:

This is not available in 2.3, right?

Member Author:

Oops. My bad.

true
</td>
<td>
Enables the built-in ORC reader and writer to process Hive ORC tables, instead of Hive serde. It is 'false' by default prior to Spark 2.3.

Member:

End users might ask: what is the built-in ORC reader/writer?

Member Author:

What about the following?

Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive metastore ORC tables

Member:

Hive metastore ORC tables are still not straightforward. : )

Member Author:

I borrowed it from the following in the same doc.

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.

Member Author:

Then, Hive ORC tables?

Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables

@SparkQA commented Feb 2, 2018

Test build #86969 has finished for PR 20484 at commit 20f99c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86970 has finished for PR 20484 at commit 1bb23ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86971 has finished for PR 20484 at commit df08899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86973 has finished for PR 20484 at commit 239714a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@felixcheung (Member) left a comment:

TBH, it feels a bit verbose for the migration guide section; perhaps it should have its own ORC section? Anyway, just my 2c.


## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.

Member:

Re: "the following configurations are newly added or change their default values". These are all new, right?

Member Author:

The last two are existing ones~

true
</td>
<td>
Enables vectorized orc decoding in 'native' implementation. If 'false', a new non-vectorized ORC reader is used in 'native' implementation.

Member:

should say it doesn't affect the hive implementation perhaps?

Member Author:

Done.

native
</td>
<td>
The name of ORC implementation: 'native' means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.

Member:

use backticks around values?

Member Author:

Yep.

true
</td>
<td>
Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is `false` by default prior to Spark 2.3.

Member:

How about?

Enable the Spark's ORC support, which can be configured by spark.sql.orc.impl, instead of ...

Member Author:

Sounds good. I'll update like this.


@SparkQA commented Feb 2, 2018

Test build #86978 has finished for PR 20484 at commit 7b3b0a4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86977 has finished for PR 20484 at commit fc5b395.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

<th>
<b>Meaning</b>
</th>
</tr>

Member:

Can we lay out the above HTML tags similarly to the other tables in this doc? E.g.,

<table class="table">
  <tr><th>Property Name</th><th>Meaning</th></tr>
  <tr>

Member Author:

No problem.

## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.

Member:

Shall we separate newly added configurations and changed ones?

Member Author:

Yep. Now, we have two tables.

<code>native</code>
</td>
<td>
The name of ORC implementation: <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is <code>hive</code> by default prior to Spark 2.3.

Member:

I think this is a newly added config, so "It is <code>hive</code> by default prior to Spark 2.3." sounds like it is an existing config before 2.3.

Member Author:

It's also updated.

@SparkQA commented Feb 2, 2018

Test build #86984 has finished for PR 20484 at commit cb149f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

@dongjoon-hyun Do our native readers respect Hive confs? If not, we need to clearly document it. I think this is a common question from existing users of the ORC readers, since those confs were respected in prior versions.

@dongjoon-hyun (Member Author) commented Feb 2, 2018

Actually, I handed all confs over to the ORC library. The following are the supported ORC configurations.

Please note that the confs are interpreted under both the Hive conf names and the ORC conf names. Yesterday, @tgravescs asked about hive.exec.orc.split.strategy; for now, it's not registered (recognized) by the ORC library, according to OrcConf.java.

cc @tgravescs .

</tr>
</table>

- Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, you can use ORC configuration name and Hive configuration name. To see a full list of supported ORC configurations, see <a href="https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/OrcConf.java">OrcConf.java</a>.

Member Author:

@gatorsmile . I added this note.

Member:

We might need to explicitly mention they need to specify the corresponding ORC configuration names when they explicitly or implicitly use the native readers.

Member Author:

For supported confs, OrcConf provides a pair of ORC/Hive key names. The ORC keys are recommended but not required.

  STRIPE_SIZE("orc.stripe.size", "hive.exec.orc.default.stripe.size",
      64L * 1024 * 1024, "Define the default ORC stripe size, in bytes."),
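
For illustration only (not part of this PR): since each OrcConf pair maps two names to one setting, either key can be set on the Hadoop configuration that Spark hands down to the ORC library. A minimal Scala sketch, using the 64 MB value from the snippet above:

// ORC key name
spark.sparkContext.hadoopConfiguration
  .set("orc.stripe.size", (64L * 1024 * 1024).toString)
// ...or the paired Hive key name from the same OrcConf entry
spark.sparkContext.hadoopConfiguration
  .set("hive.exec.orc.default.stripe.size", (64L * 1024 * 1024).toString)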

Member:

You mean these Hive confs work for our native readers? Could you add test cases for them?

Member Author:

It's possible in another PR. BTW, about the test coverage,

  • Do you want to see specifically only orc.stripe.size and hive.exec.orc.default.stripe.size?
  • Did we have test coverage for the old Hive ORC code path before?

Member:

You can do a search. We need to improve our ORC test coverage for sure.

If possible, please add test cases to see whether both orc.stripe.size and hive.exec.orc.default.stripe.size work for both of Spark's ORC readers. We also need the same tests to check whether hive.exec.orc.default.stripe.size works for Hive serde tables.

To ensure the correctness of the documentation, I hope we can at least submit a PR testing them before merging this PR.

@dongjoon-hyun (Member Author) Feb 3, 2018:

Yep. +1. I'll make another PR for that today, @gatorsmile.
(I was wondering if I need to do this for all the other Hive/ORC configurations.)

@gatorsmile (Member) Feb 3, 2018:

Yes. We can check whether some important confs work.

For example,

create table if not exists vectororc (s1 string, s2 string)
stored as ORC tblproperties(
  "orc.row.index.stride"="1000",
  "hive.exec.orc.default.stripe.size"="100000",
  "orc.compress.size"="10000");

After auto conversion, are these confs in tblproperties still used by our native readers?

We also need to check whether the confs set in the configuration file are recognized by our native readers.

Member:

Any update on this? @dongjoon-hyun

Member Author:

Sorry for the late response, @gatorsmile.

Your example here is a mixed scenario. First of all, I made a PR, #20517, "Add ORC configuration tests for ORC data source". It adds test coverage for the ORC and Hive configuration names for both the native and hive OrcFileFormat. That PR focuses on name compatibility for those important confs.

For convertMetastoreOrc, the table properties are retained when we check via spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName)). However, they seem to be ignored in some cases. I guess the same happens for Parquet. I'm working on it separately.
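
As a quick way to inspect the retained properties (an illustrative sketch, not from this PR; "vectororc" refers to the example table created earlier in this thread):

// The tblproperties from the CREATE TABLE example should show up here
// if they are retained in the catalog metadata.
spark.sql("SHOW TBLPROPERTIES vectororc").show(false)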

@SparkQA commented Feb 2, 2018

Test build #87011 has finished for PR 20484 at commit 436c0f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

Are they still effective in Hive?

I just want to confirm whether all the Hive readers work fine. Could you add a test case like what we did in CliSuite?

@gatorsmile (Member):

@dongjoon-hyun This is still a regression for the existing Hive ORC users. cc @cloud-fan @sameeragarwal Maybe we should fix it before the release?

<tr>
<td><code>spark.sql.hive.convertMetastoreOrc</code></td>
<td><code>true</code></td>
<td>Enable the Spark's ORC support, which can be configured by <code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and writing to Hive ORC tables. It is <code>false</code> by default prior to Spark 2.3.</td>

Contributor:

This isn't entirely clear to me. I assume this has to be true for spark.sql.orc.impl to work? If so, perhaps we should mention it above under spark.sql.orc.impl. If this is false, what happens? Can it not read the ORC format, or does it just fall back to spark.sql.orc.impl=hive?

@dongjoon-hyun (Member Author) Feb 7, 2018:

Yes. This has to be true, but only for Hive ORC tables. For the other Spark tables created by 'USING ORC', it is irrelevant.

spark.sql.orc.impl and spark.sql.hive.convertMetastoreOrc are orthogonal: spark.sql.orc.impl=hive with spark.sql.hive.convertMetastoreOrc=true converts Hive ORC tables into the legacy OrcFileFormat based on Hive 1.2.1.
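
A minimal sketch of the distinction (table names are placeholders; a Hive-enabled session is assumed):

// Spark data source ORC table: read by the implementation chosen via
// spark.sql.orc.impl; spark.sql.hive.convertMetastoreOrc is irrelevant here.
spark.sql("CREATE TABLE t_src (id INT) USING ORC")

// Hive ORC serde table: read through Spark's ORC readers only when
// spark.sql.hive.convertMetastoreOrc=true; otherwise Hive SerDe is used.
spark.sql("CREATE TABLE t_hive (id INT) STORED AS ORC")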

</tr>
</table>

- Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, see <a href="https://orc.apache.org/docs/hive-config.html">Hive Configuration</a> of Apache ORC project for a full list of supported ORC configurations.

Contributor:

How does one specify these configurations? Is it simply --conf hive.stats.gather.num.threads=8, or do you still have to specify spark.hadoop.hive.stats.gather.num.threads?

Contributor:

It might be clearer here if we say something like: The native ORC implementation uses the Apache ORC 1.4.1 standalone library. This means only a subset of the Hive ORC related configurations are supported. See ...

Member Author:

For user configurations, both the spark.hadoop. prefix and hive-site.xml work, as sketched below.
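
For illustration, a minimal sketch of the prefix approach (the app name, key, and value are examples only, not specific to this PR):

import org.apache.spark.sql.SparkSession

// Settings prefixed with `spark.hadoop.` are copied into the Hadoop
// configuration that Spark hands to the ORC library; a hive-site.xml on
// the classpath works as well.
val spark = SparkSession.builder()
  .appName("orc-conf-example")
  .config("spark.hadoop.hive.exec.orc.default.stripe.size", "134217728")
  .getOrCreate()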

@gatorsmile (Member):

Just FYI, #20536 is reverting the conf convertMetastoreOrc back to false. However, we can still turn it on by default in 2.3 after we fix the regression.

Thanks!

@dongjoon-hyun (Member Author):

I see. I removed spark.sql.hive.convertMetastoreOrc and the Hive ORC table material from this PR accordingly. We can add it back later once we fix the regression of convertMetastoreOrc/Parquet.

@tgravescs (Contributor):

Why did you remove the bit about the ORC configs?

@SparkQA commented Feb 7, 2018

Test build #87177 has finished for PR 20484 at commit 40c8e02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Oh, I thought of it in this way, @tgravescs.

@tgravescs (Contributor):

ok

@SparkQA commented Feb 12, 2018

Test build #87342 has finished for PR 20484 at commit 59e957a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For creating ORC tables, `USING ORC` or `USING HIVE` syntaxes are recommended.

Member:

When users create tables by USING HIVE, we are using the ORC library in Hive 1.2.1 to read/write ORC tables unless they manually change spark.sql.hive.convertMetastoreOrc to true.

That last sentence is confusing to me.

Member Author:

Hm. Right. What about mentioning that convertMetastoreOrc is safe with USING HIVE, then?

Member:

Just describe the scenario in which the new vectorized ORC reader will be used. I think that will be enough.

Member Author:

Okay. I see. Thanks!


## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`. With `spark.sql.hive.convertMetastoreOrc=true`, it will for the tables created by `USING HIVE OPTIONS (fileFormat 'ORC')`, too.

Member:

The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader to true. For the Hive ORC serde table (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is set to true.
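
A sketch of the conditions described above as session-level settings (illustrative only; note that after the revert in #20536, spark.sql.hive.convertMetastoreOrc is not on by default):

// Native ORC tables (e.g., created with USING ORC):
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

// Additionally, for Hive ORC serde tables
// (e.g., created with USING HIVE OPTIONS (fileFormat 'ORC')):
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")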

Member Author:

Thanks!

@gatorsmile (Member):

LGTM except the above comment.

@SparkQA commented Feb 12, 2018

Test build #87351 has finished for PR 20484 at commit 6136d25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 12, 2018

Test build #87353 has finished for PR 20484 at commit 8ae87fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 12, 2018

Test build #87354 has finished for PR 20484 at commit 6887d19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Feb 12, 2018
## What changes were proposed in this pull request?

This PR adds a migration guide documentation for ORC.

![orc-guide](https://user-images.githubusercontent.com/9700541/36123859-ec165cae-1002-11e8-90b7-7313be7a81a5.png)

## How was this patch tested?

N/A.

Author: Dongjoon Hyun <[email protected]>

Closes #20484 from dongjoon-hyun/SPARK-23313.

(cherry picked from commit 6cb5970)
Signed-off-by: gatorsmile <[email protected]>

@gatorsmile (Member):

Thanks! Merged to master and 2.3.

@asfgit closed this in 6cb5970 on Feb 12, 2018

@dongjoon-hyun (Member Author):

Thank you, @gatorsmile , @tgravescs , @felixcheung .

@dongjoon-hyun deleted the SPARK-23313 branch on February 12, 2018 23:36