Conversation

@dongjoon-hyun (Member) commented Feb 14, 2018

What changes were proposed in this pull request?

To prevent any regressions, this PR changes the ORC implementation back to `hive` by default, as in Spark 2.2.X; users can still opt in to the `native` implementation. ORC PPD (predicate pushdown) is likewise restored to `false`, matching Spark 2.2.X.

![orc_section](https://user-images.githubusercontent.com/9700541/36221575-57a1d702-1173-11e8-89fe-dca5842f4ca7.png)
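As a quick illustration (not part of the PR), a user who still wants the native reader and pushdown can opt back in at runtime. The config keys below are the ones this PR touches; the path is hypothetical:

```scala
// Usage sketch (assumed session `spark`): opt back in to the native ORC reader and PPD.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Reads then go through the native, vectorized ORC path.
val df = spark.read.orc("/path/to/data.orc")  // hypothetical path
```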

How was this patch tested?

Pass all test cases.

@dongjoon-hyun dongjoon-hyun changed the title Use 'hive' for ORC [SPARK-23426][SQL] Use hive ORC implementation for Spark 2.3.0 Feb 14, 2018
   .stringConf
   .checkValues(Set("hive", "native"))
-  .createWithDefault("native")
+  .createWithDefault("hive")
Member

We also need to disable the ORC pushdown, because the ORC reader of Hive 1.2.1 has a few bugs.
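For context, a rough sketch (not the exact PR diff) of what restoring the Spark 2.2.X pushdown default would look like in SQLConf, following the same builder pattern as the `spark.sql.orc.impl` entry above:

```scala
// Sketch only: ORC predicate pushdown goes back to the Spark 2.2.X default (disabled).
val ORC_FILTER_PUSHDOWN_ENABLED = buildConf("spark.sql.orc.filterPushdown")
  .doc("When true, enable filter pushdown for ORC files.")
  .booleanConf
  .createWithDefault(false)
```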

@dongjoon-hyun (Member Author) Feb 14, 2018

BTW, we don't have a test case for that, do we? Actually, I'd like to add one.

 <tr>
   <td><code>spark.sql.orc.impl</code></td>
-  <td><code>native</code></td>
+  <td><code>hive</code></td>
Member

We do not need this in the migration guide. Please create a new section for ORC.

Member Author

Yep.


Is there a reason the impl was changed back to the old implementation? This breaks `spark.read.orc`.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23426][SQL] Use hive ORC implementation for Spark 2.3.0 [SPARK-23426][SQL] Use hive ORC impl and disable PPD for Spark 2.3.0 Feb 14, 2018
<td><code>true</code></td>
<td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
</tr>
</table>
Member Author

@gatorsmile. Now this becomes its own section.

Member

Thanks!

@SparkQA commented Feb 14, 2018

Test build #87450 has finished for PR 20610 at commit 2d74b20.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Ur, can I make another PR to fix the test failures?

Error Message
org.apache.spark.sql.AnalysisException: Hive built-in ORC data source must be used with Hive support enabled. Please use the native ORC data source by setting 'spark.sql.orc.impl' to 'native';
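For reference, a minimal sketch (assumed setup, not from this PR) of the two ways around that error: build the session with Hive support, or switch back to the native ORC source:

```scala
import org.apache.spark.sql.SparkSession

// Option 1: enable Hive support so the `hive` ORC implementation can be used.
val spark = SparkSession.builder()
  .appName("orc-example")       // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()

// Option 2: stay on a plain session and use the native ORC data source instead.
spark.conf.set("spark.sql.orc.impl", "native")
val df = spark.read.orc("/path/to/orc")  // hypothetical path
```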

@SparkQA commented Feb 14, 2018

Test build #87451 has finished for PR 20610 at commit 7ff4ccf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

I think it makes sense to fix the test cases in the same PR, as long as they are not bug fixes.

@dongjoon-hyun (Member Author)

No problem.

override def afterAll(): Unit = {
  spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
  super.afterAll()
}
Member Author

The test coverage is the same.


override def afterAll(): Unit = {
  spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
  super.afterAll()
Member

    try {
      spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
    } finally {
      super.afterAll()
    }

Member Author

Thanks. Yep. It's done.


## ORC Files

Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC serde table (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is set to `true`.
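For illustration, a small Scala sketch (table name and values are hypothetical) of how these settings combine:

```scala
// Native ORC tables (USING ORC) use the vectorized reader with these two settings.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

spark.sql("CREATE TABLE orc_demo (id INT, name STRING) USING ORC")  // hypothetical table
spark.sql("INSERT INTO orc_demo VALUES (1, 'a')")
spark.table("orc_demo").show()

// For Hive ORC serde tables, metastore conversion must also be enabled.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
```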
Member

table -> tables

Member

is set to true -> is also set to true?

Member Author

Ur, there are multiple occurrences of "is set to true". Which part do you mean?

<td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
</tr>
</table>

Member

The description of `spark.sql.orc.filterPushdown` has disappeared?

Member Author

Yes, it's disabled again (back to the Spark 2.2.X default). @viirya

@SparkQA commented Feb 15, 2018

Test build #87453 has finished for PR 20610 at commit 46c8697.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with BeforeAndAfterAll

native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is set to `true`.
Member Author

@viirya. I split it into multiple lines. Could you point it out once more?

@viirya (Member) Feb 15, 2018

when spark.sql.hive.convertMetastoreOrc is (also) set to true?

Member Author

Thank you. I see.

@SparkQA commented Feb 15, 2018

Test build #87460 has finished for PR 20610 at commit 2769633.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 15, 2018

Test build #87468 has finished for PR 20610 at commit 183ec21.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please.

<tr>
<td><code>spark.sql.orc.impl</code></td>
<td><code>hive</code></td>
<td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in Hive 1.2.1 which is used prior to Spark 2.3.</td>
Member

Remove "which is used prior to Spark 2.3"?

Member Author

Thanks!

@SparkQA commented Feb 15, 2018

Test build #87471 has finished for PR 20610 at commit 183ec21.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

class FileStreamSinkSuite extends StreamTest {
  import testImplicits._

  override def beforeAll(): Unit = {
@cloud-fan (Contributor) Feb 15, 2018

nit: a simpler way to fix this

override val conf = super.conf.copy(SQLConf.ORC_IMPLEMENTATION -> "native")

Member Author

Hi, @cloud-fan. I tested it, but that doesn't work in this FileStreamSinkSuite.
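For reference, a sketch of the beforeAll/afterAll override pattern the PR uses instead (the exact setConf call in beforeAll is assumed; `spark` comes from the shared test session):

```scala
override def beforeAll(): Unit = {
  super.beforeAll()
  // Pin this suite to the native ORC implementation.
  spark.sessionState.conf.setConf(SQLConf.ORC_IMPLEMENTATION, "native")
}

override def afterAll(): Unit = {
  try {
    spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
  } finally {
    super.afterAll()
  }
}
```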

@SparkQA commented Feb 15, 2018

Test build #87475 has finished for PR 20610 at commit 19b50b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Feb 15, 2018
## What changes were proposed in this pull request?

To prevent any regressions, this PR changes the ORC implementation back to `hive` by default, as in Spark 2.2.X; users can still opt in to the `native` implementation. ORC PPD (predicate pushdown) is likewise restored to `false`, matching Spark 2.2.X.

![orc_section](https://user-images.githubusercontent.com/9700541/36221575-57a1d702-1173-11e8-89fe-dca5842f4ca7.png)

## How was this patch tested?

Pass all test cases.

Author: Dongjoon Hyun <[email protected]>

Closes #20610 from dongjoon-hyun/SPARK-ORC-DISABLE.

(cherry picked from commit 2f0498d)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in 2f0498d Feb 15, 2018
@dongjoon-hyun (Member Author)

Thank you, @gatorsmile , @cloud-fan , and @viirya .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-ORC-DISABLE branch February 15, 2018 17:00