
Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

Migrate ORC file format read path to data source V2.

Supports:

  1. Scan ColumnarBatch
  2. Scan UnsafeRow
  3. Push down filters
  4. Push down required columns

Not supported (due to limitations of data source V2):

  1. Read multiple file paths
  2. Read bucketed files.

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Mar 29, 2018

Test build #88702 has finished for PR 20933 at commit 40b33c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FallBackToOrcV1(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • class EmptyDataReader[T] extends DataReader[T]
  • case class OrcBatchDataReaderFactory(
  • case class OrcColumnarBatchDataReader(iter: Iterator[InternalRow])
  • class OrcDataSourceV2 extends DataSourceV2 with ReadSupport with ReadSupportWithSchema
  • class OrcDataSourceReader(options: DataSourceOptions, userSpecifiedSchema: Option[StructType])
  • case class OrcUnsafeRowReaderFactory(
  • case class OrcUnsafeRowDataReader(iter: Iterator[InternalRow])
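
For orientation, a minimal hypothetical sketch of how classes like those listed above fit together under the Spark 2.3-era data source V2 read API. Class and method names beyond the list are assumptions, not this PR's actual code, and signatures follow my reading of the 2.3 API rather than the patch itself:

```scala
import java.util.{Collections, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical skeleton (not the PR's code): the source hands out a DataSourceReader,
// which reports a schema and produces one reader factory per planned file split.
// Column pruning, filter pushdown, and the ColumnarBatch / UnsafeRow scan variants
// would be layered on via the SupportsPushDown* / SupportsScan* mixins.
class ExampleOrcDataSourceV2 extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new ExampleOrcDataSourceReader(options)
}

class ExampleOrcDataSourceReader(options: DataSourceOptions) extends DataSourceReader {
  // A real reader infers this from the ORC footer (or takes a user-specified
  // schema via ReadSupportWithSchema); hardcoded here to keep the sketch small.
  override def readSchema(): StructType = new StructType().add("id", LongType)

  // One factory per file split; empty here, so this example source scans nothing.
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    Collections.emptyList()
}
```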

Member

requestedPartitionColds -> requestedPartitionColIds

@gatorsmile
Member

gatorsmile commented Mar 29, 2018

Let us trigger more tests by changing spark.sql.sources.default to orc and see whether all the tests can pass.
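
For reference, one way to try this locally for a single session, without editing the default in the source tree (this assumes a SparkSession named spark is in scope):

```scala
// Make ORC the default format so that format-agnostic code paths and tests
// exercise the ORC source instead of Parquet.
spark.conf.set("spark.sql.sources.default", "orc")
spark.range(10).write.saveAsTable("t")  // now written with the ORC source
```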

@SparkQA

SparkQA commented Mar 29, 2018

Test build #88711 has finished for PR 20933 at commit a3e084a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

@jose-torres @cloud-fan

Contributor

@jose-torres left a comment

A few high level questions.

Contributor

It seems weird that DataFrameReader is modified here. Will DataSourceV2 implementations generally need to modify DataFrameReader, or is it just a temporary hack because of the mentioned lack of support? In the latter case, is there a plan to add this support soon?

Member Author

This is a temporary hack. I think @cloud-fan will create a PR to support reading multiple files soon.

Contributor

Yeah, it's a temporary hack. We will support multi-path soon.

Contributor

What about bucketed reads? Will they need a similar change here, or is that lack of support handled elsewhere? (Or am I misunderstanding something about that part of the description? I'm not super familiar with the ORC source.)

Contributor

We only support bucketing with tables, while data source v2 can't work with tables yet.

Contributor

How does v2 not support reading multiple files?

Contributor

Because we need to define how to pass multiple paths via options. I have a PR to fix it; I'll bring it up to date.

Contributor

Would it make sense to split the refactoring changes into their own PR? It's hard to tell at a glance which parts of the change are refactoring and which are new V2 implementation.

Member Author

Yes that is a good idea.

Contributor

I'm afraid these refactors only make sense in this PR, for reusing code between v1 and v2.

Contributor

Why does this only make sense for this PR? It looks like this is a reasonable refactor that could be stand-alone.

Contributor

Moving code needs a reason. The reason here is to help us reuse the code. But if we do it in another PR, what is the reason? It doesn't make the code clearer, IMO.

Contributor

Better organization to support other changes like this one is the reason.

@jose-torres was right to point out that these changes are self-contained enough to go in a separate PR, and @gengliangwang and I both agreed. Why make this commit larger than necessary?

Contributor

If we agree that a separate PR is self-contained and can help this PR, I'm also OK with it.

Contributor

Yeah, I think the commit itself would be self-contained reorganization. The motivation is to refactor for this PR, which is okay.

Contributor

Shouldn't Spark handle this already when it sees that OrcDataSourceV2 doesn't implement WriteSupport?

Member Author

Not for InsertIntoTable.

Contributor

Is this also a temporary hack, then? It seems like Spark should know it can't write to a source which doesn't implement WriteSupport, no matter what the shape of the query performing the write is.

Contributor

This is a temporary hack. v1 has the same problem: when inserting into a table backed by a non-writable data source, Spark would fail during planning.

Contributor

I agree with @jose-torres. If there is a general problem when writes aren't supported, then shouldn't this be a generic rule that provides a good error message?
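
For illustration, the generic guard being suggested could be as small as this (purely hypothetical names; inside Spark the error would likely surface as an AnalysisException during analysis rather than as shown here):

```scala
import org.apache.spark.sql.sources.v2.{DataSourceV2, WriteSupport}

// Purely illustrative: one generic check with a clear message for any v2 source
// that cannot accept writes, instead of a per-source rule like FallBackToOrcV1.
def requireWriteSupport(source: DataSourceV2): WriteSupport = source match {
  case ws: WriteSupport => ws
  case _ => throw new UnsupportedOperationException(
    s"Data source ${source.getClass.getName} does not support writes")
}
```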

Contributor

Add param docs for them.

Contributor

It's not only missing columns, but also partition columns.

Contributor

There is a lot of duplication between this and OrcBatchDataReaderFactory, and we may have more when migrating other file formats.

@SparkQA

SparkQA commented Mar 31, 2018

Test build #88773 has finished for PR 20933 at commit 29de999.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

also cc @rdblue

@SparkQA

SparkQA commented Apr 3, 2018

Test build #88843 has finished for PR 20933 at commit e4cd8a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2018

Test build #88859 has finished for PR 20933 at commit ffbf2f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2018

Test build #88897 has finished for PR 20933 at commit 35b74c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Apr 5, 2018

Test build #88923 has finished for PR 20933 at commit 35b74c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@cloud-fan
Contributor

retest this please

Contributor

Why is this class public? Isn't this internal to HadoopFsRelation's v2 implementation?

Contributor

spark.sql.disabledV2DataSources

Member Author

This follows disabledV2StreamingMicroBatchReaders, and currently this PR only supports reading.

Member

We need a better name.

Contributor

its v2 implementation is disabled. Reads from these sources will fall back to the V1 implementation.
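
For concreteness, a sketch of how such an entry is usually declared inside object SQLConf; the key follows the name suggested above and the doc text follows this thread, but both are still under discussion, so treat this as an assumption rather than the PR's final wording:

```scala
// Sketch only: a fragment that would live in org.apache.spark.sql.internal.SQLConf.
val DISABLED_V2_FILE_DATA_SOURCES = buildConf("spark.sql.disabledV2DataSources")
  .doc("A comma-separated list of data source short names for which the v2 " +
    "implementation is disabled. Reads from these sources will fall back to the " +
    "v1 implementation.")
  .stringConf
  .createWithDefault("")
```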

Contributor

This implies requestedColIds[i] and requestedPartitionColIds[i] may both be non-negative. Is that possible?

Member Author

@gengliangwang commented Apr 5, 2018

Yes, here requestedColIds means the actual required columns, including the partition columns.

Contributor

we should also apply this check in the copyToSpark branch

Contributor

Also add a comment for this.

Contributor

We only support bucketing with tables, while data source v2 can't work with tables yet.

Contributor

We can also move this log to PartitionedFileUtil.maxSplitBytes

Contributor

don't hardcode it

Contributor

Why do we have both PartitionedFileUtil and FilePartitionUtil?

Member Author

PartitionedFileUtil is about how we get PartitionedFiles.
FilePartitionUtil is about how we get FilePartitions and convert them to InternalRows.
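
For orientation, the two building blocks look roughly like this (constructor shapes are from my reading of Spark 2.3's internal datasources package; treat them as approximate):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}

// A PartitionedFile is one readable chunk of one file; a FilePartition groups
// several chunks into the unit of work handled by one Spark task.
val chunk = PartitionedFile(
  partitionValues = InternalRow.empty,
  filePath = "/tmp/data/part-00000.orc",
  start = 0L,
  length = 134217728L)
val task = FilePartition(index = 0, files = Seq(chunk))
```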

Contributor

Explain why this works, i.e. we use a type erasure hack to return ColumnarBatch.
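
To spell out the trick for readers of this thread, an illustrative reduction of the type-erasure hack (not the PR's code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Generics are erased at runtime, so an iterator of ColumnarBatch can be exposed
// through an Iterator[InternalRow]-typed interface. A consumer that knows the scan
// is columnar casts the elements back to ColumnarBatch; nothing in between checks
// the element type, which is exactly why this deserves a comment in the code.
def eraseToRowIterator(batches: Iterator[ColumnarBatch]): Iterator[InternalRow] =
  batches.asInstanceOf[Iterator[InternalRow]]
```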

Contributor

Which file source doesn't support this? I think all file sources support partitioning.

Member Author

I thought some of the file sources would choose SupportsPushDownFilters instead of SupportsPushDownCatalystFilters. Not very sure about this.

Contributor

we should not remove tests

Contributor

I think it is a bad idea to continue using PartitionedFile => Iterator[InternalRow] in v2.

I understand not wanting to change much about how this works, just to get the code behind the v2 API. But this pattern is broken and causes resource problems that the v2 API nudges implementers to fix.

What resource problems? This doesn't implement close properly, forcing close to happen at task end by calling functions registered when files are opened. We've gone back through and replaced the iterators with closeable versions so that we release resources more quickly because the callback-based close does not scale.

I would like to see this problem fixed instead of copying it into v2.
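
A minimal sketch of the closeable-reader shape being described, with hypothetical names and no ORC specifics; the point is only that the reader owns its resource and releases it as soon as the split is exhausted, instead of waiting for a task-completion callback:

```scala
import java.io.Closeable

import org.apache.spark.sql.catalyst.InternalRow

// Illustrative only: a per-split reader that closes its underlying file handle
// eagerly when the split runs out of rows (and stays idempotent if closed again
// at task end).
class EagerlyClosingRowReader(rows: Iterator[InternalRow], resource: Closeable) {
  private var closed = false

  def next(): Boolean = {
    val hasNext = rows.hasNext
    if (!hasNext) close()  // release the file handle as soon as the split is done
    hasNext
  }

  def get(): InternalRow = rows.next()

  def close(): Unit = if (!closed) {
    closed = true
    resource.close()
  }
}
```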

Member Author

Yes, I was quite frustrated when I updated the code to use PartitionedFile => Iterator[InternalRow] as V1 did.
I was trying to reduce duplicated code between the vectorized reader and the unsafe row reader, and to reuse the code in FileScanRDD.
I know this makes the V2 implementation meaningless. I will keep looking for a good solution.

Contributor

BTW, I think it's also OK if we know what we want in the final version and the intermediate change tries to minimize code changes (I haven't looked at the PR at all, so don't interpret this comment as endorsing or not endorsing the PR design).

Member Author

With #21029, we can get rid of this.

Contributor

Why is this named "compute" and not "open" or something more specific?

Contributor

Why is this necessary?

Member Author

Just in case this implementation has a bug or regression. Following DISABLED_V2_STREAMING_MICROBATCH_READERS.

@SparkQA

SparkQA commented Apr 5, 2018

Test build #88943 has finished for PR 20933 at commit 35b74c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2018

Test build #89061 has finished for PR 20933 at commit 9bde159.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

gengliangwang commented Apr 9, 2018

Discussed with @cloud-fan offline. The conclusion is that we can use a simple factory pattern for the data reader factories, so that we can easily avoid redundant code and stop using PartitionedFile => Iterator[InternalRow].
He has created #21029. I will continue updating this one after his PR is merged.

@SparkQA

SparkQA commented Apr 16, 2018

Test build #89392 has finished for PR 20933 at commit 80b36f3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2018

Test build #90584 has finished for PR 20933 at commit 67b1748.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val jobId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
  .format(new Date()) + "-" + UUID.randomUUID()
val writer = ds.asInstanceOf[WriteSupport]
  .createWriter(jobId, df.logicalPlan.schema, mode, options)
Contributor

I am not sure I understand this: why do we use .createWriter here, but not .createReader in DataFrameReader? It seems asymmetrical to me.

Contributor

It is. We're still evolving the v2 API and integration with Spark. This problem is addressed in PR #21305, which is the first of a series of changes to standardize the logical plans and fix problems like this one.

There's also an open proposal for those changes.

@gengliangwang
Member Author

Status update: we are working on a new proposal for changing the data source API, to resolve the problems exposed in this PR.
Until the new proposal is accepted or rejected, this PR remains open.
