[SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelation #6090

liancheng · 2015-05-12T18:42:49Z

This PR migrates Parquet data source to the newly introduced FSBasedRelation. FSBasedParquetRelation is created to replace ParquetRelation2. Major differences are:

1. Partition discovery code has been factored out to FSBasedRelation.
2. AppendingParquetOutputFormat is not used now. Instead, an anonymous subclass of ParquetOutputFormat is used to handle appending and writing dynamic partitions.
3. When scanning partitioned tables, FSBasedParquetRelation.buildScan only builds an RDD[Row] for a single selected partition.
4. FSBasedParquetRelation doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend CatalystScan anymore.

After migrating JSONRelation (which extends CatalystScan), we can remove CatalystScan.
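
For readers unfamiliar with partition discovery, here is a minimal sketch of the idea: directory segments of the form column=value are read off a file path and turned into partition columns. `PartitionPaths` and `parse` are hypothetical names used only for illustration. This is not the code factored out into FSBasedRelation, which presumably also handles details such as value types and escaping that are omitted here.

```scala
object PartitionPaths {
  // Collects "column=value" directory segments from a path, in order.
  // Segments without "=" (scheme, host, table directory, file name) are skipped.
  def parse(path: String): Seq[(String, String)] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(column, value) if column.nonEmpty => Some(column -> value)
        case _ => None
      }
    }
}
```

Under these assumptions, a path such as `hdfs://host/table/year=2015/month=5/part-r-00001.parquet` yields the pairs `year -> 2015` and `month -> 5`, both still as strings.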

liancheng · 2015-05-12T18:44:26Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala

The original code doesn't handle file names like parquet-r-00001.gz.parquet.
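
To illustrate the kind of file name involved, the following is a rough sketch of extracting the numeric part ID from names that carry more than one extension. The object name, the regular expression, and the assumption that the ID follows `-r-` are all hypothetical; this is not the change made in ParquetTableOperations.scala.

```scala
object PartFileNames {
  // "<prefix>-r-<digits>" followed by zero or more dot-separated extensions,
  // e.g. ".parquet" or ".gz.parquet". Hypothetical pattern for illustration.
  private val PartFile = """.*-r-(\d+)(?:\.[^./]+)*""".r

  // Returns the numeric part ID, or None if the name does not look like a part file.
  def taskId(fileName: String): Option[Int] = fileName match {
    case PartFile(id) => Some(id.toInt)
    case _ => None
  }
}
```

A pattern that expects exactly one extension after the digits would reject `parquet-r-00001.gz.parquet`, while this one accepts any number of extensions.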

AmplabJenkins · 2015-05-12T18:47:12Z

Merged build triggered.

AmplabJenkins · 2015-05-12T18:47:18Z

Merged build started.

SparkQA · 2015-05-12T18:48:02Z

Test build #32523 has started for PR 6090 at commit f4482ca.

SparkQA · 2015-05-12T19:00:19Z

Test build #32523 has finished for PR 6090 at commit f4482ca.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-12T19:00:27Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-12T19:00:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32523/
Test FAILed.

SparkQA · 2015-05-12T19:35:46Z

Test build #802 has started for PR 6090 at commit f4482ca.

SparkQA · 2015-05-12T19:46:06Z

Test build #802 has finished for PR 6090 at commit f4482ca.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-12T23:07:12Z

Merged build triggered.

AmplabJenkins · 2015-05-12T23:07:21Z

Merged build started.

SparkQA · 2015-05-12T23:07:48Z

Test build #32549 has started for PR 6090 at commit e40bb7b.

SparkQA · 2015-05-13T01:28:39Z

Test build #32549 has finished for PR 6090 at commit e40bb7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-13T01:28:44Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-13T01:28:44Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32549/
Test PASSed.

yhuai · 2015-05-13T04:11:10Z

sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala

The return type does not need to be a FileOutputCommitter, right?

I used FileOutputCommitter here because we need to retrieve the actual path of the file being written, which is returned by FileOutputCommitter.getWorkPath. This implies that customized output committers must be subclasses of FileOutputCommitter, which was true for DirectParquetOutputCommitter. But this restriction seems too strict. I'll switch to OutputCommitter rather than FileOutputCommitter in another PR.

AmplabJenkins · 2015-05-13T14:47:11Z

Merged build triggered.

AmplabJenkins · 2015-05-13T14:47:21Z

Merged build started.

liancheng · 2015-05-13T14:50:23Z

Rebased to #6118.

AmplabJenkins · 2015-05-13T14:52:13Z

Merged build triggered.

AmplabJenkins · 2015-05-13T14:52:21Z

Merged build started.

SparkQA · 2015-05-13T14:53:04Z

Test build #32620 has started for PR 6090 at commit 3d770f4.

AmplabJenkins · 2015-05-13T15:02:22Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-13T15:02:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32618/
Test FAILed.

SparkQA · 2015-05-13T16:11:38Z

Test build #803 has started for PR 6090 at commit 3d770f4.

yhuai · 2015-05-13T16:35:05Z

sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala

classOf[OutputCommitter]?

Yes, thanks!

yhuai · 2015-05-13T16:38:12Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/fsBasedParquet.scala

Seems this comment is outdated?

No. When FileOutputCommitter is used, we still use FileOutputCommitter.getWorkPath() internally inside InsertIntoFSBasedRelation.
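
The distinction discussed here can be sketched with stand-in types, so it runs without Hadoop on the classpath. `CommitterLike`, `FileCommitterLike`, `DirectCommitterLike`, and `WorkPaths` are hypothetical names, not Hadoop or Spark classes. Falling back to the final output path for other committers is an assumption, not something this thread confirms.

```scala
// Stand-ins for illustration only; these are NOT Hadoop classes.
trait CommitterLike
class FileCommitterLike(val workPath: String) extends CommitterLike
class DirectCommitterLike extends CommitterLike

object WorkPaths {
  // Uses the committer's temporary work path when it exposes one;
  // otherwise falls back to the final output path (assumed behavior).
  def resolve(committer: CommitterLike, outputPath: String): String =
    committer match {
      case f: FileCommitterLike => f.workPath
      case _ => outputPath
    }
}
```

Accepting the general committer type and matching on the file-based subtype only where a work path is needed avoids forcing every custom committer to extend the file-based one.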

AmplabJenkins · 2015-05-13T16:42:10Z

Merged build triggered.

AmplabJenkins · 2015-05-13T16:42:17Z

Merged build started.

SparkQA · 2015-05-13T16:44:10Z

Test build #32626 has started for PR 6090 at commit 6063f87.

SparkQA · 2015-05-13T17:14:17Z

Test build #32620 has finished for PR 6090 at commit 3d770f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-13T17:14:22Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-13T17:14:23Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32620/
Test PASSed.

yhuai · 2015-05-13T17:15:40Z

sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala

Seems this constructor is not defined in OutputCommitter.

Why do we need this change?

Actually, in InsertIntoFSBasedRelation.run, we already set the output path through FileOutputFormat.setOutputPath(job, qualifiedOutputPath). So, the output path should be set in context. Seems we only need to check whether mapred.output.committer.class is set or not. If it is set, we create the output committer based on the specified class.

    val committerClass = context.getConfiguration.getClass(
      "mapred.output.committer.class", null, classOf[OutputCommitter])

    Option(committerClass).map { clazz =>
      val ctor = clazz.getDeclaredConstructor()
      ctor.newInstance()
    }.getOrElse {
      outputFormatClass.newInstance().getOutputCommitter(context)
    }

Actually, if committerClass is based on the mapred interface, setupJob will not work, because mapred's output committer uses mapred's JobContext (a subclass of mapreduce's JobContext) while we are using Job in the mapreduce package (another subclass of mapreduce's JobContext).
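
The "use the configured class if present, otherwise ask the output format" pattern from the snippet above can be tried in isolation. The following is a minimal sketch with stand-in types: `Committer`, `DefaultCommitter`, `CustomCommitter`, and `CommitterFactory` are hypothetical and only mimic the shape of the Hadoop classes involved. The `configured` parameter plays the role of the value read from mapred.output.committer.class.

```scala
// Stand-ins for illustration only; these are NOT Hadoop classes.
trait Committer { def name: String }
class DefaultCommitter extends Committer { val name = "default" }
class CustomCommitter extends Committer { val name = "custom" }

object CommitterFactory {
  // `configured` is None when no committer class was set in the configuration.
  // `fallback` stands in for asking the output format for its own committer.
  def create(
      configured: Option[Class[_ <: Committer]],
      fallback: () => Committer): Committer =
    configured.map { clazz =>
      // Requires a no-argument constructor; getDeclaredConstructor()
      // throws NoSuchMethodException if the class does not declare one.
      val ctor = clazz.getDeclaredConstructor()
      val instance: Committer = ctor.newInstance()
      instance
    }.getOrElse(fallback())
}
```

For example, `CommitterFactory.create(Some(classOf[CustomCommitter]), () => new DefaultCommitter)` instantiates the custom class reflectively, while passing `None` returns whatever the fallback produces. The sketch does not address the mapred versus mapreduce JobContext mismatch raised above.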

marmbrus · 2015-05-13T18:03:10Z

I'm going to go ahead and merge this so we can start testing. Yin's concerns about using other output committers can be addressed in a follow-up (we should consider adding tests that use DirectOutputCommitter).

This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:

1. Partition discovery code has been factored out to `FSBasedRelation`
2. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
3. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
4. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore

After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.

Author: Cheng Lian <[email protected]>

Closes #6090 from liancheng/parquet-migration and squashes the following commits:

6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
261d8c1 [Cheng Lian] Minor bug fix and more tests
db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation

(cherry picked from commit 7ff16e8)
Signed-off-by: Michael Armbrust <[email protected]>

SparkQA · 2015-05-13T18:34:02Z

Test build #803 has finished for PR 6090 at commit 3d770f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-05-13T19:13:27Z

Test build #32626 has finished for PR 6090 at commit 6063f87.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-13T19:13:33Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-13T19:13:33Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32626/
Test PASSed.

liancheng reviewed May 12, 2015

yhuai reviewed May 13, 2015

liancheng force-pushed the parquet-migration branch from e40bb7b to a0a3ee9 on May 13, 2015 14:44

yhuai reviewed May 13, 2015

liancheng added 3 commits May 14, 2015 00:38

Migrates Parquet data source to FSBasedRelation

db65660

Minor bug fix and more tests

261d8c1

Adds ParquetRelation2 related classes to MiMa check whitelist

f9ea56e

liancheng added 2 commits May 14, 2015 00:38

Fixes compilation error introduced while rebasing

bfd1cf0

Casts to OutputCommitter rather than FileOutputCommtter

6063f87

yhuai reviewed May 13, 2015

liancheng force-pushed the parquet-migration branch from ff22b46 to 6063f87 on May 13, 2015 16:39

yhuai reviewed May 13, 2015

asfgit closed this in 7ff16e8 May 13, 2015

viirya mentioned this pull request May 14, 2015

[SPARK-7447][SQL] Don't re-merge Parquet schema when the relation is deserialized #6012

Closed

schlosna mentioned this pull request Nov 23, 2016

Fix createFilter for NOT queries palantir/spark#68

Merged
