[SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery #7396
Conversation
cc @marmbrus
Test build #37231 has finished for PR 7396 at commit
Test build #37232 has finished for PR 7396 at commit
Test build #37233 has finished for PR 7396 at commit
One of the test failures above is legitimate, which was caused by making
retest please
retest this please
Test build #37287 has finished for PR 7396 at commit
Test build #37325 has finished for PR 7396 at commit
Should this be a per-relation option?
There is one, defined in object `ParquetRelation2` and named `mergeSchema`.
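For readers hunting for that knob, a usage sketch. The per-relation option name matches the comment above; the global configuration key shown is the one this behavior surfaces through in released Spark versions. This needs a live `SQLContext`, so treat it as an illustrative fragment rather than a self-contained example:

```scala
// Re-enable Parquet schema merging for a single relation via the data
// source option named in the comment above:
val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/path/to/parquet/table")

// Or change the global default through the SQL configuration:
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
```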
retest this please
Test build #37441 has finished for PR 7396 at commit
- Removes some dead code
- Parallelizes input path listing
Test build #37492 has finished for PR 7396 at commit
Test build #37490 has finished for PR 7396 at commit
Thanks, merging to master!
This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery. The acceleration is done by the following means:

1. **Turning off schema merging by default**

   Schema merging is not the most common case, but it requires reading the footers of all Parquet part-files and can be very slow.
2. **Avoiding `FileSystem.globStatus()` calls when possible**

   `FileSystem.globStatus()` may issue multiple synchronous RPC calls and can be very slow (especially on S3). This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contains glob-pattern specific character(s) (`{}[]*?\`).

   This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as input paths, and currently we do a `globStatus` call on each input path sequentially.
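The core of the glob-avoidance idea can be sketched in a few lines. This is a minimal illustration, not Spark's actual `SparkHadoopUtil.globPathIfNecessary()` implementation; the `glob` callback stands in for the real `FileSystem.globStatus()` call:

```scala
// Only pay the (potentially expensive) glob RPC when the path can actually
// be a glob pattern. Names here are illustrative.
object GlobPaths {
  // Glob-pattern specific characters, as listed in the PR description.
  private val globChars: Set[Char] = "{}[]*?\\".toSet

  def isGlobPath(path: String): Boolean =
    path.exists(c => globChars.contains(c))

  // `glob` stands in for the real FileSystem.globStatus() call.
  def globPathIfNecessary(path: String, glob: String => Seq[String]): Seq[String] =
    if (isGlobPath(path)) glob(path) else Seq(path)
}
```

For a plain partition directory like `/data/year=2015/month=07`, no RPC is issued at all; only paths containing one of the special characters fall through to the glob call.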
3. **Listing leaf files in parallel when the number of input paths exceeds a threshold**

   Listing leaf files is required by partition discovery. Currently it is done on the driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC. In this PR, we list leaf files in BFS style and resort to a Spark job once we find that the number of directories to be listed exceeds a threshold.

   The threshold is controlled by the `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.
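The BFS strategy above can be sketched locally. This sketch uses `java.io.File` instead of Hadoop's `FileSystem.listStatus()`, and only marks where the real implementation would switch to a Spark job; all names are illustrative, not the PR's actual code:

```scala
import java.io.File
import scala.collection.mutable

// BFS-style leaf-file listing with a parallelism threshold. In the real PR
// the over-threshold branch distributes the listing as a Spark job; here we
// just note the switch point and keep listing locally.
def listLeafFiles(root: File, threshold: Int = 32): Seq[File] = {
  val leaves = mutable.Buffer.empty[File]
  var frontier: Seq[File] = Seq(root)
  while (frontier.nonEmpty) {
    if (frontier.size > threshold) {
      // Real implementation: distribute `frontier` across a Spark job here
      // and collect the listed statuses back to the driver.
    }
    val children =
      frontier.flatMap(d => Option(d.listFiles).map(_.toSeq).getOrElse(Seq.empty))
    val (dirs, files) = children.partition(_.isDirectory)
    leaves ++= files
    frontier = dirs // descend one BFS level per iteration
  }
  leaves.toSeq
}
```

Because the traversal is breadth-first, the frontier grows with the directory fan-out, so deeply partitioned tables cross the threshold quickly and get the distributed listing.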
4. **Discovering Parquet schema in parallel**

   Currently, schema merging is also done on the driver side and needs to read the footers of all part-files. This PR uses a Spark job to do schema merging. Together with task-side metadata reading in Parquet 1.7.0, we never read any footers on the driver side now.
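The shape of that distributed merge is a pairwise reduce over per-file schemas. As a toy illustration (a schema here is just a field-name-to-type map; Spark's real `StructType` merging handles nesting, nullability, and type widening, and the per-file read happens inside Spark tasks):

```scala
// Merge two schemas, failing on conflicting types for the same field.
def mergeSchemas(a: Map[String, String], b: Map[String, String]): Map[String, String] = {
  for ((name, tpe) <- b; existing <- a.get(name)) {
    require(existing == tpe, s"Conflicting types for field '$name': $existing vs $tpe")
  }
  a ++ b
}

// Each element stands in for the schema read from one part-file's footer;
// in the PR this would be an RDD reduced by a Spark job, not a local Seq.
val perFileSchemas = Seq(
  Map("id" -> "int"),
  Map("id" -> "int", "name" -> "string")
)
val merged = perFileSchemas.reduce(mergeSchemas)
```

Since the merge is associative, it can be applied in any order across tasks, which is what makes pushing it into a Spark job straightforward.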