
Conversation

@liancheng
Contributor

This PR tries to accelerate Parquet schema discovery and HadoopFsRelation partition discovery. The acceleration is done by the following means:

  • Turning off schema merging by default

    Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow.

  • Avoiding FileSystem.globStatus() call when possible

    FileSystem.globStatus() may issue multiple synchronous RPC calls and can be very slow (especially on S3). This PR adds SparkHadoopUtil.globPathIfNecessary(), which only issues RPC calls when the path contains glob-pattern-specific character(s) ({}[]*?\); see the first sketch after this list.

    This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as input paths, and currently a globStatus() call is made on each input path sequentially.

  • Listing leaf files in parallel when the number of input paths exceeds a threshold

    Listing leaf files is required by partition discovery. Currently it is done on the driver side and can be slow when there are lots of (nested) directories, since each FileSystem.listStatus() call issues an RPC. In this PR, we list leaf files in a BFS style and resort to a Spark job once we find that the number of directories to be listed exceeds a threshold; see the second sketch after this list.

    The threshold is controlled by SQLConf option spark.sql.sources.parallelPartitionDiscovery.threshold, which defaults to 32.

  • Discovering Parquet schema in parallel

    Currently, schema merging is also done on the driver side and needs to read the footers of all part-files. This PR uses a Spark job to do schema merging. Together with task-side metadata reading in Parquet 1.7.0, we no longer read any footers on the driver side; see the third sketch after this list.
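
First sketch: the glob-avoidance check behind SparkHadoopUtil.globPathIfNecessary(). This is an illustration of the idea rather than the exact implementation in the PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: only pay the globStatus() RPC cost when the path actually contains
// glob meta-characters; otherwise hand the path back untouched.
def globPathIfNecessary(pattern: Path, hadoopConf: Configuration): Seq[Path] = {
  val globChars = Set('{', '}', '[', ']', '*', '?', '\\')
  if (pattern.toString.exists(globChars.contains)) {
    val fs = pattern.getFileSystem(hadoopConf)
    // globStatus() may fan out into several synchronous RPCs, so it is only
    // called for paths that really are glob patterns.
    Option(fs.globStatus(pattern)).toSeq.flatten.map(_.getPath)
  } else {
    Seq(pattern)
  }
}
```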
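
Second sketch: driver-side BFS listing with a fallback to a distributed job once the directory frontier exceeds the threshold. The helper names, the use of plain path strings (FileStatus is not serializable, hence the FakeFileStatus class this PR adds), and re-creating the Hadoop configuration on executors are simplifying assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

// Sketch: list directories breadth-first on the driver, then switch to a Spark
// job once the number of directories still to be listed exceeds the threshold.
def listLeafFiles(
    sc: SparkContext,
    paths: Seq[Path],
    hadoopConf: Configuration,
    threshold: Int = 32): Seq[String] = {

  // One FileSystem.listStatus() call (one RPC) per directory.
  def listOne(dir: Path): (Seq[Path], Seq[String]) = {
    val fs = dir.getFileSystem(hadoopConf)
    val (dirs, files) = fs.listStatus(dir).partition(_.isDirectory)
    (dirs.map(_.getPath).toSeq, files.map(_.getPath.toString).toSeq)
  }

  var frontier = paths
  var leaves = Seq.empty[String]
  while (frontier.nonEmpty && frontier.size <= threshold) {
    val (subDirs, files) = frontier.map(listOne).unzip
    leaves ++= files.flatten
    frontier = subDirs.flatten
  }

  if (frontier.isEmpty) {
    leaves
  } else {
    // Too many directories left: finish the listing as a distributed job and
    // ship only serializable path strings back to the driver.
    leaves ++ sc.parallelize(frontier.map(_.toString), frontier.size).flatMap { dirString =>
      def recurse(dir: Path): Seq[String] = {
        // Assumption: a default Configuration is enough to reach the files.
        val fs = dir.getFileSystem(new Configuration())
        val (dirs, files) = fs.listStatus(dir).partition(_.isDirectory)
        files.map(_.getPath.toString).toSeq ++ dirs.map(_.getPath).flatMap(recurse)
      }
      recurse(new Path(dirString))
    }.collect()
  }
}
```

In the real code the threshold comes from spark.sql.sources.parallelPartitionDiscovery.threshold, so it can be tuned, e.g. with sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "64").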
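
Third sketch: reading Parquet footers in a Spark job instead of on the driver. Shipping schemas back as strings, re-parsing them on the driver, and merging with Parquet's MessageType.union are simplifying assumptions about the merge step, not the exact code in this PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.schema.{MessageType, MessageTypeParser}
import org.apache.spark.SparkContext

// Sketch: read every part-file footer in executors and fold the resulting
// Parquet schemas into a single merged schema on the driver.
def mergeSchemasInParallel(sc: SparkContext, partFiles: Seq[String]): MessageType = {
  require(partFiles.nonEmpty, "need at least one Parquet part-file")

  val schemaStrings = sc.parallelize(partFiles).map { file =>
    // Assumption: a default Configuration is enough to reach the files.
    val footer = ParquetFileReader.readFooter(new Configuration(), new Path(file))
    // MessageType is not serializable, so ship it back in its textual form.
    footer.getFileMetaData.getSchema.toString
  }.collect()

  schemaStrings
    .map(s => MessageTypeParser.parseMessageType(s))
    .reduce(_ union _)  // merge pairwise; incompatible schemas will fail here
}
```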

@liancheng
Contributor Author

cc @marmbrus

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37231 has finished for PR 7396 at commit e2d07af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Least(children: Seq[Expression]) extends Expression
    • case class Greatest(children: Seq[Expression]) extends Expression
    • case class FakeFileStatus(

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37232 has finished for PR 7396 at commit 62ac68b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Least(children: Seq[Expression]) extends Expression
    • case class Greatest(children: Seq[Expression]) extends Expression
    • case class FakeFileStatus(

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37233 has finished for PR 7396 at commit 6ad83e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Least(children: Seq[Expression]) extends Expression
    • case class Greatest(children: Seq[Expression]) extends Expression
    • case class FakeFileStatus(

@liancheng
Contributor Author

One of the test failures above is legitimate; it was caused by changing the mergeSchema default to false, and has been fixed. The others look like they were affected by #7216.

@liancheng
Contributor Author

retest please

@liancheng
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 15, 2015

Test build #37287 has finished for PR 7396 at commit f35db47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FakeFileStatus(

@SparkQA

SparkQA commented Jul 15, 2015

Test build #37325 has finished for PR 7396 at commit f122f10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FakeFileStatus(

Contributor

Should this be a per relation option?

Contributor Author

There is one, defined in object ParquetRelation2, named mergeSchema.
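
For reference, a small example of opting back into merging for a single read; it assumes a SQLContext named sqlContext and a placeholder path, and uses the mergeSchema option key mentioned above:

```scala
// Schema merging is now off by default; opt back in per relation when needed.
val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/path/to/partitioned/table")
```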

@liancheng
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37441 has finished for PR 7396 at commit f122f10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FakeFileStatus(

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37492 has finished for PR 7396 at commit 5598efc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FakeFileStatus(

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37490 has finished for PR 7396 at commit ff32cd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FakeFileStatus(

@marmbrus
Contributor

Thanks, merging to master!

@asfgit asfgit closed this in a1064df Jul 20, 2015
@liancheng liancheng deleted the accel-parquet branch July 21, 2015 00:04
liancheng added a commit to liancheng/spark that referenced this pull request Jul 25, 2015