Skip to content

Conversation

@ericl
Copy link
Contributor

@ericl ericl commented Dec 2, 2016

What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. spark.read.parquet(topLevelDir)), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc @mallman @cloud-fan

How was this patch tested?

Checked metrics in unit tests.

@ericl ericl changed the title [SPARK-18769] [SQL] Fix regression in file listing performance for non-catalog tables [SPARK-18679] [SQL] Fix regression in file listing performance for non-catalog tables Dec 2, 2016
}

test("PartitioningAwareFileIndex listing parallelized with many top level dirs") {
for ((scale, expectedNumPar) <- Seq((10, 0), (50, 1))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we do withSQLConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD -> "xxx") { test code } to make the test more robust?

@cloud-fan
Copy link
Contributor

LGTM, @ericl have you run some local benchmark to make sure the performance regression is fixed?

@ericl
Copy link
Contributor Author

ericl commented Dec 2, 2016 via email

@SparkQA
Copy link

SparkQA commented Dec 2, 2016

Test build #69531 has finished for PR 16112 at commit db66439.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 2, 2016
…-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman  cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <[email protected]>

Closes #16112 from ericl/spark-18679.

(cherry picked from commit 294163e)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.1!

@asfgit asfgit closed this in 294163e Dec 2, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
…-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman  cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <[email protected]>

Closes apache#16112 from ericl/spark-18679.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman  cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <[email protected]>

Closes apache#16112 from ericl/spark-18679.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants