[SPARK-18679] [SQL] Fix regression in file listing performance for non-catalog tables #16112
  }

  test("PartitioningAwareFileIndex listing parallelized with many top level dirs") {
    for ((scale, expectedNumPar) <- Seq((10, 0), (50, 1))) {
Shall we wrap this in `withSQLConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD -> "xxx") { test code }` to make the test more robust?
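To illustrate why pinning the threshold makes the test robust: the expected counts in the test only hold for one particular threshold value, so relying on the default couples the test to whatever that default happens to be. The sketch below shows the general "set, run, restore" pattern behind a helper like `withSQLConf`. It is not Spark's implementation; `ConfSketch`, `withConf`, and the key `discovery.threshold` are hypothetical names, and a plain mutable map stands in for the session configuration.

```scala
import scala.collection.mutable

object ConfSketch {
  // Hypothetical stand-in for a session's mutable configuration.
  val conf: mutable.Map[String, String] =
    mutable.Map("discovery.threshold" -> "32")

  // Sets the given keys while `body` runs, then restores the previous
  // values (or removes keys that were not set before), even if `body` throws.
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None)      => conf.remove(k)
    }
  }
}
```

With a helper of this shape, the test body would read the pinned value inside the block, and the surrounding suite would see the original value again afterwards.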
LGTM. @ericl, have you run a local benchmark to make sure the performance regression is fixed?
Yep
On Thu, Dec 1, 2016, 8:03 PM Wenchen Fan wrote:
> LGTM, @ericl have you run some local benchmark to make sure the performance regression is fixed?
Test build #69531 has finished for PR 16112 at commit
…-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang
Closes #16112 from ericl/spark-18679.

(cherry picked from commit 294163e)
Signed-off-by: Wenchen Fan
Thanks, merging to master/2.1!
## What changes were proposed in this pull request?
In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).
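
The control flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: it walks a hypothetical in-memory tree (`Node`, `FileNode`, `DirNode`) instead of a real file system, the names `ListingSketch`, `bulkList`, and `listSerial` are made up, and a counter (`parallelJobs`) stands in for actually launching a distributed job. The point it tries to show is that the "go parallel" decision is re-evaluated at every level of the descent, so a single top-level directory with many children still triggers a parallel listing one level down.

```scala
// Hypothetical in-memory stand-in for a directory tree.
sealed trait Node { def name: String }
final case class FileNode(name: String) extends Node
final case class DirNode(name: String, children: Seq[Node]) extends Node

object ListingSketch {
  // Counts how often a "parallel" listing would have been launched.
  var parallelJobs = 0

  // Lists one sub-tree entirely serially (what an executor would do).
  private def listSerial(prefix: String, dir: DirNode): Seq[String] = {
    val path = s"$prefix/${dir.name}"
    dir.children.flatMap {
      case FileNode(n) => Seq(s"$path/$n")
      case d: DirNode  => listSerial(path, d)
    }
  }

  // Lists the given (parentPath, directory) pairs. If there are at least
  // `threshold` directories at this level, each sub-tree is handed off and
  // listed serially; otherwise we descend one level and decide again.
  def bulkList(dirs: Seq[(String, DirNode)], threshold: Int): Seq[String] =
    if (dirs.nonEmpty && dirs.size >= threshold) {
      parallelJobs += 1 // stand-in for launching a distributed job
      dirs.flatMap { case (prefix, d) => listSerial(prefix, d) }
    } else {
      dirs.flatMap { case (prefix, d) =>
        val path = s"$prefix/${d.name}"
        val files = d.children.collect { case FileNode(n) => s"$path/$n" }
        val subDirs = d.children.collect { case sd: DirNode => (path, sd) }
        files ++ bulkList(subDirs, threshold)
      }
    }
}
```

For example, calling `bulkList` on a single top-level directory that contains 50 partition directories with a threshold of 32 stays serial at the top level (one directory) and switches to the parallel branch once for the 50 children; with only 10 children it never does.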
cc @mallman @cloud-fan
## How was this patch tested?
Checked metrics in unit tests.