[SPARK-19761][SQL]create InMemoryFileIndex with an empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero failed #17093
Conversation
Test build #73552 has finished for PR 17093 at commit
    test("InMemoryFileIndex with empty rootPaths when PARALLEL_PARTITION_DISCOVERY_THRESHOLD is 0") {
After this fix, when users set it to -1, we still face the same strange error. We need a complete fix.
I think users should not deliberately set it to a negative number; if they do, throwing an exception is reasonable.
Test build #73568 has finished for PR 17093 at commit
Test build #73583 has finished for PR 17093 at commit
        sparkSession: SparkSession): Seq[(Path, Seq[FileStatus])] = {

      // Short-circuits parallel listing when serial listing is likely to be faster.
      if (paths.size < sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
Why don't we just make sure parallelPartitionDiscoveryThreshold is greater than 0? We can add a condition (via checkValue) in SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.
Sorry, I didn't notice there is a checkValue function; let me fix it. Thanks!
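For reference, the change being suggested would look roughly like the sketch below, reconstructed from the diff fragments later in this thread. The builder method name (`buildConf`), the default value, and the final error-message wording are assumptions here, not verified against the merged code.

```scala
// Sketch only: assumes SQLConf's internal config builder; names and defaults may differ.
val PARALLEL_PARTITION_DISCOVERY_THRESHOLD =
  buildConf("spark.sql.sources.parallelPartitionDiscovery.threshold")
    .doc("The maximum number of paths allowed for listing files at driver side. If the number " +
      "of detected paths exceeds this value during partition discovery, it tries to list the " +
      "files with another Spark distributed job. This applies to Parquet, ORC, CSV, JSON and " +
      "LibSVM data sources.")
    .intConf
    .checkValue(parallel => parallel >= 0, "The maximum number of paths allowed for listing " +
      "files at driver side must not be negative")
    .createWithDefault(32)
```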
      // Short-circuits parallel listing when serial listing is likely to be faster.
-     if (paths.size < sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
+     if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
Adding the equals sign makes the meaning of the conf parallelPartitionDiscoveryThreshold clearer.
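As a side note on the semantics under discussion, here is a minimal, self-contained sketch (plain Scala, no Spark) of the dispatch decision; `threshold` stands in for `parallelPartitionDiscoveryThreshold` and the helper name is purely illustrative.

```scala
object ThresholdSemantics extends App {
  // True when the number of detected paths exceeds the threshold, i.e. when a
  // distributed listing job would be launched instead of listing on the driver.
  def listsInParallel(numPaths: Int, threshold: Int): Boolean = numPaths > threshold

  assert(!listsInParallel(numPaths = 0, threshold = 0))   // empty rootPaths, threshold 0: serial branch
  assert(listsInParallel(numPaths = 1, threshold = 0))    // threshold 0: any path triggers the parallel job
  assert(!listsInParallel(numPaths = 32, threshold = 32)) // with `<=`, an equal count still lists serially
  println("threshold semantics hold")
}
```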
    }

    test("InMemoryFileIndex with empty rootPaths when PARALLEL_PARTITION_DISCOVERY_THRESHOLD" +
      "is not positive number") {
is negative
There is a zero-value test in it, so I used "not positive"...
not positive number -> a nonpositive number
| "files with another Spark distributed job. This applies to Parquet, ORC, CSV, JSON and " + | ||
| "LibSVM data sources.") | ||
| .intConf | ||
| .checkValue(parallel => parallel >= 0, "The maximum number of files allowed for listing " + |
Actually, it should be "the maximum number of root paths"?
Test build #73633 has finished for PR 17093 at commit
Test build #73650 has finished for PR 17093 at commit
LGTM except #17093 (comment)
Test build #73665 has finished for PR 17093 at commit
        }
      }.getMessage
      assert(e.contains("The maximum number of paths allowed for listing files at " +
        "driver side must not be negative"))
Nit: indent. : )
oh...thanks~
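Putting the diff fragments from this review together, the added test presumably looks roughly like the sketch below. The `withSQLConf` helper, the `spark` session, and the `InMemoryFileIndex` constructor arguments are assumptions based on common Spark test conventions, not quoted from the merged suite.

```scala
// Rough reconstruction from the diff fragments above; helper names and constructor
// arity are assumptions, not verified code.
test("InMemoryFileIndex with empty rootPaths when PARALLEL_PARTITION_DISCOVERY_THRESHOLD" +
  "is a nonpositive number") {
  withSQLConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.key -> "0") {
    new InMemoryFileIndex(spark, Nil, Map.empty, None)   // must not throw after the fix
  }

  val e = intercept[IllegalArgumentException] {
    withSQLConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.key -> "-1") {
      new InMemoryFileIndex(spark, Nil, Map.empty, None) // negative values are now rejected
    }
  }.getMessage
  assert(e.contains("The maximum number of paths allowed for listing files at " +
    "driver side must not be negative"))
}
```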
Test build #73672 has finished for PR 17093 at commit
Test build #73673 has finished for PR 17093 at commit
Thanks! Merging to master.
…ed to listFiles twice

## What changes were proposed in this pull request?

Currently, when we resolveRelation for a `FileFormat DataSource` without providing a user schema, it executes `listFiles` twice in `InMemoryFileIndex` during `resolveRelation`. This PR adds a `FileStatusCache` for DataSource, which avoids listing the files twice. But there is a bug in `InMemoryFileIndex`, see [SPARK-19748](#17079) and [SPARK-19761](#17093), so this PR should land after SPARK-19748/SPARK-19761.

## How was this patch tested?

unit test added

Author: windpiger <[email protected]>

Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
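For context on the caching idea mentioned in that description, a hedged sketch of how a shared `FileStatusCache` might be obtained and passed to `InMemoryFileIndex`. `FileStatusCache.getOrCreate` and the constructor's cache parameter are internal APIs and are assumed here; they may differ by Spark version.

```scala
// Hedged sketch only: internal APIs, signatures may differ across Spark versions.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileStatusCache, InMemoryFileIndex}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val cache = FileStatusCache.getOrCreate(spark)     // cache shared across index instances
val paths = Seq(new Path("/tmp/example"))          // hypothetical root path

// Reusing the same cache means a second index over the same paths need not re-list them,
// which is the "avoid listFiles twice" idea described above.
new InMemoryFileIndex(spark, paths, Map.empty, None, cache)
```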
What changes were proposed in this pull request?
If we create an InMemoryFileIndex with empty rootPaths when PARALLEL_PARTITION_DISCOVERY_THRESHOLD is set to zero, it will throw an exception:
How was this patch tested?
unit test added
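A minimal reproduction sketch of the scenario described above; `InMemoryFileIndex` is an internal API, so the constructor arguments are assumptions and the exact exception text is intentionally not reproduced here.

```scala
// Hedged reproduction sketch; internal API, constructor may differ across versions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder().master("local[1]").appName("SPARK-19761").getOrCreate()
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "0")

// Before this patch this reportedly failed; with the fix, the empty rootPaths list
// takes the serial listing branch and simply yields no files.
new InMemoryFileIndex(spark, Nil, Map.empty, None)
```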