[SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning #12879
override def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path)

protected def inferPartitioning(): PartitionSpec = {
@marmbrus These methods are basically unchanged, just moved around from HDFSFileCatalog to PartitioningAwareFileCatalog.
/cc @liancheng @cloud-fan
Test build #57672 has finished for PR 12879 at commit
Test build #57674 has finished for PR 12879 at commit
 * A [[FileCatalog]] that generates the list of files to processing by reading them from the
 * metadata log files generated by the [[FileStreamSink]].
 */
class MetadataLogFileCatalog(sparkSession: SparkSession, path: Path)
This file is basically StreamFileCatalog.scala renamed. GitHub is not showing it as a rename.
Test build #57690 has finished for PR 12879 at commit
Test build #57704 has finished for PR 12879 at commit
Test build #57706 has finished for PR 12879 at commit
ping @marmbrus
LGTM
Thanks! Merging this to master and 2.0
…talog to infer partitioning

## What changes were proposed in this pull request?

File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However, StreamFileCatalog does not infer partitioning like HDFSFileCatalog.

This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, which has all the functionality to infer partitions from a list of leaf files.
- HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes have been moved into their own files as they are not interfaces that should be in fileSourceInterfaces.scala.

## How was this patch tested?

- FileStreamSinkSuite was updated to check that partitioning gets inferred and that, on reading, the partitions get pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.

Author: Tathagata Das <[email protected]>

Closes #12879 from tdas/SPARK-15103.

(cherry picked from commit 0fd3a47)
Signed-off-by: Tathagata Das <[email protected]>
What changes were proposed in this pull request?
File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However, StreamFileCatalog does not infer partitioning like HDFSFileCatalog.
This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, which has all the functionality to infer partitions from a list of leaf files.
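The shape of this refactoring can be illustrated with a small, self-contained sketch. This is not the Spark code: plain strings stand in for Hadoop `Path`/`FileStatus`, and every name ending in `Sketch` is hypothetical, as is the simplified `col=value` parsing. It only shows the idea that the abstract base class owns partition inference while each subclass supplies its own list of leaf files.

```scala
object CatalogSketch {
  // Hypothetical stand-in for a partition entry: partition column values plus the directory.
  final case class PartitionSketch(values: Map[String, String], dir: String)

  // Hypothetical analogue of the abstract class: shared inference over leaf files.
  abstract class PartitioningAwareCatalogSketch {
    // Subclasses decide where the list of leaf files comes from.
    protected def leafFiles: Seq[String]

    // Parent directory of a file path ("" if the path has no '/').
    private def parentDir(file: String): String = {
      val i = file.lastIndexOf('/')
      if (i < 0) "" else file.substring(0, i)
    }

    // Assumption: partitions are encoded as "col=value" directory segments.
    def inferPartitioning(): Seq[PartitionSketch] =
      leafFiles.map(parentDir).distinct.map { dir =>
        val values = dir.split('/').toSeq.collect {
          case seg if seg.contains("=") =>
            val idx = seg.indexOf('=')
            seg.substring(0, idx) -> seg.substring(idx + 1)
        }.toMap
        PartitionSketch(values, dir)
      }
  }

  // Analogue of a listing-based catalog: files come from a directory scan
  // (here simply passed in as an already-listed sequence).
  class ListingCatalogSketch(listed: Seq[String]) extends PartitioningAwareCatalogSketch {
    protected def leafFiles: Seq[String] = listed
  }

  // Analogue of a metadata-log-based catalog: files come from sink log entries.
  class MetadataLogCatalogSketch(logEntries: Seq[String]) extends PartitioningAwareCatalogSketch {
    protected def leafFiles: Seq[String] = logEntries
  }
}
```

With this layout, a metadata-log-backed catalog gets partition inference without duplicating the logic; for example, `new CatalogSketch.MetadataLogCatalogSketch(Seq("/out/id=1/part-0", "/out/id=2/part-0")).inferPartitioning()` yields one entry per `id=` directory.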
How was this patch tested?