Contributor

@tdas tdas commented May 3, 2016

What changes were proposed in this pull request?

The File Stream Sink writes the list of files it has written to a metadata log, and StreamFileCatalog reads that list to determine which files to process. However, StreamFileCatalog does not infer partitioning the way HDFSFileCatalog does.

This PR enables that by refactoring HDFSFileCatalog into an abstract class, PartitioningAwareFileCatalog, which contains all the functionality needed to infer partitions from a list of leaf files.

  • HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
  • StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
  • The above two classes have been moved into their own files, since they are not interfaces and so do not belong in fileSourceInterfaces.scala.
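
The class layout described above can be sketched roughly as follows. This is a simplified model and not the code in this PR: it uses plain strings instead of Hadoop `Path`/`FileStatus`, and every name ending in `Sketch`, along with `LeafFile`, `PartitionDir`, `leafFiles`, and `inferPartitions`, is hypothetical. The only point it illustrates is that the partition inference lives in the abstract class, while each subclass merely supplies its list of leaf files.

```scala
// Simplified model, NOT Spark's actual code. All names here are hypothetical.
final case class LeafFile(path: String)

// Partition column values parsed from one directory, e.g. Map("id" -> "1").
final case class PartitionDir(values: Map[String, String], dir: String)

abstract class PartitioningAwareCatalogSketch {
  // Subclasses only decide where the leaf files come from.
  protected def leafFiles: Seq[LeafFile]

  // Shared logic: derive partitions from "key=value" segments of each parent directory.
  def inferPartitions(): Seq[PartitionDir] =
    leafFiles
      .map { f =>
        val idx = f.path.lastIndexOf('/')
        f.path.substring(0, math.max(idx, 0)) // parent directory of the leaf file
      }
      .distinct
      .map { dir =>
        val values = dir.split('/').toSeq.collect {
          case seg if seg.contains("=") =>
            val i = seg.indexOf('=')
            seg.substring(0, i) -> seg.substring(i + 1)
        }.toMap
        PartitionDir(values, dir)
      }
}

// Stand-in for ListingFileCatalog: files found by scanning directories.
// The scan result is passed in directly to keep the sketch self-contained.
class ListingCatalogSketch(scanned: Seq[LeafFile]) extends PartitioningAwareCatalogSketch {
  protected def leafFiles: Seq[LeafFile] = scanned
}

// Stand-in for MetadataLogFileCatalog: files listed in the sink's metadata log.
class MetadataLogCatalogSketch(logEntries: Seq[String]) extends PartitioningAwareCatalogSketch {
  protected def leafFiles: Seq[LeafFile] = logEntries.map(p => LeafFile(p))
}
```

With this shape, both catalogs would report the same partitions for the same set of files, regardless of whether the files were discovered by listing or read from the log.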

How was this patch tested?

  • FileStreamSinkSuite was updated to check that partitioning is inferred and that, on reading, partitions are pruned correctly based on the query.
  • Other unit tests are unchanged and pass as expected.
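
To make "pruned correctly based on the query" concrete, here is a minimal sketch of what pruning means once partitions have been inferred. It is an illustration only, not the test added to FileStreamSinkSuite and not Spark's planner logic; `PruningSketch`, `PartitionSketch`, and `prunePartitions` are hypothetical names, and only equality filters on partition columns are modelled.

```scala
// Hypothetical sketch of partition pruning; not Spark's planner code.
object PruningSketch {
  // One inferred partition: its column values and the files it contains.
  final case class PartitionSketch(values: Map[String, String], files: Seq[String])

  // Keep only partitions whose values satisfy every equality filter.
  // A partition that lacks a filtered column is dropped.
  def prunePartitions(
      partitions: Seq[PartitionSketch],
      filters: Map[String, String]): Seq[PartitionSketch] =
    partitions.filter { p =>
      filters.forall { case (col, wanted) => p.values.get(col).contains(wanted) }
    }
}
```

A query filtering on `id = 1` would then only need to read the files under the `id=1` directory, which is the behaviour a pruning test would check.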

Contributor Author

tdas commented May 3, 2016

@marmbrus @yhuai

tdas changed the title from "[SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog infer partitioning" to "[SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning" on May 3, 2016

override def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path)

protected def inferPartitioning(): PartitionSpec = {
Contributor Author

@marmbrus These methods are basically unchanged, just moved around from HDFSFileCatalog to PartitioningAwareFileCatalog.

Contributor

marmbrus commented May 3, 2016

/cc @liancheng @cloud-fan


SparkQA commented May 3, 2016

Test build #57672 has finished for PR 12879 at commit e80483e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 3, 2016

Test build #57674 has finished for PR 12879 at commit e8639ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* A [[FileCatalog]] that generates the list of files to processing by reading them from the
* metadata log files generated by the [[FileStreamSink]].
*/
class MetadataLogFileCatalog(sparkSession: SparkSession, path: Path)
Contributor Author

@tdas tdas May 3, 2016

This file is basically StreamFileCatalog.scala renamed; GitHub is just not showing it as a rename.


SparkQA commented May 4, 2016

Test build #57690 has finished for PR 12879 at commit 973847c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 4, 2016

Test build #57704 has finished for PR 12879 at commit 84864e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 4, 2016

Test build #57706 has finished for PR 12879 at commit 9e62ccf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

tdas commented May 4, 2016

ping @marmbrus

Contributor

marmbrus commented May 4, 2016

LGTM

Contributor Author

tdas commented May 4, 2016

Thanks! Merging this to master and 2.0

asfgit pushed a commit that referenced this pull request May 4, 2016
…talog to infer partitioning

## What changes were proposed in this pull request?

The File Stream Sink writes the list of files it has written to a metadata log, and StreamFileCatalog reads that list to determine which files to process. However, StreamFileCatalog does not infer partitioning the way HDFSFileCatalog does.

This PR enables that by refactoring HDFSFileCatalog into an abstract class, PartitioningAwareFileCatalog, which contains all the functionality needed to infer partitions from a list of leaf files.
- HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes have been moved into their own files, since they are not interfaces and so do not belong in fileSourceInterfaces.scala.

## How was this patch tested?
- FileStreamSinkSuite was updated to check that partitioning is inferred and that, on reading, partitions are pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.

Author: Tathagata Das <[email protected]>

Closes #12879 from tdas/SPARK-15103.

(cherry picked from commit 0fd3a47)
Signed-off-by: Tathagata Das <[email protected]>

asfgit closed this in 0fd3a47 on May 4, 2016