[SPARK-18021][SQL] Refactor file name specification for data sources #15562
Conversation
cc @liancheng @cloud-fan @tejasapatil also @marmbrus -- you have always wanted this to happen.
Test build #67241 has finished for PR 15562 at commit
```scala
private val recordWriter: RecordWriter[Void, InternalRow] = {
  val outputFormat = {
    new ParquetOutputFormat[InternalRow]() {
      // Here we override `getDefaultWorkFile` for two reasons:
```
why remove this comment?
Basically the only contract now is that the prefix needs to be enforced, and it is not the concern of these classes to think about dynamic partitioning or appending.
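As a rough sketch of that contract (class and member names here are hypothetical, not the actual Spark API): the caller hands each writer a ready-made file-name prefix, and the writer's only naming decision is its format-specific extension.

```scala
// Hypothetical sketch of the prefix contract described above; not the
// actual Spark API. The caller owns the prefix (task id, job UUID,
// bucket id, etc.); the writer only appends its extension. Partitioning
// and append semantics are no longer the writer's concern.
abstract class SketchOutputWriter(pathPrefix: String) {
  // The single naming decision left to a concrete writer.
  protected def extension: String

  // Final file name: enforced prefix + writer-chosen suffix.
  final def outputFileName: String = pathPrefix + extension
}

class SketchParquetWriter(pathPrefix: String) extends SketchOutputWriter(pathPrefix) {
  override protected val extension: String = ".parquet"
}
```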
LGTM
Test build #67244 has finished for PR 15562 at commit
liancheng left a comment
LGTM pending Jenkins. Left one minor comment though.
```scala
// implementations may use this UUID to generate unique file names (e.g.,
// `part-r-<task-id>-<job-uuid>.parquet`). The reason why this ID is used to identify a job
// rather than a single task output file is that speculative tasks must generate the same
// output file name as the original task.
```
We should probably preserve this comment and move it to the new place where we generate the UUID.
It's actually there already in WriteJobDescription. I shortened it to a single line.
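For context, a minimal sketch of why the UUID identifies the job rather than an individual file; `SketchJobDescription` and the prefix format below are illustrative only, not the actual `WriteJobDescription`:

```scala
import java.util.UUID

// Illustrative only: the UUID is generated once per job on the driver and
// shipped to every task. A speculative attempt of the same task therefore
// derives an identical file name, so it can safely replace the original
// attempt's output instead of leaving a duplicate behind.
object UuidNamingSketch {
  final case class SketchJobDescription(jobUuid: String)

  def fileNamePrefix(desc: SketchJobDescription, taskId: Int): String =
    f"part-r-$taskId%05d-${desc.jobUuid}"

  def main(args: Array[String]): Unit = {
    val desc = SketchJobDescription(UUID.randomUUID().toString)
    // Two attempts of task 3 within one job agree on the name:
    assert(fileNamePrefix(desc, 3) == fileNamePrefix(desc, 3))
    println(fileNamePrefix(desc, 3))
  }
}
```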
Test build #67273 has finished for PR 15562 at commit
Alright, merging this in master. I will submit more patches to continue this work.
## What changes were proposed in this pull request?
Currently each data source OutputWriter is responsible for specifying the entire file name for each output file. This, however, does not make sense, because we rely on file naming schemes for certain behaviors in Spark SQL, e.g. the bucket id. The current approach allows individual data sources to break the implementation of bucketing.
On the flip side, we also don't want to move file naming entirely out of data sources, because different data sources do want to specify different extensions.
This patch divides file name specification into two parts: the first part is a prefix specified by the caller of OutputWriter (in WriteOutput), and the second part is the suffix that can be specified by the OutputWriter itself. Note that a side effect of this change is that all file-based data sources now also support bucketing automatically.
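To make the division concrete, here is a hedged sketch (all names hypothetical; these are not the actual WriteOutput/OutputWriter signatures):

```scala
// Hypothetical sketch of the two-part file name. Part 1 (the prefix) is
// owned by the caller and carries everything Spark SQL relies on, including
// the bucket id; part 2 (the suffix) is the only piece a data source picks.
object FileNameSplitSketch {
  def callerPrefix(taskId: Int, jobUuid: String, bucketId: Option[Int]): String = {
    val bucketSuffix = bucketId.map(id => f"_$id%05d").getOrElse("")
    f"part-r-$taskId%05d-$jobUuid$bucketSuffix"
  }

  // A Parquet writer contributes only this.
  val parquetSuffix: String = ".parquet"

  def main(args: Array[String]): Unit = {
    println(callerPrefix(7, "a1b2c3", bucketId = Some(2)) + parquetSuffix)
    // part-r-00007-a1b2c3_00002.parquet
  }
}
```

Because the bucket id lives in the caller-controlled prefix, every file-based source gets bucketing uniformly, which is the side effect noted above.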
There are also some other minor cleanups:
- Removed the UUID passed through generic Configuration string
- Some minor rewrites for better clarity
- Renamed "path" in multiple places to "stagingDir", to more accurately reflect its meaning
## How was this patch tested?
This should be covered by existing data source tests.