[SPARK-18021][SQL] Refactor file name specification for data sources #15562
Conversation
cc @liancheng @cloud-fan @tejasapatil also @marmbrus -- you have always wanted this to happen.
Test build #67241 has finished for PR 15562 at commit
```scala
private val recordWriter: RecordWriter[Void, InternalRow] = {
  val outputFormat = {
    new ParquetOutputFormat[InternalRow]() {
      // Here we override `getDefaultWorkFile` for two reasons:
```
why remove this comment?
Basically the only contract now is that the prefix needs to be enforced, and it is not the concern of these classes to think about dynamic partitioning or appending.
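As a rough sketch of that contract (class and member names here are hypothetical, not the actual Spark API): the caller hands each writer a ready-made file-name prefix, and the writer's only naming decision is its format-specific extension.

```scala
// Hypothetical sketch of the prefix contract described above; not the
// actual Spark API. The caller owns the prefix (task id, job UUID,
// bucket id, etc.); the writer only appends its extension. Partitioning
// and append semantics are no longer the writer's concern.
abstract class SketchOutputWriter(pathPrefix: String) {
  // The single naming decision left to a concrete writer.
  protected def extension: String

  // Final file name: enforced prefix + writer-chosen suffix.
  final def outputFileName: String = pathPrefix + extension
}

class SketchParquetWriter(pathPrefix: String) extends SketchOutputWriter(pathPrefix) {
  override protected val extension: String = ".parquet"
}
```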
LGTM
Test build #67244 has finished for PR 15562 at commit
liancheng left a comment
LGTM pending Jenkins. Left one minor comment though.
```scala
// implementations may use this UUID to generate unique file names (e.g.,
// `part-r-<task-id>-<job-uuid>.parquet`). The reason why this ID is used to identify a job
// rather than a single task output file is that speculative tasks must generate the same
// output file name as the original task.
```
We should probably preserve this comment and move it to the new place where we generate the UUID.
It's actually there already in WriteJobDescription. I shortened it to a single line.
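For context, a minimal sketch of why the UUID identifies the job rather than an individual file; `SketchJobDescription` and the prefix format below are illustrative only, not the actual `WriteJobDescription`:

```scala
import java.util.UUID

// Illustrative only: the UUID is generated once per job on the driver and
// shipped to every task. A speculative attempt of the same task therefore
// derives an identical file name, so it can safely replace the original
// attempt's output instead of leaving a duplicate behind.
object UuidNamingSketch {
  final case class SketchJobDescription(jobUuid: String)

  def fileNamePrefix(desc: SketchJobDescription, taskId: Int): String =
    f"part-r-$taskId%05d-${desc.jobUuid}"

  def main(args: Array[String]): Unit = {
    val desc = SketchJobDescription(UUID.randomUUID().toString)
    // Two attempts of task 3 within one job agree on the name:
    assert(fileNamePrefix(desc, 3) == fileNamePrefix(desc, 3))
    println(fileNamePrefix(desc, 3))
  }
}
```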
Test build #67273 has finished for PR 15562 at commit
Alright, merging this in master. I will submit more patches to continue this work.
## What changes were proposed in this pull request?
Currently each data source OutputWriter is responsible for specifying the entire file name for each output file. This, however, does not make sense, because we rely on file naming schemes for certain behaviors in Spark SQL, e.g. the bucket id. The current approach allows individual data sources to break the implementation of bucketing.
On the flip side, we also don't want to move file naming entirely out of data sources, because different data sources do want to specify different extensions.
This patch divides file name specification into two parts: the first part is a prefix specified by the caller of OutputWriter (in WriteOutput), and the second part is the suffix that can be specified by the OutputWriter itself. Note that a side effect of this change is that all file-based data sources now also support bucketing automatically.
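To make the division concrete, here is a hedged sketch (all names hypothetical; these are not the actual WriteOutput/OutputWriter signatures):

```scala
// Hypothetical sketch of the two-part file name. Part 1 (the prefix) is
// owned by the caller and carries everything Spark SQL relies on, including
// the bucket id; part 2 (the suffix) is the only piece a data source picks.
object FileNameSplitSketch {
  def callerPrefix(taskId: Int, jobUuid: String, bucketId: Option[Int]): String = {
    val bucketSuffix = bucketId.map(id => f"_$id%05d").getOrElse("")
    f"part-r-$taskId%05d-$jobUuid$bucketSuffix"
  }

  // A Parquet writer contributes only this.
  val parquetSuffix: String = ".parquet"

  def main(args: Array[String]): Unit = {
    println(callerPrefix(7, "a1b2c3", bucketId = Some(2)) + parquetSuffix)
    // part-r-00007-a1b2c3_00002.parquet
  }
}
```

Because the bucket id lives in the caller-controlled prefix, every file-based source gets bucketing uniformly, which is the side effect noted above.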
There are also some other minor cleanups:
- Removed the UUID passed through generic Configuration string
- Some minor rewrites for better clarity
- Renamed "path" in multiple places to "stagingDir", to more accurately reflect its meaning
## How was this patch tested?
This should be covered by existing data source tests.