[SPARK-19352][SQL] Preserve sort order when saving dataset if data is sorted by partition columns #16724
Conversation
I don't think we should respect the ordering of input data in the writer. However, there is an opportunity for optimization: when the data is already partitioned, the writer doesn't need to sort the data by partition columns anymore.
A data save API which doesn't respect the ordering of data sounds strange to me. If this were true then, to put it in an extreme way, we could remove the final Sort operator, if any, when we want to output the data. Besides, you couldn't perform a simple task like "write out the csv files for customers sorted by buying amount in each city". Actually, when the data is not partitioned, saving sorted data will keep the ordering, as we don't do the external sorting, if I understand it correctly. For end-users, it might be hard to figure out why saving partitioned and sorted data can't keep the ordering.
The optimization point sounds good to me. I will create another JIRA/PR for it if the current change ends up not being merged.
I think it's hard to reason about how/when to preserve the data ordering when writing, considering partitioning and bucketing. For now it looks like the ordering should be preserved if the writer doesn't specify partitioning and bucketing, or the specified partitioning is the same as the input data's. If we implement the optimization mentioned above, then we can preserve the ordering when the specified partitioning is the same as the input data's.
My original use case for sorting the output files based on timestamp using Spark was to use the output files with some other machine learning framework which might not readily work well with very large data files, like TensorFlow or Theano. The benefit that I was trying to get was to offload the sorting to Spark, since even if I ended up with large CSV files I could potentially mmap the CSV files to be used with the subsequent frameworks (TF/Theano). I thought this could be a relatively common use case, but from the impressions I'm getting from this discussion, I wonder if this is not a paradigm that Spark supports or encourages?
@cloud-fan Yeah, I think it is reasonable that the ordering should be preserved if the writer doesn't specify partitioning and bucketing (we already do this now), or the specified partitioning is the same as the input data's (currently we don't do this). It is reasonable because bucketing can be thought of as a kind of data partition, and the data ordering after such a partitioning can't be guaranteed, I think. @igozali I think you don't need to bucket the data in each partition, right?
@cloud-fan Even if the specified partitioning is the same as the input data's, we still need the sorting. Because the rows with the same partition values are not guaranteed to be contiguous across all rows, you can't write all rows of the same partition values at once with one output writer.
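The contiguity point can be illustrated with a small standalone sketch (plain Python, not Spark code; the tuples and values are made up): a per-partition output writer needs all rows with one partition value to arrive contiguously, and a stable sort on the partition value alone achieves that while keeping the pre-existing order within each partition.

```python
# Rows as (partition_value, secondary_value); the input is ordered only
# by the secondary value, so partition values are interleaved.
rows = [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e")]

# A stable sort on the partition value makes each partition contiguous
# while preserving the existing order inside each partition.
by_partition = sorted(rows, key=lambda r: r[0])
print(by_partition)  # [(1, 'a'), (1, 'c'), (1, 'e'), (2, 'b'), (2, 'd')]
```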
Oh, good catch. So there is no way to write out partitioned sorted data currently. @viirya can you think of a workaround? Adding a new API may not be a good idea.
@cloud-fan Is the current change not suitable? We can change it to only preserve the data order when partitioning but no bucketing is specified for the output. This change only adds a new constructor to
If we admit that preserving the sort order is not guaranteed by the API, then the change in this PR is not reasonable, as it has a performance penalty.
@cloud-fan OK, I see. If we don't want to add an implicit penalty to the existing API, the only way I can think of now is a config to preserve the sort order.
I may not understand you correctly. The result sets should be partitioned by "userId" and sorted by "timestamp". But in each partition, the rows with the same "userId" are not continuous, while we want the rows with the same "userId" to be continuous in each partition and their
Oh, yes, I missed that. So you recommend that we only optimize this case to preserve the sort order. Sounds good.
```scala
override def execute(iter: Iterator[InternalRow]): Set[String] = {
  val outputOrderingExprs = description.outputOrdering.map(_.child)
```
this duplicates too much code
That's mainly because there are two types of iterators: one yields [UnsafeRow, UnsafeRow] pairs, the other just [UnsafeRow].
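One way to reduce the duplication might be to normalize both iterator shapes into one before a shared write loop. A hedged standalone sketch in Python (the helper name and data shapes are hypothetical, not Spark's API):

```python
def rows(it):
    """Normalize an iterator that yields either (key, row) pairs or bare
    rows into an iterator of rows, so one write loop can serve both."""
    for item in it:
        yield item[1] if isinstance(item, tuple) else item

# Both iterator shapes now feed the same downstream code path.
print(list(rows(iter([("k1", "r1"), ("k2", "r2")]))))  # ['r1', 'r2']
print(list(rows(iter(["r1", "r2"]))))                  # ['r1', 'r2']
```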
```scala
test("SPARK-19352: Keep sort order of rows after external sorter when writing") {
```
again, this is not guaranteed, so we should not test it.
This is an optimization and advanced users can leverage this to preserve the sort order, but it may change in the future.
got it.
@cloud-fan ok. I will look into it.
Closing this in favor of #16898.
## What changes were proposed in this pull request?
In `FileFormatWriter`, we sort the input rows by partition columns, bucket id, and sort columns if we want to write the data out partitioned or bucketed.
However, if the data is already sorted, we will sort it again, which is unnecessary.
This PR removes the sorting logic in `FileFormatWriter` and uses `SortExec` instead. We will not add `SortExec` if the data is already sorted.
## How was this patch tested?
I did a micro-benchmark manually:
```
val df = spark.range(10000000).select($"id", $"id" % 10 as "part").sort("part")
spark.time(df.write.partitionBy("part").parquet("/tmp/test"))
```
The result was about 6.4 seconds before this PR, and about 5.7 seconds afterwards.
close apache#16724
Author: Wenchen Fan <[email protected]>
Closes apache#16898 from cloud-fan/writer.
## What changes were proposed in this pull request?
As reported, there is a sorting issue on relatively big datasets; the issue does not show up on smaller datasets.
For example, when trying to save a sorted dataset like:
On relatively big datasets, the data will not be sorted anymore after writing out.
That is because when we write a partitioned dataset, we use `UnsafeKVExternalSorter` to sort the data before writing. It sorts data based on partition columns + bucketId + sortBy columns. The sort ordering of the data to write is not considered.
Even in the above case, where there is no bucketing or sortBy and the data is already partitioned by "userId", in theory we won't change the order of the data, but we still can't get correctly sorted data in the output files on relatively big datasets. That is because if the data is big enough, `UnsafeKVExternalSorter` will spill data to files. When we merge the spilled files, we interleave the readers of the spilled files, so the order of data across the spilled files is not guaranteed if we don't explicitly ask to sort on a specified order.
This change extends `UnsafeKVExternalSorter` to accept a `List[SortOrder]` parameter used to control how the keys are sorted. It then considers the output ordering of the data to write when constructing the `UnsafeKVExternalSorter`.
Note: one thing we need to discuss is, when we have sortBy columns with bucketing, do we need to keep the output sort ordering, and how? Should we sort by partition columns + bucketId + data sort ordering + sortBy columns, or partition columns + bucketId + sortBy columns + data sort ordering?
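The spill-merge effect can be mimicked with a standalone k-way merge sketch (an analogy in plain Python, not `UnsafeKVExternalSorter` itself; the runs and values are made up): merging spilled runs on the partition value alone loses the secondary order, while merging on the full key preserves it.

```python
import heapq

# Two "spilled runs", each internally sorted; within a partition value
# the rows carry a secondary order (the letter).
run1 = [(1, "b"), (2, "e")]
run2 = [(1, "a"), (2, "d")]

# Merging on the partition value alone: ties are broken by run order,
# not by the secondary column, so the within-partition order is lost.
merged = list(heapq.merge(run1, run2, key=lambda r: r[0]))
print(merged)  # [(1, 'b'), (1, 'a'), (2, 'e'), (2, 'd')]  -- 'b' before 'a'

# Merging on (partition value, secondary column) restores it.
merged_full = list(heapq.merge(run1, run2, key=lambda r: (r[0], r[1])))
print(merged_full)  # [(1, 'a'), (1, 'b'), (2, 'd'), (2, 'e')]
```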
With the goal not to change or add implicit penalty to existing API, this patch solves the issue of preserving sort order of written data. After this patch, you can save sorted and partitioned data as:
Once the data to write is already sorted by the partition columns and no bucketing is specified, `FileFormatWriter` will not sort the data again, and the sort order is preserved in the output files.
## How was this patch tested?
Jenkins tests.