[SPARK-13450] Introduce ExternalAppendOnlyUnsafeRowArray. Change CartesianProductExec, SortMergeJoin, WindowExec to use it #16909
Conversation
ok to test
@rxin: can you please recommend someone who could review this PR?
Test build #72806 has started for PR 16909 at commit
hvanhovell left a comment:
@tejasapatil I have taken a quick pass. It looks good overall. I think it might be a good idea to do some benchmarking to get an idea of potential performance implications.
This seems tricky. Let's move this cast to the call site (and preferably avoid it).
Moved cast to client
Can we also have a version that skips n elements? This will be a little more efficient for the unbounded following window case.
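For intuition, a tiny plain-Scala sketch (hypothetical helper, not the PR's API) of why a skipping variant is cheap for an indexed in-memory store: the cursor simply starts at `startIndex` instead of draining and discarding that many rows on every frame evaluation.

```scala
// Toy illustration: with an indexed backing store, "skip n elements"
// is O(1) positioning rather than an O(n) drain.
def iteratorFrom[T](data: IndexedSeq[T], startIndex: Int): Iterator[T] =
  new Iterator[T] {
    private var i = startIndex
    override def hasNext: Boolean = i < data.length
    override def next(): T = { val e = data(i); i += 1; e }
  }
```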
Done
???
Removed
Please make an internal configuration in SQLConf for this.
created a SQLConf
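For readers unfamiliar with SQLConf's builder pattern, an internal entry looks roughly like this (the key, doc string, and default below are assumptions based on memory of the merged change, not verbatim code):

```scala
// In org.apache.spark.sql.internal.SQLConf (sketch; key/default assumed):
val WINDOW_EXEC_BUFFER_SPILL_THRESHOLD =
  buildConf("spark.sql.windowExec.buffer.spill.threshold")
    .internal() // not part of the user-facing documented surface
    .doc("Threshold for number of rows buffered in window operator")
    .intConf
    .createWithDefault(4096)
```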
This is only an issue after we have cleared the buffer right?
No, there is more to it than that. Apart from clear(), even add() is regarded as a modification.
The inbuilt iterators (from both ArrayBuffer and UnsafeExternalSorter), once created, do not see any newly added data. Clients of ExternalAppendOnlyUnsafeRowArray may not realise this and think that they have read all the data. To prevent that, I need a mechanism to invalidate existing iterators.
val buffer = ArrayBuffer.empty[Int]
buffer.append(1)
val iterator = buffer.iterator
assert(iterator.hasNext)
assert(iterator.next() == 1)
buffer.append(2)
assert(iterator.hasNext) // <-- THIS FAILS
Also, when add() transparently switches the backing storage from ArrayBuffer to UnsafeExternalSorter while there is an open iterator created over the ArrayBuffer, it leads to an IndexOutOfBoundsException:
val buffer = ArrayBuffer.empty[Int]
buffer.append(1)
buffer.append(2)
val iterator = buffer.iterator
buffer.clear()
assert(iterator.hasNext) // <-- THIS FAILS
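For illustration, a self-contained sketch of the invalidation mechanism described above (class and field names are made up for this example, not the PR's actual code): every mutation bumps a modification counter, and an iterator snapshots the counter at creation and fails fast once it goes stale.

```scala
import java.util.ConcurrentModificationException
import scala.collection.mutable.ArrayBuffer

// Toy append-only array whose iterators are invalidated by any mutation.
class InvalidatingArray[T] {
  private val data = new ArrayBuffer[T]
  private var modCount = 0L // bumped on every add() / clear()

  def add(elem: T): Unit = { data += elem; modCount += 1 }
  def clear(): Unit = { data.clear(); modCount += 1 }

  def generateIterator(startIndex: Int = 0): Iterator[T] = new Iterator[T] {
    private val expectedModCount = modCount // snapshot at creation time
    private var i = startIndex

    private def checkNotModified(): Unit =
      if (modCount != expectedModCount) throw new ConcurrentModificationException

    override def hasNext: Boolean = { checkNotModified(); i < data.length }
    override def next(): T = { checkNotModified(); val e = data(i); i += 1; e }
  }
}
```

With this, the silently-failing assertions above would instead surface as a loud ConcurrentModificationException.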
EDIT: I changed this to not use the inbuilt iterator of ArrayBuffer and instead use a counter to iterate over the array. However, I still depend on the inbuilt iterator of UnsafeExternalSorter.
Perhaps it is better just to allocate an array, but that might be overkill.
My intention behind not using a raw Array is to avoid holding on to that memory (if we went that route, one would have to set the spill threshold to a relatively low value to avoid potential wastage of memory).
Before this PR:
- Initial size: SortMergeJoin started off with an ArrayBuffer of default size (i.e. 16); WindowExec started off with an empty ArrayBuffer.
- In both cases, there was no shrinking of the array, so memory is not reclaimed until the operator finishes.
Proposed change:
- I am switching to new ArrayBuffer(128) for both cases, in order to init with a decent size and not start with an empty array. Allocating space for 128 entries upfront is a trivial memory footprint.
- Keeping the "no shrinking" behavior the same. A part of me thinks I could do something smarter by shrinking based on a running average of actual array lengths, but it might be over-optimization; I will first focus on getting the basic stuff in. (A compressed sketch of this hybrid buffering follows below.)
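A compressed sketch of the hybrid buffering described above (hypothetical names; the real spill store is UnsafeExternalSorter, abstracted here as a plain function):

```scala
import scala.collection.mutable.ArrayBuffer

// Buffer in memory up to a threshold, then migrate once to a spill-capable
// store and append there from then on.
class HybridBuffer[T](spillThreshold: Int, spillStore: T => Unit) {
  private val inMemory = new ArrayBuffer[T](128) // decent initial size, trivial footprint
  private var spilled = false

  def add(elem: T): Unit = {
    if (!spilled && inMemory.length < spillThreshold) {
      inMemory += elem
    } else {
      if (!spilled) {
        inMemory.foreach(spillStore) // one-time migration of rows buffered so far
        inMemory.clear()
        spilled = true
      }
      spillStore(elem)
    }
  }
}
```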
Let's also move this configuration into SQLConf.
done
Place it in a def/lazy val if we need it more than once?
moved to a method
Let's just drop RowBuffer in favor of ExternalAppendOnlyUnsafeRowArray; it doesn't make a lot of sense to keep it around. We just need a generateIterator(offset) for the unbounded following case.
Removed RowBuffer
Hi, @tejasapatil.
checkIfValueExists?
fixed the typo
According to here, total additions -> total modifications?
changed
I understand the intention here, but the log message looks a little misleading to me because we are already using ExternalAppendOnlyUnsafeRowArray; technically, we are switching to UnsafeExternalSorter under the hood.
This was a typo. Thanks for pointing this out. Corrected.
var -> val?
changed
numFieldsPerRow for consistency?
renamed
Retest this please
Test build #72827 has finished for PR 16909 at commit
@tejasapatil Do you want to fix the BufferedRowIterator for WholeStageCodegenExec as well? As with inner join, the LinkedList currentRows would cause the same issue, as it buffers the rows from the inner join and takes more memory (probably double, if left and right have similar sizes). Also, they could share a similar iterator data structure.
@hvanhovell @davies Correct me if I am wrong. My understanding is that the following code will go through all matching rows on the right side and put them into the BufferedRowIterator. If there is an OOM caused by the ArrayList matches in SortMergeJoinExec, the memory usage will be doubled by currentRows in BufferedRowIterator (assuming the left and right have the same size). There are two ways to solve it: one is to consume one row at a time, and the other is to make BufferedRowIterator spillable. BTW, I have seen the memory consumption in BufferedRowIterator cause OOM along with matches in SortMergeJoinExec; the former took much more heap memory.
The problem pointed out by @zhzhan is a legitimate problem and I have seen it personally. However, I would scope this PR to 2 use cases and handle other cases separately.
tejasapatil left a comment:
Thanks for the review comments @hvanhovell and @dongjoon-hyun. I have made the changes.
@hvanhovell: Wrt. potential performance implications, the updated PR has a benchmark class with numbers. It's an isolated micro-benchmark comparing the new array impl with ArrayBuffer. To see how much this actually adds up at the query level, I am doing an integration test over a large dataset by running an SMB join with this PR and comparing it against the version without this PR. Will get back with those numbers in some time.
Self review: will change this to generateIterator(startIndex = 0)
Test build #72918 has finished for PR 16909 at commit
Force-pushed a4c4416 to e653171
Test build #72920 has started for PR 16909 at commit
@hvanhovell: Re performance: I ran an SMB join query over two real-world tables (2 trillion rows (40 TB) and 6 million rows (120 GB)). Note that this dataset does not have skew, so no spill happened. I saw an improvement in CPU time of 2-4% over the version without this PR. This did not add up, as I was expecting some regression. I think allocating an array with capacity 128 at the start (instead of starting with the default size of 16) is the sole reason for the perf gain: https://github.com/tejasapatil/spark/blob/SPARK-13450_smb_buffer_oom/sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala#L43. I could remove that and rerun, but effectively the change will be deployed in this form and I wanted to see its effect over a large workload. For micro-benchmarking, I made the test case allocate array buffers of the same size to keep this from interfering with the results. The numbers there look sane, as there is some regression from using the spillable array. I see testcases
How does this compare to using UnsafeExternalSorter directly? There is a similar use case here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProductExec.scala#L42
@davies:
- See the PR description for comparison results.
- Updated the PR to include CartesianProductExec
From the comparison results, UnsafeExternalSorter performs slightly better? If so, any reason not to use UnsafeExternalSorter directly?
Good point. My original guess before writing the benchmark was that ArrayBuffer would be superior to UnsafeExternalSorter, so I went with an impl which initially behaves like ArrayBuffer but later switches to UnsafeExternalSorter. I won't make this call solely based on the micro-benchmark results, as they might not reflect what actually happens when a query runs: there are other operations happening while the buffer is populated and accessed.
Update: I created a test version that removes the ArrayBuffer and lets UnsafeExternalSorter alone back the data. Over prod queries, this turned out to be slightly slower, unlike what the micro-benchmark results showed. I will stick with the current approach.
ok to test
retest this please
Test build #72941 has finished for PR 16909 at commit
Force-pushed 1494c83 to 5b73e2b
Test build #72963 has finished for PR 16909 at commit
Test build #72964 has finished for PR 16909 at commit
…ernalAppendOnlyUnsafeRowArray` against using raw `UnsafeExternalSorter`
Force-pushed 3ae834a to 23acc3f
Test build #74603 has finished for PR 16909 at commit
LGTM - merging to master. Thanks!
var i = 0
while (i < numFrames) {
-   frames(i).prepare(rowBuffer.copy())
+   frames(i).prepare(buffer)
No copy?
…lling
## What changes were proposed in this pull request?
`WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) is reused for other things. This can happen in window functions when they start spilling to disk: when reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`. This was not seen before, because the spilling logic did not do actual spills as much and actually used an in-memory page; this page was not cleaned up during window processing and ensured unsafe objects pointed to their own dedicated memory location. This was changed by #16909; after that PR Spark spills more eagerly. This PR provides a surgical fix because we are close to releasing Spark 2.2: it just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
## How was this patch tested?
Added a regression test to `DataFrameWindowFunctionsSuite`.
Author: Herman van Hovell <[email protected]>
Closes #18470 from hvanhovell/SPARK-21258.
(cherry picked from commit e2f32ee)
Signed-off-by: Wenchen Fan <[email protected]>
…nalAppendOnlyUnsafeRowArray
## What changes were proposed in this pull request?
[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to the default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for the WINDOW operator. The old behaviour of the WINDOW operator (pre #16909) was to hold data in an array for the first 4096 records, after which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was a paucity of memory due to excessive consumers). Currently, both the switch from in-memory to `UnsafeExternalSorter` and the `UnsafeExternalSorter` spilling to disk are controlled by a single threshold for `ExternalAppendOnlyUnsafeRowArray`. This PR separates the two to allow more granular control.
## How was this patch tested?
Added unit tests
Author: Tejas Patil <[email protected]>
Closes #18843 from tejasapatil/SPARK-21595.
(cherry picked from commit 9443999)
Signed-off-by: Herman van Hovell <[email protected]>
I have one question regarding this change. Thanks, @tejasapatil.
Hi @sheperdh, the PR author does make a brief note about 'Full Outer Joins' in the PR description.
Apparently, that separate PR is pending.
What issue does this PR address?
Jira: https://issues.apache.org/jira/browse/SPARK-13450
In SortMergeJoinExec, rows of the right relation having the same value for a join key are buffered in-memory. In case of skew, this causes OOMs (see comments in SPARK-13450 for more details). A heap dump from a failed job confirms this: https://issues.apache.org/jira/secure/attachment/12846382/heap-dump-analysis.png. While it's possible to increase the heap size as a workaround, Spark should be resilient to such issues, as skews can happen arbitrarily.
Change proposed in this pull request
- Introduced ExternalAppendOnlyUnsafeRowArray, which:
  - holds UnsafeRows in-memory up to a certain threshold
  - after the threshold is hit, switches to UnsafeExternalSorter, which enables spilling of the rows to disk. It does NOT sort the data.
  - invalidates the existing iterator(s) on any modification (add or clear)
- WindowExec was already using UnsafeExternalSorter to support spilling. Changed it to use the new array.
- Changed SortMergeJoinExec to use the new array implementation.
- Changed CartesianProductExec to use the new array implementation.
(A hedged usage sketch of the new structure follows below.)
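To make the API concrete, a usage sketch inferred from the description above (constructor arity and method names should be treated as assumptions, not verbatim signatures):

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray

def drain(input: Iterator[UnsafeRow], consume: UnsafeRow => Unit): Unit = {
  val array = new ExternalAppendOnlyUnsafeRowArray(4096) // spill threshold
  input.foreach(array.add)          // may transparently switch to UnsafeExternalSorter
  val it = array.generateIterator() // or generateIterator(startIndex = n)
  while (it.hasNext) consume(it.next())
  array.clear()                     // invalidates `it`; create a new iterator afterwards
}
```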
Note for reviewers
The diff can be divided into 3 parts. My motive behind having all the changes in a single PR was to demonstrate that the API is sane and supports 2 use cases. If reviewing as 3 separate PRs would help, I am happy to make the split.
How was this patch tested?
Unit testing
- ExternalAppendOnlyUnsafeRowArray: to validate all its APIs and access patterns
- SortMergeExec: for all joins (except FULL OUTER JOIN). However, I expect existing test cases to cover that there is no regression in correctness.
- WindowExec: to check the behavior of spilling and correctness of results.
(An example test shape follows below.)
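As an example, the empty-array case might look like this inside a SparkFunSuite (the test name, parameter name, and assertions are illustrative, not copied from the PR):

```scala
test("iterator on an empty array should be empty") {
  // Threshold value is arbitrary here; nothing is added, so nothing spills.
  val array = new ExternalAppendOnlyUnsafeRowArray(numRowsSpillThreshold = 100)
  assert(array.isEmpty)
  assert(!array.generateIterator().hasNext)
}
```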
Stress testing
Generating the synthetic data
Ran this against trunk vs. a local build with this PR. The OOM reproduces with trunk; with the fix, this query runs fine.
Performance comparison
Macro-benchmark
I ran an SMB join query over two real-world tables (2 trillion rows (40 TB) and 6 million rows (120 GB)). Note that this dataset does not have skew, so no spill happened. I saw an improvement in CPU time of 2-4% over the version without this PR. This did not add up, as I was expecting some regression. I think allocating an array with capacity 128 at the start (instead of starting with the default size of 16) is the sole reason for the perf gain: https://github.com/tejasapatil/spark/blob/SPARK-13450_smb_buffer_oom/sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala#L43. I could remove that and rerun, but effectively the change will be deployed in this form and I wanted to see its effect over a large workload.
Micro-benchmark
Two types of benchmarking can be found in ExternalAppendOnlyUnsafeRowArrayBenchmark:
[A] Comparing ExternalAppendOnlyUnsafeRowArray against a raw ArrayBuffer when all rows fit in-memory and there is no spill
[B] Comparing ExternalAppendOnlyUnsafeRowArray against a raw UnsafeExternalSorter when there is spilling of data
(A benchmark skeleton follows below.)
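A skeleton of how such a comparison is typically written with Spark's Benchmark utility (its package has moved across Spark versions; the case bodies are elided and the workload shape is an assumption):

```scala
import org.apache.spark.util.Benchmark

val numRows = 1024 * 1024
val benchmark = new Benchmark("Array with spill threshold not reached", numRows)

benchmark.addCase("ArrayBuffer") { _ =>
  // Append numRows rows to an ArrayBuffer, then scan it (scenario [A] baseline).
  ()
}
benchmark.addCase("ExternalAppendOnlyUnsafeRowArray") { _ =>
  // Same workload against the new structure, with the spill threshold set
  // high enough that everything stays in memory; scenario [B] would instead
  // force a low threshold so the UnsafeExternalSorter path is exercised.
  ()
}
benchmark.run()
```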