[SPARK-18403][SQL] Fix unsafe data false sharing issue in ObjectHashAggregateExec #15976
Conversation
Test build #68982 has started for PR 15976 at commit

    }
    doubleSafeCheckRows(actual1, expected, 1e-4)
    doubleSafeCheckRows(actual2, expected, 1e-4)
cc @yhuai @cloud-fan

Thank you for fixing this, @liancheng!

Test build #69016 has finished for PR 15976 at commit

retest this please

The last build failure was caused by irrelevant YARN tests.

Test build #69027 has finished for PR 15976 at commit

Also cc @davies and @sameeragarwal.

A similar alternative fix @yhuai proposed is to convert the underlying
    - processRow(result.aggregationBuffer, inputIterator.getValue)
    + // Since `inputIterator.getValue` is an `UnsafeRow` whose underlying buffer will be
    + // overwritten when `inputIterator` steps forward, we need to do a deep copy here.
    + processRow(result.aggregationBuffer, inputIterator.getValue.copy())
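As a minimal, self-contained sketch of the failure mode (using simplified stand-ins rather than Spark's actual `UnsafeRow` and external sorter), the following shows how retaining rows from a buffer-reusing iterator without `.copy()` makes every retained row silently change:

```scala
// Simplified stand-ins for UnsafeRow and the external sorter's iterator (hypothetical,
// for illustration only): one mutable buffer backs every row the iterator returns.
final class FakeRow(val values: Array[Int]) {
  def copy(): FakeRow = new FakeRow(values.clone()) // a deep copy detaches from the shared buffer
  override def toString: String = values.mkString("[", ", ", "]")
}

final class BufferReusingIterator(data: Seq[Seq[Int]]) extends Iterator[FakeRow] {
  private val buffer = new Array[Int](data.head.length)
  private val shared = new FakeRow(buffer)     // the same FakeRow instance is returned every time
  private var i = 0
  def hasNext: Boolean = i < data.length
  def next(): FakeRow = {
    data(i).copyToArray(buffer)                // overwrite the shared buffer in place
    i += 1
    shared
  }
}

object UnsafeRowCopyDemo {
  def main(args: Array[String]): Unit = {
    val input = Seq(Seq(1, 1), Seq(2, 2), Seq(3, 3))

    // Retaining rows without copying: every retained reference ends up showing the last value.
    val leaked = new BufferReusingIterator(input).toList
    println(leaked)                            // List([3, 3], [3, 3], [3, 3])

    // Deep-copying before retaining, analogous to `inputIterator.getValue.copy()` above.
    val copied = new BufferReusingIterator(input).map(_.copy()).toList
    println(copied)                            // List([1, 1], [2, 2], [3, 3])
  }
}
```

Mapping `_.copy()` over the iterator snapshots each row before the next overwrite, which is what the added `inputIterator.getValue.copy()` call does for the sort-based fallback path.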
So the problem is, during processRow we cache the input row somehow?
I think it's caused by `MutableProjection`? `MutableProjection` may keep a "pointer" that points to a memory region of an unsafe row. Maybe we can fix this bug with #15082?
nvm, #15082 needs some significant refactoring; we should get this fix into 2.1 first.
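To make the "pointer" concern above concrete, here is a hypothetical sketch (not Spark's actual `MutableProjection`) of a projection that copies references instead of bytes, so its target ends up sharing memory with an input row whose buffer is later reused:

```scala
// Hypothetical illustration only: the projection writes a reference to the input's
// backing buffer into its target, so the target "falsely shares" memory with a row
// whose buffer is later reused.
final class ByteCell(var bytes: Array[Byte])

final class ReferenceCopyingProjection(target: Array[ByteCell]) {
  def apply(input: Array[ByteCell]): Unit =
    // Copies references, not contents: target(0) now points at input(0)'s buffer holder.
    Array.copy(input, 0, target, 0, input.length)
}

object ProjectionPointerDemo {
  def main(args: Array[String]): Unit = {
    val inputRow  = Array(new ByteCell(Array[Byte](1, 2, 3)))
    val aggBuffer = new Array[ByteCell](1)

    new ReferenceCopyingProjection(aggBuffer).apply(inputRow)
    println(aggBuffer(0).bytes.mkString(","))  // 1,2,3

    inputRow(0).bytes = Array[Byte](9, 9, 9)   // the "input row" is reused for the next record
    println(aggBuffer(0).bytes.mkString(","))  // 9,9,9 -- the aggregation buffer changed underneath us
  }
}
```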
    // 3. Sort-based aggregation fallback must be triggered during evaluation.
    withSQLConf(
      SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
      SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
Not related to this PR, but the config name looks weird. How about `OBJECT_AGG_FALLBACK_TO_SORT_THRESHOLD`?
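For reference, a hedged, standalone sketch of how the two settings above can be combined to hit the reproduction conditions outside the suite; the literal config key strings, the `collect_list` query, and the local `SparkSession` setup are illustrative assumptions rather than the exact test code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ObjectHashAggFallbackRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("object-hash-agg-fallback-repro")
      // Assumed key string for SQLConf.USE_OBJECT_HASH_AGG.
      .config("spark.sql.execution.useObjectHashAggregateExec", "true")
      // Assumed key string for SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD;
      // "1" forces the sort-based fallback after a single hash-map entry.
      .config("spark.sql.objectHashAggregate.sortBased.fallbackThreshold", "1")
      .getOrCreate()
    import spark.implicits._

    // collect_list is a TypedImperativeAggregate, so it is planned with
    // ObjectHashAggregateExec when the flag above is on; the grouped column is an ArrayType.
    val result = spark.range(100)
      .select(($"id" % 10).as("key"), array($"id", $"id" + 1).as("arr"))
      .groupBy($"key")
      .agg(collect_list($"arr"))
      .collect()

    result.foreach(println)
    spark.stop()
  }
}
```

With the threshold at 1, the hash map hands its partial buffers to the external sorter almost immediately, which is the code path the `.copy()` fix targets.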
retest this please

Test build #69224 has started for PR 15976 at commit

Retest this please

Test build #69234 has finished for PR 15976 at commit

LGTM

thanks, merging to master!

@cloud-fan @dongjoon-hyun Thanks for the review!
## What changes were proposed in this pull request?

This PR fixes a random OOM issue that occurred while running `ObjectHashAggregateSuite`. The issue can be steadily reproduced under the following conditions:

1. The aggregation must be evaluated using `ObjectHashAggregateExec`;
2. There must be an input column whose data type involves `ArrayType` (an input column of `MapType` may even cause SIGSEGV);
3. Sort-based aggregation fallback must be triggered during evaluation.

The root cause is that while falling back to sort-based aggregation, we must sort and feed the already evaluated partial aggregation buffers living in the hash map to the sort-based aggregator using an external sorter. However, the underlying mutable byte buffer of the `UnsafeRow`s produced by the iterator of the external sorter is reused and may get overwritten as the iterator steps forward. After the last entry is consumed, the byte buffer points to a block of uninitialized memory filled with `5a`. Therefore, while reading an `UnsafeArrayData` out of the `UnsafeRow`, `5a5a5a5a` is treated as the array size, which triggers a memory allocation for a ridiculously large array and immediately blows up the JVM with an OOM.

To fix this issue, we only need to add `.copy()` accordingly.

## How was this patch tested?

New regression test case added in `ObjectHashAggregateSuite`.

Author: Cheng Lian <[email protected]>

Closes apache#15976 from liancheng/investigate-oom.
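As a rough sanity check on the numbers above (a sketch, not code from the patch), interpreting four bytes of the `5a` fill pattern as an array length asks for about 1.5 billion elements in a single allocation:

```scala
// Back-of-the-envelope check: why reading the 0x5a fill pattern as an array size
// immediately blows up the JVM with an OOM.
object FillPatternMath {
  def main(args: Array[String]): Unit = {
    val bogusLength = 0x5a5a5a5a                    // 1515870810 "elements"
    val bytesIfLongArray = bogusLength.toLong * 8L  // about 12.1e9 bytes for an Array[Long]
    println(s"bogus length: $bogusLength elements")
    println(s"allocation request: ${bytesIfLongArray / (1L << 30)} GiB")
    // Actually running `new Array[Long](bogusLength)` would throw
    // java.lang.OutOfMemoryError on any typical test JVM heap.
  }
}
```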