[SPARK-17528][SQL] MutableProjection should not cache content from the input row #15082
This code is duplicated four times; should we introduce a trait for it? (I wasn't able to write a single util function for it because the copies differ slightly.)
We could move this into a util function right:
def copyValue(value: Any): Any = value match {
  case v: UTF8String => v.clone()
  case v: InternalRow => v.copy()
  case v: ArrayData => v.copy()
  case v: MapData => v.copy()
  case v => v
}
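To make the suggestion concrete without pulling in Spark, here is a minimal, self-contained sketch of the same dispatch-by-type idea. `MutableText` and `MutableStruct` are hypothetical stand-ins for types like UTF8String and InternalRow; they are not Spark classes, and the real util would match on the Spark types shown above.

```scala
object CopyValueSketch {
  // Hypothetical stand-in for a mutable string type backed by shared memory.
  final class MutableText(val chars: Array[Char]) {
    def copy(): MutableText = new MutableText(chars.clone())
    override def toString: String = new String(chars)
  }

  // Hypothetical stand-in for a struct/row type that can hold nested mutable values.
  final class MutableStruct(val fields: Array[Any]) {
    // Copy each field recursively so nested mutable values are not shared.
    def copy(): MutableStruct = new MutableStruct(fields.map(copyValue))
  }

  // Mutable values are copied; everything else is returned as-is.
  def copyValue(value: Any): Any = value match {
    case v: MutableText   => v.copy()
    case v: MutableStruct => v.copy()
    case v                => v
  }
}
```

With this shape, each call site only needs `copyValue(x)` before storing `x`, and the fall-through case keeps primitives and other immutable values free of extra allocation.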
Is this a "wrong answer" correctness bug? If so, let me label accordingly on the JIRA.
Test build #65328 has finished for PR 15082 at commit
@JoshRosen I think this is a potential bug (it is not exposed right now).
If $row could do the copy inside update, then we don't need to do the copy here, right?
Maybe it's time to review all the MutableRow and MutableProjection implementations to see where the best place to do the copy is.
After thinking more about it, this doesn't seem to be a problem and we don't need to fix it. Currently
This PR tried to fix
My understanding of the main concern of closing this PR is that:
After some more discussion, I'm going to reopen it:
Test build #65386 has finished for PR 15082 at commit
    externalRow should be theSameInstanceAs externalRow.copy()
  }

  it("copy should return same ref for internal rows") {
Should we also fix the external row? Why should copy return the same ref?
Copy returned the same ref because it is supposed to be immutable. See #10553 for more context.
Ah, got it, but that's not true for internal rows, right? They can be mutable, so it's safe to remove this test.
Well MutableRow is mutable, so it shouldn't hold for those. The only exception is GenericInternalRow.
That being said, I don't mind if you remove/modify the test.
As long as InternalRow can have a mutable implementation, an InternalRow is no longer guaranteed to be immutable, because it can have a struct field whose value is a MutableRow.
As discussed offline: I don't see the point of having an immutable GenericInternalRow if we cannot guarantee its immutability. We could just make every InternalRow a mutable one, and simplify the class structure in the process. I am not sure if we should make that part of the current PR though.
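The point about nested mutability can be illustrated with a small, self-contained sketch. `FrozenRow` and `MutableStruct` are hypothetical stand-ins (not Spark classes): the outer row exposes no setters and therefore returns itself from `copy()`, yet a mutable value stored in one of its fields still changes underneath the "copy".

```scala
object NestedMutabilityDemo {
  // Hypothetical mutable struct, standing in for a MutableRow used as a field value.
  final class MutableStruct(var value: Int)

  // Hypothetical "immutable" row: no setters, so copy() returns the same reference.
  final class FrozenRow(private val fields: Vector[Any]) {
    def get(i: Int): Any = fields(i)
    def copy(): FrozenRow = this
  }

  // Returns (is the copy the same instance?, nested value seen through the copy).
  def observedAfterMutation(): (Boolean, Int) = {
    val nested = new MutableStruct(1)
    val row = new FrozenRow(Vector(nested))
    val snapshot = row.copy()   // taken while nested.value == 1
    nested.value = 2            // mutate the nested struct afterwards
    (snapshot eq row, snapshot.get(0).asInstanceOf[MutableStruct].value)
  }
}
```

In this sketch the snapshot observes the later write, which is why a same-reference `copy()` is only safe if every reachable field value is itself immutable.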
Test build #65418 has finished for PR 15082 at commit
retest this please
Test build #65561 has finished for PR 15082 at commit
What's the status of this issue? I see that it's currently targeted for 2.0.1 in JIRA and wanted to ping to make sure that this doesn't miss the cut in case we prepare an RC soon.
I re-targeted it to 2.1 only.
Test build #65807 has finished for PR 15082 at commit
What changes were proposed in this pull request?
For performance reasons, UnsafeRow.getString, getStruct, etc. return a "pointer" that points to a memory region of this unsafe row. This makes the unsafe projection a little dangerous, because all of its output rows share one instance.

When we implement SQL operators, we should be careful not to cache the input rows, because they may be produced by an unsafe projection in a child operator, and thus their content may change over time.

However, GenerateMutableProjection breaks this rule and may cache the content of input rows. The sort-based aggregate uses it, but the bug is not exposed because the sort-based aggregate always does an extra projection on the input row.

This PR fixes the bug in GenerateMutableProjection and some related bugs in complex data copy, and removes the useless projection in the sort-based aggregate.

How was this patch tested?
Some new tests.
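The hazard described in the PR description can be reproduced in miniature without Spark. The sketch below is an illustration under simplifying assumptions: `ReusingProjection` is a hypothetical stand-in for a projection whose output rows all share one buffer, and a "row" is just a byte array. It is not the code changed by this PR.

```scala
object ReusedRowDemo {
  // Hypothetical stand-in for a projection that writes every output into one shared buffer.
  final class ReusingProjection {
    private val buffer = new Array[Byte](8)
    def apply(value: String): Array[Byte] = {
      java.util.Arrays.fill(buffer, 0.toByte)
      val bytes = value.getBytes("UTF-8")
      System.arraycopy(bytes, 0, buffer, 0, math.min(bytes.length, buffer.length))
      buffer // the same instance is returned on every call
    }
  }

  // Decode the zero-padded buffer back into a string.
  def decode(b: Array[Byte]): String =
    new String(b, "UTF-8").takeWhile(_ != '\u0000')

  // Caches the reference: later projections overwrite what was cached.
  def cacheWithoutCopy(inputs: Seq[String]): Seq[String] = {
    val proj = new ReusingProjection
    val cached = inputs.map(proj(_))
    cached.map(decode)
  }

  // Copies before caching: each cached value keeps its own content.
  def cacheWithCopy(inputs: Seq[String]): Seq[String] = {
    val proj = new ReusingProjection
    val cached = inputs.map(s => proj(s).clone())
    cached.map(decode)
  }
}
```

For the inputs "a" and "b", `cacheWithoutCopy` ends up with two references to the same buffer, so both cached entries decode to the last value written, while `cacheWithCopy` preserves each value.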