Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

For performance reasons, UnsafeRow.getString, getStruct, etc. return a "pointer" that points to a memory region of this unsafe row. This makes the unsafe projection a little dangerous, because all of its output rows share one instance.

When we implement SQL operators, we should be careful to not cache the input rows because they may be produced by unsafe projection from child operator and thus its content may change overtime.

However, GenerateMutableProjection breaks this and may cache the content in input rows. The sort based aggregate use it, but this bug is not exposed because sort based aggregate always do an extra projection for the input row.

This PR fixes the bug of GenerateMutableProjection and some related bugs in complex data copy, and remove the useless projection in sort based aggregate.

How was this patch tested?

some new tests.

@cloud-fan
Copy link
Contributor Author

cc @davies @yhuai @liancheng @clockfly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is duplicated 4 times, should we introduce a trait for it? (I'm not able to create a util function for it because these codes differ a little)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move this into a util function right:

def copyValue(value: Any): Any = value match {
  case v: UTF8String => v.clone()
  case v: InternalRow => v.copy()
  case v: ArrayData => v.copy()
  case v: MapData => v.copy()
  case v => v
}

@JoshRosen
Copy link
Contributor

Is this a "wrong answer" correctness bug? If so, let me label accordingly on the JIRA.

@SparkQA
Copy link

SparkQA commented Sep 13, 2016

Test build #65328 has finished for PR 15082 at commit 75b5749.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Sep 14, 2016

@JoshRosen I think this is a potential bug (not now).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If $row could do the copy inside update, then we do need to do the copy here, right?

Maybe it's time to check all the MutableRow, MutableProjection, to see where is the best place to do the copy.

@cloud-fan
Copy link
Contributor Author

After thinking more about it, it seems not a problem and we don't need to fix it. Currently MutableProjection is used in 3 places:

  1. hash based aggregate. It's fine because hash aggregate only supports primitive type buffer, so it doesn't have the copy problem.
  2. sort based aggregate. It's also fine because we always apply a safe projection to the input row before process it. I have a PR to fix the comment of it: [SQL][minor] correct the comment of SortBasedAggregationIterator.safeProj #15095
  3. window execution. It's also fine because its input rows are come from RowBuffer, not unsafe rows produced by unsafe projection.

This PR tried to fix MutableProjection so that we can remove the safe projection in sort based aggregate, but I was wrong because MutableProjection is not the only way to update the aggregation buffer, ImperativeAggregate can also update the buffer via the update and merge API, and it's not a good idea to fix all ImperativeAggregate to do copy correctly while updating the aggregate buffer(and it's not future proof, we may add more ImperativeAggragate implementations and miss this requirement).

@cloud-fan cloud-fan closed this Sep 14, 2016
@liancheng
Copy link
Contributor

My understanding of the main concern of closing this PR is that:

  1. Although this issue can be potentially dangerous, the current code work fine without fixing this issue.
  2. We still can't remove the safe projection in SortAggregateExec after fixing this issue (as @cloud-fan explained above), thus data may be copied twice and leads to performance regression.

@cloud-fan cloud-fan reopened this Sep 14, 2016
@cloud-fan
Copy link
Contributor Author

After some more discussion, I'm going to reopen it:

  1. InternalRow.copy breaks the copy contract and the fix in this PR is necessary
  2. MutableProjection may bite us in the future if we don't fix it
  3. It turns out Collect is the only ImperativeAggregate that we need to fix while updating aggregation buffer, so it's not a lot of work.

@SparkQA
Copy link

SparkQA commented Sep 14, 2016

Test build #65386 has finished for PR 15082 at commit f83e291.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

externalRow should be theSameInstanceAs externalRow.copy()
}

it("copy should return same ref for internal rows") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also fix the external row? Why copy should return same ref?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy returned the same ref because it is supposed to be immutable. See #10553 for more context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it, but it's not true for internal row right? It can be mutable so it's safe to remove this test.

Copy link
Contributor

@hvanhovell hvanhovell Sep 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well MutableRow is mutable, so it shouldn't hold for those. The only exception is GenericInternalRow.

That being said, I don't mind if you remove/modify the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as InternalRow can have mutable implementation, InternalRow is not immutable anymore, because it can have a struct field, whose value can be a MutableRow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline: I don't see the point of having an immutable GenericInternalRow if we cannot guarantee its immutability. We could just make every InternalRow a mutable one, and simplify the class structure in the process. I am not sure if we should make that part of the current PR though.

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #65418 has finished for PR 15082 at commit da6a285.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65561 has finished for PR 15082 at commit da6a285.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

What's the status of this issue? I see that it's currently targeted for 2.0.1 in JIRA and wanted to ping to make sure that this doesn't miss the cut in case we prepare an RC soon.

@cloud-fan
Copy link
Contributor Author

I re-targeted it to 2.1 only.

@SparkQA
Copy link

SparkQA commented Sep 23, 2016

Test build #65807 has finished for PR 15082 at commit c56de6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants