Conversation

@mgaido91
Contributor

What changes were proposed in this pull request?

`EnsureRequirements`, in its `reorder` method, currently assumes that the same key appears only once in the join condition. That is not always the case, and when the assumption does not hold, the method returns an incorrect plan that produces wrong query results.
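
A minimal sketch of the fix in plain Scala (hypothetical, simplified away from Spark's actual `EnsureRequirements` code; `reorderKeys` and the key types are illustrative): track which key positions have already been consumed, so a key that appears twice matches two distinct positions instead of the same one twice.

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

// Reorders (leftKeys, rightKeys) pairs so the left keys follow
// `expectedOrder`. The set of already-picked indexes ensures a
// duplicated key consumes a fresh position each time, instead of
// re-matching the first occurrence and dropping another pair.
def reorderKeys[K](
    leftKeys: Seq[K],
    rightKeys: Seq[K],
    expectedOrder: Seq[K]): (Seq[K], Seq[K]) = {
  val leftBuf = ArrayBuffer[K]()
  val rightBuf = ArrayBuffer[K]()
  val pickedIndexes = mutable.Set[Int]()
  expectedOrder.foreach { key =>
    val idx = leftKeys.indices
      .find(i => !pickedIndexes.contains(i) && leftKeys(i) == key)
      .getOrElse(sys.error(s"no remaining pair for key $key"))
    pickedIndexes += idx
    leftBuf += leftKeys(idx)
    rightBuf += rightKeys(idx)
  }
  (leftBuf.toSeq, rightBuf.toSeq)
}

// With a duplicated left key "a" joined to two different right keys,
// both right keys survive the reordering:
val (l, r) = reorderKeys(Seq("a", "a"), Seq("b", "c"), Seq("a", "a"))
```

Without `pickedIndexes`, both lookups would match index 0 and the result would pair `"a"` with `"b"` twice, silently dropping the `"c"` equality — which is how the wrong plan arises.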

How was this patch tested?

added UT

```scala
    currentOrderOfKeys: Seq[Expression]): (Seq[Expression], Seq[Expression]) = {
  val leftKeysBuffer = ArrayBuffer[Expression]()
  val rightKeysBuffer = ArrayBuffer[Expression]()
  val alreadyUsedIndexes = mutable.Set[Int]()
```
Contributor

nit: maybe `pickedIndexes`?

@cloud-fan
Contributor

good catch! thanks!

@SparkQA

SparkQA commented Jun 11, 2018

Test build #91666 has finished for PR 21529 at commit 06858cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
withSQLConf(("spark.sql.shuffle.partitions", "1"),
    ("spark.sql.constraintPropagation.enabled", "false"),
    ("spark.sql.autoBroadcastJoinThreshold", "-1")) {
  val df1 = spark.range(100)
```
Contributor

we should make sure `range` has more than one partition, otherwise we can't reproduce the bug.

@cloud-fan
Contributor

cloud-fan commented Jun 11, 2018

I was surprised that this bug is present even if the table is not bucketed, and then I found another problem in the code: `reorderJoinPredicates` transforms the plan again. `reorderJoinPredicates` is called during plan transformation, and we should not double-transform the plan.

This gives the bug a much larger impact: when we do a shuffle join, we add a shuffle exchange to both join sides, which then triggers `reorderJoinPredicates` because of the double-transformation problem.

@mgaido91 let's also remove the `transformUp` in `reorderJoinPredicates`

also cc @tejasapatil
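
The double-transformation problem can be illustrated with a toy tree (hypothetical; this is not Spark's `TreeNode` API): if a rule that is already being applied via `transformUp` itself calls `transformUp` on the node it receives, every subtree is re-visited once per ancestor.

```scala
// Toy tree with a transformUp that applies a rule bottom-up.
case class Node(name: String, children: Seq[Node] = Nil) {
  def transformUp(rule: Node => Node): Node =
    rule(copy(children = children.map(_.transformUp(rule))))
}

var visited = List.empty[String]
val countingRule: Node => Node = { n => visited ::= n.name; n }

// A "bad" rule that re-transforms the whole subtree it is handed,
// analogous to reorderJoinPredicates calling transformUp internally:
val badRule: Node => Node = { n => n.transformUp(countingRule) }

val tree = Node("root", Seq(Node("child")))
tree.transformUp(badRule)
// "child" is visited once for itself and again under "root".
```

Each extra level of nesting adds another redundant pass over the leaves, which is why a rule running inside a transformation should only rewrite the node it is given.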

@cloud-fan
Contributor

one followup: we should improve our golden file SQL tests to run the same queries with different configs. This is pretty important for the join tests; otherwise we only test broadcast join.

@mgaido91
Contributor Author

thanks for your review @cloud-fan. Nice catch on the `transformUp`! I addressed all your comments.

As far as the followup is concerned, we should decide whether we want to allow different configs to produce different results. In the second case we have more flexibility, but we also have to be much more careful not to introduce bugs when different configs should not produce different results. Moreover, we would also have many more golden files...
What do you think?

@SparkQA

SparkQA commented Jun 12, 2018

Test build #91705 has finished for PR 21529 at commit 341f1b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 12, 2018

Test build #91710 has finished for PR 21529 at commit 341f1b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
}

test("SPARK-24495: EnsureRequirements can return wrong plan when reusing the same key in join") {
  withSQLConf(("spark.sql.shuffle.partitions", "1"),
```
Contributor

let's not use hard-coded config names; use `SQLConf.SHUFFLE_PARTITIONS` instead.

a nit: we usually use `a -> b, c -> d, ...` to specify config pairs.

```scala
withSQLConf(("spark.sql.shuffle.partitions", "1"),
    ("spark.sql.constraintPropagation.enabled", "false"),
    ("spark.sql.autoBroadcastJoinThreshold", "-1")) {
  val df1 = spark.range(100).repartition(2, $"id", $"id")
```
Contributor

a simpler way: `spark.range(0, 100, 1, numPartitions = 2)`

Contributor Author

this way would not be ok, as we would have a `RangePartitioning`, while the issue appears only with `HashPartitioning`

Contributor

no, the issue can happen with range partitioning (because of the double-transformation issue); the code in the ticket can reproduce the bug and it has no hash partitioning.

Contributor Author

@mgaido91 mgaido91 Jun 13, 2018

yes, but if we remove the `transformUp` as you correctly suggested, and we do not use `HashPartitioning`, then we do not test the proper behavior of the `reorder` method.

Anyway, I added another test to `PlannerSuite` which checks the behavior of the `reorder` method, so I will follow your suggestion here, thanks.

```scala
    ("spark.sql.constraintPropagation.enabled", "false"),
    ("spark.sql.autoBroadcastJoinThreshold", "-1")) {
  val df1 = spark.range(100).repartition(2, $"id", $"id")
  val df2 = spark.range(100).select(($"id" * 2).as("b1"), (- $"id").as("b2"))
```
Contributor

`($"id" * 2).as("b1")` -> `$"id".as("b1")`, to minimize the test.

@cloud-fan
Contributor

I think we need to support both: testing different physical operators needs the same result, while testing something like a type coercion mode needs different results. Anyway, let's discuss it in the followup.

@gatorsmile
Member

@mgaido91 Could you help improve the test coverage of joins in org.apache.spark.sql.JoinSuite? Due to the incomplete test coverage, we did not discover this bug earlier. We need to add more test cases to cover the different join algorithms.

@mgaido91
Contributor Author

I think we need to support both. testing different physical operators needs same result, testing something like type coercion mode needs different result. Anyway let's discuss it in the followup.

ok @cloud-fan, I'll try and send a proposal in the next days.

@mgaido91 Could you help improve the test coverage of joins org.apache.spark.sql.JoinSuite? Due to the incomplete test case coverage, we did not discover this at the very beginning. We need to add more test cases to cover different join algorithms.

Sure, @gatorsmile, I am happy to. Do you mean running the existing tests for every type of join or do you have something different in mind? Thanks.

@cloud-fan
Contributor

cloud-fan commented Jun 13, 2018

I think for this PR, apart from the end-to-end test checking the result, we should also have a unit test in PlannerSuite covering this case for EnsureRequirements. In the followup we can add config support to the golden file tests and add more test cases.

```scala
outputPlan match {
  case SortMergeJoinExec(leftKeys, rightKeys, _, _, _, _) =>
    assert(leftKeys == Seq(exprA, exprA))
    assert(rightKeys.contains(exprB) && rightKeys.contains(exprC))
```
Contributor

@cloud-fan cloud-fan Jun 13, 2018

is it better to check `rightKeys == Seq(exprB, exprC)`?

```scala
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
  val df1 = spark.range(0, 100, 1, 2)
  val df2 = spark.range(100).select($"id".as("b1"), (- $"id").as("b2"))
  val res = df1.join(df2, $"id" === $"b1" && $"id" === $"b2")
```
Contributor

one difference between this test and the code in the JIRA ticket is that the code in the ticket has a Project above the join, to trigger the double-transformation issue. We should add a Project and make sure this test does fail without this patch.

```scala
  }
}

test("SPARK-24495: EnsureRequirements can return wrong plan when reusing the same key in join") {
```
Contributor

as an end-to-end test, maybe a better name is: SPARK-24495: Join may return wrong result when having duplicated equal-join keys

@cloud-fan
Contributor

LGTM except some minor comments about test

@SparkQA

SparkQA commented Jun 13, 2018

Test build #91773 has finished for PR 21529 at commit 40abcff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

LGTM

@cloud-fan
Contributor

@mgaido91 can you fix the conflicts? thanks!

@SparkQA

SparkQA commented Jun 13, 2018

Test build #91790 has finished for PR 21529 at commit 6553c27.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91797 has finished for PR 21529 at commit 6553c27.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91826 has finished for PR 21529 at commit 6ef4f0d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91843 has finished for PR 21529 at commit 6ef4f0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Jun 14, 2018
…ng equal keys

`EnsureRequirement` in its `reorder` method currently assumes that the same key appears only once in the join condition. This of course might not be the case, and when it is not satisfied, it returns a wrong plan which produces a wrong result of the query.

added UT

Author: Marco Gaido <[email protected]>

Closes #21529 from mgaido91/SPARK-24495.

(cherry picked from commit fdadc4b)
Signed-off-by: Xiao Li <[email protected]>
@asfgit asfgit closed this in fdadc4b Jun 14, 2018
@mgaido91
Contributor Author

Thanks @gatorsmile. Sorry, may I ask what you think about #21529 (comment)? Thanks.

@gatorsmile
Member

Adding new queries to SQLQueryTestSuite is the best way to do it in the current infrastructure. Please do your best to cover all the join algorithms for different input data and join types.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…ng equal keys
