[SPARK-20229][SQL] add semanticHash to QueryPlan #17541

cloud-fan · 2017-04-05T17:56:22Z

What changes were proposed in this pull request?

Like Expression, QueryPlan should also have a semanticHash method, so that we can put plans into a hash map and look them up quickly. This PR refactors QueryPlan to follow Expression and puts all the normalization logic in QueryPlan.canonicalized, which makes semanticHash natural to implement.

Follow-up: improve CacheManager to leverage this semanticHash and speed up plan lookup, instead of iterating over all cached plans.
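The idea can be illustrated with a minimal sketch. It uses a toy plan type, not Spark's real QueryPlan, and every name in it (ToyPlan, Scan, Filter, ToyCache) is hypothetical. It only shows how deriving both sameResult and semanticHash from one canonicalized form allows a hash-based lookup instead of a linear scan.

```scala
import scala.collection.mutable

// Toy stand-in for a query plan; NOT Spark's QueryPlan. All names are hypothetical.
sealed trait ToyPlan {
  // Canonical form: cosmetic differences (here, expression IDs) are normalized away.
  def canonicalized: ToyPlan
  // Two plans produce the same result if their canonical forms are equal.
  def sameResult(other: ToyPlan): Boolean = canonicalized == other.canonicalized
  // Hash of the canonical form, so sameResult implies an equal semanticHash.
  def semanticHash(): Int = canonicalized.hashCode()
}

case class Scan(table: String, exprId: Long) extends ToyPlan {
  def canonicalized: ToyPlan = copy(exprId = 0L)
}

case class Filter(condition: String, child: ToyPlan) extends ToyPlan {
  def canonicalized: ToyPlan = Filter(condition, child.canonicalized)
}

// Hypothetical cache keyed by semanticHash, instead of scanning every cached plan.
class ToyCache {
  private val buckets = mutable.Map.empty[Int, List[(ToyPlan, String)]]

  def put(plan: ToyPlan, data: String): Unit = {
    val h = plan.semanticHash()
    buckets(h) = (plan, data) :: buckets.getOrElse(h, Nil)
  }

  // Only plans in the matching hash bucket are compared with sameResult.
  def lookup(plan: ToyPlan): Option[String] =
    buckets.getOrElse(plan.semanticHash(), Nil)
      .collectFirst { case (cached, data) if cached.sameResult(plan) => data }
}
```

Because hash collisions are possible, the lookup still confirms a candidate with sameResult before returning it.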

How was this patch tested?

Existing tests. Note that we don't need to test the semanticHash method separately: both sameResult and semanticHash are derived from the canonicalized plan, so once the existing tests prove sameResult is correct, we are good.

cloud-fan · 2017-04-05T17:56:33Z

cc @rxin @gatorsmile

cloud-fan · 2017-04-05T17:58:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

I'll revert it once #17537 is merged

cloud-fan · 2017-04-05T18:00:31Z

sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala

I think it was wrong previously: sameResult should be commutative, i.e. a.sameResult(b) should agree with b.sameResult(a).

cloud-fan · 2017-04-05T18:00:40Z

sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala

same here

SparkQA · 2017-04-05T18:14:21Z

Test build #75550 has finished for PR 17541 at commit 02f4a02.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-04-05T19:58:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala

why are we getting rid of this?

BroadcastMode is a field of BroadcastExchangeExec. Since we need to canonicalize a QueryPlan, the BroadcastMode also needs to be canonicalized.

gatorsmile · 2017-04-06T05:18:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

We only care about the table identifier. How about setting them to Nil?

dataCols = Nil, partitionCols = Nil

CatalogRelation has 3 asserts at the beginning, so we can't simply use Nil

cloud-fan · 2017-04-06T06:27:33Z

retest this please

SparkQA · 2017-04-06T06:49:10Z

Test build #75563 has finished for PR 17541 at commit 3cb7782.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-04-06T07:04:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

To use ==, do we need to override many equals and hashCode functions?

@gatorsmile Given that both sides are canonicalized, the default case class equals and hashCode methods should work, right?

@dilipbiswal Yes and no. If it is a case class, the Scala compiler defines a default equals for you. If it is a plain class, you need to define it yourself. For example,

    class NestedObj(i: Int)
    val m = new NestedObj(3)
    val n = new NestedObj(3)
    assert(m != n)

TreeNode requires its implementations to be Product; I think all of the LogicalPlans and SparkPlans are case classes.

This is also how Expression.semanticEquals works

Yes, we already assume it in the existing solution. I just realized it. : )

We also need to ensure that all the arguments of the case class are either primitive types or instances of classes with a defined equals.

    class NestedObj(i: Int)
    val m = new NestedObj(3)
    val n = new NestedObj(3)
    assert(m != n)

    case class Obj(i: NestedObj)
    val p = Obj(m)
    val q = Obj(n)
    assert(p != q)
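For contrast, here is a small sketch of the fix implied above: once the plain class defines equals and hashCode by hand, the compiler-generated equality of the enclosing case class works as expected. The class names are reused from the example for illustration only.

```scala
// A plain class that defines structural equality by hand.
class NestedObj(val i: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: NestedObj => this.i == that.i
    case _ => false
  }
  // Keep hashCode consistent with equals.
  override def hashCode(): Int = i.##
}

// The compiler-generated equals/hashCode of the case class delegate to NestedObj's.
case class Obj(i: NestedObj)
```

With this definition, `Obj(new NestedObj(3)) == Obj(new NestedObj(3))` holds, and the two values also share a hash code, which is what a hash-map lookup relies on.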

SparkQA · 2017-04-06T08:51:08Z

Test build #75567 has finished for PR 17541 at commit b261e71.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2017-04-06T16:49:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

@cloud-fan this is a very nice idea :-) I was running into the same issue when you asked me to try normalizing the attributes in my caching PR.

SparkQA · 2017-04-06T17:16:02Z

Test build #75577 has finished for PR 17541 at commit 6536cd6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-04-06T17:17:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala

I'll bring it back after #17552 is merged

SparkQA · 2017-04-06T19:09:09Z

Test build #75580 has finished for PR 17541 at commit 99f8ad3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-07T09:39:23Z

Test build #75599 has finished for PR 17541 at commit bb930b7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-08T10:55:11Z

Test build #75620 has finished for PR 17541 at commit 9305187.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-04-08T13:48:54Z

cc @gatorsmile any more comments?

cloud-fan · 2017-04-08T13:58:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

    Objects.hashCode(tableMeta.identifier, output)
  }

-  /** Only compare table identifier. */


Actually, we should compare more. For example, if the table schema is altered, the new table relation should not be considered the same as the old table relation, even after canonicalization. Also, it's tricky to remove the output of a plan during canonicalization, as the parent plan may rely on that output.

gatorsmile · 2017-04-08T17:38:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

+
+  /**
+   * Do some simple transformation on this plan before canonicalizing. Implementations can override
+   * this method to provide customer canonicalize logic without rewriting the whole logic.


customer -> customized

gatorsmile · 2017-04-08T17:57:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

+      case ar: AttributeReference =>
+        val ordinal = input.indexOf(ar.exprId)
+        if (ordinal == -1) {
+          ar


No need to normalize exprIds in this case?

No, this case is actually unexpected: the attribute should either reference one of the input attributes or represent new output at the top level. We keep it unchanged so that the equality check fails later.
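A minimal sketch of this normalization step, under the assumption of a toy attribute type (AttrRef and ExprIdNormalizer are hypothetical names, not Spark's AttributeReference or its helper): the expression ID is replaced by the attribute's position in the input, and an attribute that is not found is left as it is.

```scala
// Toy attribute reference; NOT Spark's AttributeReference. Names are hypothetical.
case class AttrRef(name: String, exprId: Long)

object ExprIdNormalizer {
  // Replace the exprId of an attribute with its position in `input`.
  // An attribute not found in `input` is returned unchanged, so a later equality
  // check on the canonicalized plans fails instead of matching by accident.
  def normalizeExprId(ar: AttrRef, input: Seq[AttrRef]): AttrRef = {
    val ordinal = input.indexWhere(_.exprId == ar.exprId)
    if (ordinal == -1) ar else ar.copy(exprId = ordinal.toLong)
  }
}
```

Two plans whose inputs differ only in the generated IDs then normalize the same reference to the same ordinal, while an unresolved reference keeps its original ID.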

gatorsmile · 2017-04-08T18:43:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

      Map(
        "Format" -> relation.fileFormat.toString,
-        "ReadSchema" -> outputSchema.catalogString,
+        "requiredSchema" -> requiredSchema.catalogString,


This is also for display in SparkPlanInfo? Keep the original name ReadSchema?

gatorsmile · 2017-04-09T05:24:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala

-  // expId can be different but the relation is still the same.
-  override lazy val cleanArgs: Seq[Any] = Seq(relation)
+  // Only care about relation when canonicalizing.
+  override def preCanonicalized: LogicalPlan = copy(catalogTable = None)


The builders of external data sources need to implement equals and hashCode if they want to utilize our cache management.

yes, it's the same behavior as before

gatorsmile · 2017-04-09T06:03:53Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala

+      partitionPruningPred.map(normalizeExprId(_, input)))(sparkSession)
  }
+
+  override def otherCopyArgs: Seq[AnyRef] = Seq(sparkSession)


This sounds like a bug fix.

gatorsmile · 2017-04-09T06:54:53Z

LGTM

SparkQA · 2017-04-09T18:49:08Z

Test build #75635 has finished for PR 17541 at commit 295acc9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2017-04-09T22:30:19Z

retest this please

SparkQA · 2017-04-10T00:30:51Z

Test build #75637 has finished for PR 17541 at commit 295acc9.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-04-10T00:32:46Z

retest this please

SparkQA · 2017-04-10T01:12:41Z

Test build #75639 has finished for PR 17541 at commit 295acc9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-04-10T02:20:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

+  lazy val canonicalized: PlanType = {
+    val canonicalizedChildren = children.map(_.canonicalized)
+    var id = -1
+    preCanonicalized.mapExpressions {


Do we need to consider non-deterministic expressions?

See Expression.semanticEquals: non-deterministic expressions will never be equal to other expressions.
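The rule can be sketched with a toy expression type. This is a simplified illustration, not Spark's Expression: ToyExpr, Literal and Rand are hypothetical names, and plain structural equality stands in for the comparison of canonicalized forms.

```scala
// Toy expression type; NOT Spark's Expression. Names are hypothetical.
sealed trait ToyExpr {
  def deterministic: Boolean
  // Semantically equal only if both sides are deterministic and structurally equal.
  def semanticEquals(other: ToyExpr): Boolean =
    deterministic && other.deterministic && this == other
}

case class Literal(value: Int) extends ToyExpr { val deterministic = true }
case class Rand(seed: Long) extends ToyExpr { val deterministic = false }
```

Here `Rand(7) == Rand(7)` is structurally true, yet `Rand(7).semanticEquals(Rand(7))` is false, so a plan containing such an expression never matches another plan.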

cloud-fan · 2017-04-10T02:41:34Z

retest this please

SparkQA · 2017-04-10T05:07:14Z

Test build #75640 has finished for PR 17541 at commit 295acc9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-04-10T05:36:23Z

thanks for the review, merging to master!


cloud-fan force-pushed the plan-semantic branch from 02f4a02 to 3cb7782 on April 6, 2017 06:30

cloud-fan force-pushed the plan-semantic branch from 3cb7782 to b261e71 on April 6, 2017 08:30

cloud-fan force-pushed the plan-semantic branch from b261e71 to 6536cd6 on April 6, 2017 15:50

cloud-fan force-pushed the plan-semantic branch from 6536cd6 to 99f8ad3 on April 6, 2017 17:50

cloud-fan force-pushed the plan-semantic branch from 99f8ad3 to bb930b7 on April 7, 2017 08:20

Commit 9305187: add semanticHash to QueryPlan

cloud-fan force-pushed the plan-semantic branch from bb930b7 to 9305187 on April 8, 2017 08:32

Commit 295acc9: address comments

asfgit closed this in 3d7f201 on Apr 10, 2017

gatorsmile mentioned this pull request on May 14, 2017: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should behave correctly for sameResult #17975 (Closed)

prithvikannan mentioned this pull request on Apr 21, 2023: [Feature branch] Autologging for datasets mlflow/mlflow#8202 (Merged)
