Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Like Expression, QueryPlan should also have a semanticHash method, so that we can put plans into a hash map and look them up quickly. This PR refactors QueryPlan to follow Expression and puts all the normalization logic in QueryPlan.canonicalized, so that it's very natural to implement semanticHash.

Follow-up: improve CacheManager to leverage this semanticHash and speed up plan lookup, instead of iterating over all cached plans.
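
In rough terms, the end result is that semantic comparison and hashing both reduce to the canonicalized plan (a schematic sketch of the intent, not the verbatim patch):

```scala
// Inside QueryPlan[PlanType] (schematic): all normalization lives in
// `canonicalized`, so semantic equality and hashing become one-liners.
def sameResult(other: PlanType): Boolean = this.canonicalized == other.canonicalized

def semanticHash(): Int = this.canonicalized.hashCode()
```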

How was this patch tested?

Existing tests. Note that we don't need to test the semanticHash method directly; once the existing tests prove sameResult is correct, we are good.

@cloud-fan
Contributor Author

cc @rxin @gatorsmile

Contributor Author

I'll revert it once #17537 is merged

Contributor Author

I think it was wrong previously; sameResult should be commutative.

Contributor Author

same here

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75550 has finished for PR 17541 at commit 02f4a02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@rxin Apr 5, 2017

why are we getting rid of this?

Contributor Author

BroadcastMode is a field of BroadcastExchangeExec. Since we need to canonicalize a QueryPlan, the BroadcastMode also needs to be canonicalized.
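
Schematically, the issue looks like this (a simplified sketch with a stand-in class name, not the code in this patch):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// A broadcast mode that carries expressions (e.g. the hash keys) has to
// canonicalize those expressions too, otherwise two semantically equal
// exchange nodes would not compare equal after canonicalization.
// `HashKeysMode` is a stand-in for the real HashedRelationBroadcastMode.
case class HashKeysMode(keys: Seq[Expression]) {
  def canonicalized: HashKeysMode = copy(keys = keys.map(_.canonicalized))
}
```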

Member

We only care about the table identifier. How about setting them to Nil?

dataCols = Nil,
partitionCols = Nil

Contributor Author

CatalogRelation has 3 asserts at the beginning, so we can't simply use Nil

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75563 has finished for PR 17541 at commit 3cb7782.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

To use ==, do we need to override many equals and hashCode functions?

Contributor

@gatorsmile Given that both sides are canonicalized, the default case class equals and hashCode methods should work, right?

Member

@dilipbiswal Yes and no. If it is a case class, the Scala compiler will generate a default equals for you. If it is a plain class, you need to define it yourself. For example,

class NestedObj(i: Int) // a plain class: equals falls back to reference equality

val m = new NestedObj(3)
val n = new NestedObj(3)

assert(m != n) // different instances, so not equal despite the same field value

Contributor Author

TreeNode requires its implementations to be Product, and I think all of the LogicalPlans and SparkPlans are case classes.

Contributor Author

This is also how Expression.semanticEquals works
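
For reference, Expression does it roughly like this (a paraphrased sketch, not a verbatim copy of the Catalyst code):

```scala
// Sketch of Expression.semanticEquals: both sides must be deterministic,
// and the comparison is done on the canonicalized forms.
def semanticEquals(other: Expression): Boolean =
  deterministic && other.deterministic && canonicalized == other.canonicalized
```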

Member

Yes, we already assume it in the existing solution. I just realized it. : )

We also need to ensure all the arguments of the case class are primitive types or come from classes with a defined equals.

class NestedObj(i: Int) // plain class: no structural equals

val m = new NestedObj(3)
val n = new NestedObj(3)

assert(m != n)

case class Obj(i: NestedObj) // case class equals compares its fields with ==

val p = Obj(m)
val q = Obj(n)

assert(p != q) // the fields are distinct NestedObj instances, so the case classes differ too
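
Concretely (a made-up continuation of the example, not code from this PR): once the nested class defines structural equals and hashCode, both asserts flip and the enclosing case class compares equal again.

```scala
class NestedObj(val i: Int) {
  // structural equality instead of the default reference equality
  override def equals(other: Any): Boolean = other match {
    case that: NestedObj => this.i == that.i
    case _ => false
  }
  override def hashCode(): Int = i.hashCode
}

case class Obj(o: NestedObj)

assert(new NestedObj(3) == new NestedObj(3))
assert(Obj(new NestedObj(3)) == Obj(new NestedObj(3)))
```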

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75567 has finished for PR 17541 at commit b261e71.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@cloud-fan this is a very nice idea :-) I was running into the same issue when you asked me to try normalizing the attributes in my caching PR.

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75577 has finished for PR 17541 at commit 6536cd6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

I'll bring it back after #17552 is merged

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75580 has finished for PR 17541 at commit 99f8ad3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 7, 2017

Test build #75599 has finished for PR 17541 at commit bb930b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 8, 2017

Test build #75620 has finished for PR 17541 at commit 9305187.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cc @gatorsmile any more comments?

Objects.hashCode(tableMeta.identifier, output)
}

/** Only compare table identifier. */
Contributor Author

Actually we should compare more: e.g. if the table schema is altered, the new table relation should not be considered the same as the old one, even after canonicalization. Also, it's tricky to remove the output of a plan during canonicalization, as the parent plan may rely on that output.


/**
* Do some simple transformation on this plan before canonicalizing. Implementations can override
* this method to provide customer canonicalize logic without rewriting the whole logic.
Member

customer -> customized

case ar: AttributeReference =>
val ordinal = input.indexOf(ar.exprId)
if (ordinal == -1) {
ar
Member

No need to normalize exprIds in this case?

Contributor Author

No, actually this is unexpected: the attribute should either reference input attributes, or represent new output at the top level. Keep it unchanged so that the equality check will fail later.
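
For context, the normalization being discussed looks roughly like this (a simplified sketch of the helper, not the verbatim patch):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSeq, ExprId, Expression}

// Attributes that resolve to an input attribute are rewritten to a
// position-based ExprId; anything else is left untouched so that the later
// equality check fails for such (unexpected) attributes.
def normalizeExprId[T <: Expression](e: T, input: AttributeSeq): T = {
  e.transform {
    case ar: AttributeReference =>
      val ordinal = input.indexOf(ar.exprId)
      if (ordinal == -1) ar else ar.withExprId(ExprId(ordinal))
  }.asInstanceOf[T]
}
```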

Map(
"Format" -> relation.fileFormat.toString,
"ReadSchema" -> outputSchema.catalogString,
"requiredSchema" -> requiredSchema.catalogString,
Member

This is also for display in SparkPlanInfo? Keep the original name ReadSchema?

// expId can be different but the relation is still the same.
override lazy val cleanArgs: Seq[Any] = Seq(relation)
// Only care about relation when canonicalizing.
override def preCanonicalized: LogicalPlan = copy(catalogTable = None)
Member

The builders of external data sources need to implement equals and hashCode if they want to utilize our cache management.

Contributor Author

yes, it's the same behavior as before
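
As an illustration (the class and field names are made up, not from this PR), an external relation can opt in by defining structural equality on whatever identifies the data:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.StructType

// Without a custom equals/hashCode, two instances scanning the same path fall
// back to reference equality and would never match during cache lookup.
class MyPathRelation(val path: String, ctx: SQLContext) extends BaseRelation {
  override def sqlContext: SQLContext = ctx
  override def schema: StructType = StructType(Nil)

  override def equals(other: Any): Boolean = other match {
    case that: MyPathRelation => this.path == that.path
    case _ => false
  }
  override def hashCode(): Int = path.hashCode
}
```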

partitionPruningPred.map(normalizeExprId(_, input)))(sparkSession)
}

override def otherCopyArgs: Seq[AnyRef] = Seq(sparkSession)
Member

This sounds like a bug fix.

@gatorsmile
Member

LGTM

@SparkQA

SparkQA commented Apr 9, 2017

Test build #75635 has finished for PR 17541 at commit 295acc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@SparkQA

SparkQA commented Apr 10, 2017

Test build #75637 has finished for PR 17541 at commit 295acc9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Apr 10, 2017

Test build #75639 has finished for PR 17541 at commit 295acc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

lazy val canonicalized: PlanType = {
val canonicalizedChildren = children.map(_.canonicalized)
var id = -1
preCanonicalized.mapExpressions {
Member

Do we need to consider non-deterministic expressions?

Contributor Author

See Expression.semanticEquals: non-deterministic expressions will never be equal to other expressions.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 10, 2017

Test build #75640 has finished for PR 17541 at commit 295acc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master!
