
Conversation

@chenghao-intel
Contributor

  • Better integrate TreeNode objects with the language's built-in collection utilities.

e.g.

val map: Map[Expression, Expression] = ...
map.get(expr)
map.contains(expr)

// or
val set: Set[SparkPlan] = ...
if (set.contains(sparkPlan)) {
  ...
}

Currently we may not be able to make TreeNode objects work with the built-in collections as the code above shows, because the equals and hashCode methods of the case class AttributeReference compare objects literally rather than semantically (see the sketch after this list).

  • TreeNode transformation APIs don't work when a rule returns an identical instance, because of the fastEquals guard; we loosen the guard by checking instance reference equality only.
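
A minimal sketch of the first problem, assuming the AttributeReference signature quoted later in this thread (name and dataType in the first parameter list, exprId in the second) and using withName only for illustration:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()   // gets a fresh exprId
val b = a.withName("A")                          // same exprId, different (case-preserved) name

val set: Set[Expression] = Set(a)
set.contains(b)   // false today: equals compares the name literally,
                  // even though a and b refer to the same underlying attribute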

@SparkQA

SparkQA commented Jun 2, 2015

Test build #33988 has finished for PR 6587 at commit 869d7cf.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2015

Test build #33991 has finished for PR 6587 at commit 7aaa837.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34043 has finished for PR 6587 at commit 6b4d353.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I agree this change can make things simpler, and I have actually tried to do it before... But think about AttributeReference.withName: it won't have any effect, as the tree node library will regard it as no change and keep the old tree. If AttributeReference should not care about the name, then we need to figure out why we need AttributeReference.withName.
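
A hedged illustration of that concern (plan is a hypothetical LogicalPlan; withName keeps the exprId but changes the name):

import org.apache.spark.sql.catalyst.expressions.AttributeReference

val renamed = plan transformExpressions {
  case a: AttributeReference if a.name == "tmp" => a.withName("result")
}
// If equals ignored the name, a.withName("result") would be "equal" to a, so the
// fastEquals guard would keep the old tree and the rename would be silently dropped.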

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34056 has finished for PR 6587 at commit 5245542.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34070 has finished for PR 6587 at commit 560da1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34083 has finished for PR 6587 at commit 27ef8f3.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@cloud-fan I've updated TreeNode.fastEquals; it doesn't rely on equals any more, which should solve the problem you are talking about.

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34150 has finished for PR 6587 at commit 35f1892.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ElementwiseProduct(val scalingVec: Vector) extends VectorTransformer
    • trait TypeCheckResult
    • case class TypeCheckFailure(message: String) extends TypeCheckResult
    • abstract class UnaryArithmetic extends UnaryExpression
    • case class UnaryMinus(child: Expression) extends UnaryArithmetic
    • case class Sqrt(child: Expression) extends UnaryArithmetic
    • case class Abs(child: Expression) extends UnaryArithmetic
    • case class BitwiseNot(child: Expression) extends UnaryArithmetic
    • case class MaxOf(left: Expression, right: Expression) extends BinaryArithmetic
    • case class MinOf(left: Expression, right: Expression) extends BinaryArithmetic
    • case class Atan2(left: Expression, right: Expression)
    • case class Hypot(left: Expression, right: Expression)
    • case class EqualTo(left: Expression, right: Expression) extends BinaryComparison

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34155 has finished for PR 6587 at commit f1ddbf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I'm a little nervous about this change. Before it, we could create an equal tree node (rather than return the original one) during a tree node transformation, which would be regarded as no change and finally let the batch reach a fixed point. However, we can't do this now, which may make a batch exceed its max iterations and slow down batch execution dramatically.
cc @marmbrus
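
A rough sketch (not the actual RuleExecutor code) of the fixed-point behaviour being discussed: the batch stops once an iteration produces a plan the equality check considers unchanged, so under reference-only equality a rule that keeps rebuilding equal copies never lets the loop converge. plan, rules and maxIterations are assumed inputs.

var current = plan
var iteration = 1
var converged = false
while (!converged && iteration <= maxIterations) {
  val next = rules.foldLeft(current) { (p, rule) => rule(p) }
  converged = next fastEquals current   // an eq-only check would treat an equal-but-new
  current = next                        // copy as a change on every iteration
  iteration += 1
}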

Contributor Author

I also thought about that before I made this change, but I don't think it is a strong enough reason to stop it.

For most cases in the code we return the same references by designing the rules for a TreeNode object well, and if the code keeps creating identical objects (.equals returns true) in its rule on every iteration, even when unnecessary, can that be considered a bug in the user code?

I think it's the responsibility of user code to decide whether a TreeNode object substitution should take place (by creating a new instance), as user code always knows when a substitution is needed, right? That also gives user code more freedom to define .equals() in a semantic way for TreeNode objects.
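
A hedged sketch of the convention described above: a well-designed rule constructs a new node only when it actually substitutes something, and unmatched nodes fall through untouched, so even a reference-equality guard sees "no change". Add, Literal and the rule name are used only for illustration.

import org.apache.spark.sql.catalyst.expressions._

def dropZeroAdd(e: Expression): Expression = e transform {
  case Add(Literal(0, _), right) => right   // a substitution is needed: return a different node
  // no catch-all branch that rebuilds equal copies; everything else passes through unchanged
}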

@marmbrus
Contributor

I am strongly opposed to these changes. In a previous version of Catalyst, the equality function for AttributeReference did not include things like the name and the qualifiers, and it resulted in even more confusing bugs. I think @cloud-fan's performance concerns are also valid.

@chenghao-intel
Contributor Author

@marmbrus thanks for the explanation.
I am not sure what kind of bugs you've seen, but as you know we have lots of places that use Set[Expression] and Map[Expression]; those Scala/Java collections naturally rely on the .hashCode() and equals() methods, not on semanticEquals, and that will definitely cause weird bugs for developers who are new to Catalyst. I believe that will keep happening.

As for the performance concern, I don't think that's an issue. If the user code returns a different instance reference from the transformation rule, why should we call its equals() method to decide whether the substitution takes place? Why not just return the same instance when the user code doesn't want to change anything? One scenario I can imagine is deep-copying a TreeNode for further analysis without changing the original one; since equals() would be consulted in the substitution, I am not sure what the easiest way to do that is.

@chenghao-intel
Contributor Author

It was good to talk with @liancheng offline. I reverted the change to TreeNode.fastEquals and replaced fastEquals with eq in the TreeNode transformation operations.
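
A minimal sketch (not the actual patch) of the guard change described here: the transformation keeps the old node only when the rule returns the exact same instance, i.e. reference equality via eq rather than fastEquals.

import org.apache.spark.sql.catalyst.expressions.Expression

def applyOnce(node: Expression, rule: PartialFunction[Expression, Expression]): Expression = {
  val afterRule = rule.applyOrElse(node, identity[Expression])
  if (node eq afterRule) node else afterRule   // eq: only the same reference counts as "no change"
}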

@SparkQA

SparkQA commented Jun 23, 2015

Test build #35550 has finished for PR 6587 at commit aa136c5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastHint(child: LogicalPlan) extends UnaryNode

@chenghao-intel
Contributor Author

Will rebase after #5780 merged.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35632 has finished for PR 6587 at commit c3952a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35679 has finished for PR 6587 at commit bd73a9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35681 has finished for PR 6587 at commit 71e05ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35686 has finished for PR 6587 at commit bd73a9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35690 has finished for PR 6587 at commit 562acf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

I'm going to again voice my objection here. At its core there is a fundamental problem: we care about two types of equality, structural equality (i.e. all of the fields of the two classes are the same) and reference equality (these two attributes are referring to the same spot in the input tuple).

I believe that it would be confusing to have equals and hashCode refer to anything other than structural equality. We cannot get rid of the name part of attribute references (or ignore it in equality) because we are case preserving even when we are case insensitive. So attributes that have different names are different.

I don't think that it is too big of a burden for developers to watch for these types of equality and make sure they are applied properly when doing code review. I do think that large refactorings like this are likely to introduce regressions.
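
A hedged illustration of the two notions, using the AttributeReference signature quoted later in this thread:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// Two attributes that look alike field-for-field but come from different relations:
val fromLeft  = AttributeReference("id", IntegerType)()   // gets its own exprId
val fromRight = AttributeReference("id", IntegerType)()   // gets a different exprId
fromLeft.exprId == fromRight.exprId   // false: they refer to different spots in the input tuple

// Conversely, a case-preserved rename keeps the exprId but changes the structure:
fromLeft.withName("ID") == fromLeft   // false under structural equality, because the names differ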

@chenghao-intel
Contributor Author

Thank you @marmbrus for your patience in explaining this again and again. :-)
My point is essentially about what an AttributeReference is. The name and qualifiers are most likely accessories used by the analyzer, while the exprId and metadata are used more widely in user code.

I am complaining about some unreasonable parts of the equals and hashCode implementations in AttributeReference: in the structural testing, should we consider the qualifiers as well, given that we already compare the name in the current code? And why not take the name into account in hashCode?

I am just wondering where we are most likely to write code like Set[Expression].contains(expr), and what developers assume in such a code snippet; that is my motivation for ignoring the name in the equality testing.

(Previously, I did change the argument lists as shown below; however, that impacted a lot of existing code, so I changed it back and kept overriding equals and hashCode; otherwise, we could remove them too.)

case class AttributeReference(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    override val metadata: Metadata = Metadata.empty)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifiers: Seq[String] = Nil) extends Attribute

// vs.

case class AttributeReference(
    val exprId: ExprId = NamedExpression.newExprId,
    override val metadata: Metadata = Metadata.empty)(
    val name: String,
    val dataType: DataType,
    val nullable: Boolean = true,
    val qualifiers: Seq[String] = Nil) extends Attribute

So it's not a big burden for developers (they can use semanticEquals), but it is probably very error-prone and inconvenient, particularly in the aggregation optimizations and even in Catalyst extensions (which involve lots of TreeNode object substitution cases).

@marmbrus
Contributor

I think it's useful to have two parameter lists here, as you often only want to match on a subset of the attributes. That equals and hashCode don't care about the qualifiers is a bug and should be fixed.
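
A hedged example of the matching convenience mentioned above: the generated extractor of a case class only covers the first parameter list, so rules can match on name, dataType, nullable and metadata without spelling out exprId and qualifiers. expr is a hypothetical Expression.

import org.apache.spark.sql.catalyst.expressions._

val description = expr match {
  case AttributeReference(name, dataType, _, _) =>   // exprId and qualifiers stay out of the pattern
    s"attribute $name of type $dataType"
  case _ =>
    "not an attribute reference"
}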

@chenghao-intel
Contributor Author

OK. Another potential issue is checking the identity of two logical plans (e.g. in CacheManager?); we would need to update LogicalPlan.sameResult as well.
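
A hedged sketch of the kind of lookup being referred to: CacheManager-style code finds cached data by comparing plans with sameResult rather than ==, so that path would also need to stay consistent with any equality change. lookupCached and cachedPlans are hypothetical names.

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def lookupCached(plan: LogicalPlan, cachedPlans: Seq[LogicalPlan]): Option[LogicalPlan] =
  cachedPlans.find(cached => plan.sameResult(cached))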

@chenghao-intel
Contributor Author

Closing it.
