
Conversation

@chenghao-intel
Contributor

  • Better integrate TreeNode objects with the language's built-in collection utilities.

e.g.

val map: Map[Expression, Expression] = ...
map.get(expr)
map.contains(expr)

// or
val set: Set[SparkPlan] = ...
if (set.contains(sparkPlan)) {
  ...
}

Currently we may not be able to make TreeNode objects work with the built-in collections as the code above shows, because the equals and hashCode methods of the case class AttributeReference compare objects literally rather than semantically (see the sketch after this list).

  • TreeNode transformation APIs don't work when a rule returns an identical instance, because of the fastEquals guard; we loosen the guard by checking instance reference equality only.
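
A minimal sketch of the first problem, assuming the AttributeReference signature quoted later in this thread (name and dataType in the first parameter list, exprId in the second) and using withName only for illustration:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()   // gets a fresh exprId
val b = a.withName("A")                          // same exprId, different (case-preserved) name

val set: Set[Expression] = Set(a)
set.contains(b)   // false today: equals compares the name literally,
                  // even though a and b refer to the same underlying attribute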

@SparkQA

SparkQA commented Jun 2, 2015

Test build #33988 has finished for PR 6587 at commit 869d7cf.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2015

Test build #33991 has finished for PR 6587 at commit 7aaa837.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34043 has finished for PR 6587 at commit 6b4d353.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I agree this change can make things simpler, and I have actually tried to do it before... But think about AttributeReference.withName: it won't have any effect, as the tree node library will regard it as no change and keep the old tree. If AttributeReference should not care about the name, then we need to figure out why we need AttributeReference.withName.
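
A hedged illustration of that concern (plan is a hypothetical LogicalPlan; withName keeps the exprId but changes the name):

import org.apache.spark.sql.catalyst.expressions.AttributeReference

val renamed = plan transformExpressions {
  case a: AttributeReference if a.name == "tmp" => a.withName("result")
}
// If equals ignored the name, a.withName("result") would be "equal" to a, so the
// fastEquals guard would keep the old tree and the rename would be silently dropped.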

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34056 has finished for PR 6587 at commit 5245542.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34070 has finished for PR 6587 at commit 560da1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34083 has finished for PR 6587 at commit 27ef8f3.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@cloud-fan I've updated TreeNode.fastEquals; it doesn't rely on equals any more, which should solve the problem you are talking about.

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34150 has finished for PR 6587 at commit 35f1892.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ElementwiseProduct(val scalingVec: Vector) extends VectorTransformer
    • trait TypeCheckResult
    • case class TypeCheckFailure(message: String) extends TypeCheckResult
    • abstract class UnaryArithmetic extends UnaryExpression
    • case class UnaryMinus(child: Expression) extends UnaryArithmetic
    • case class Sqrt(child: Expression) extends UnaryArithmetic
    • case class Abs(child: Expression) extends UnaryArithmetic
    • case class BitwiseNot(child: Expression) extends UnaryArithmetic
    • case class MaxOf(left: Expression, right: Expression) extends BinaryArithmetic
    • case class MinOf(left: Expression, right: Expression) extends BinaryArithmetic
    • case class Atan2(left: Expression, right: Expression)
    • case class Hypot(left: Expression, right: Expression)
    • case class EqualTo(left: Expression, right: Expression) extends BinaryComparison

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34155 has finished for PR 6587 at commit f1ddbf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I'm a little nervous about this change. Before it, we could create an equal tree node (rather than return the original one) during a tree node transformation, which would be regarded as no change and finally let the batch reach a fixed point. However, we can't do this now, which may make a batch exceed its max iterations and slow down batch execution dramatically.
cc @marmbrus
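
A rough sketch (not the actual RuleExecutor code) of the fixed-point behaviour being discussed: the batch stops once an iteration produces a plan the equality check considers unchanged, so under reference-only equality a rule that keeps rebuilding equal copies never lets the loop converge. plan, rules and maxIterations are assumed inputs.

var current = plan
var iteration = 1
var converged = false
while (!converged && iteration <= maxIterations) {
  val next = rules.foldLeft(current) { (p, rule) => rule(p) }
  converged = next fastEquals current   // an eq-only check would treat an equal-but-new
  current = next                        // copy as a change on every iteration
  iteration += 1
}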

Contributor Author

I also thought about that before I made this change, but I don't think it is a strong enough reason to stop it.

For most cases in the code we return the same references by designing the rules for a TreeNode object well, and if the code keeps creating identical objects (.equals returns true) in its rule on every iteration, even when unnecessary, can that be considered a bug in the user code?

I think it's the responsibility of user code to decide whether a TreeNode object substitution should take place (by creating a new instance), as user code always knows when a substitution is needed, right? That also gives user code more freedom to define .equals() in a semantic way for TreeNode objects.
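
A hedged sketch of the convention described above: a well-designed rule constructs a new node only when it actually substitutes something, and unmatched nodes fall through untouched, so even a reference-equality guard sees "no change". Add, Literal and the rule name are used only for illustration.

import org.apache.spark.sql.catalyst.expressions._

def dropZeroAdd(e: Expression): Expression = e transform {
  case Add(Literal(0, _), right) => right   // a substitution is needed: return a different node
  // no catch-all branch that rebuilds equal copies; everything else passes through unchanged
}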

@marmbrus
Contributor

I am strongly opposed to these changes. In a previous version of Catalyst, the equality function for AttributeReference did not include things like the name and the qualifiers, and it resulted in even more confusing bugs. I think @cloud-fan's performance concerns are also valid.

@chenghao-intel
Contributor Author

@marmbrus thanks for the explanation.
I am not sure what kind of bugs you've seen, but as you know we have lots of places that use Set[Expression] and Map[Expression]; those Scala/Java collections naturally rely on the .hashCode() and equals() methods, not on semanticEquals, and that will definitely cause weird bugs for developers who are new to Catalyst. I believe that will keep happening.

As for the performance concern, I don't think that's an issue. If the user code returns a different instance reference from the transformation rule, why should we call its equals() method to decide whether the substitution takes place? Why not just return the same instance when the user code doesn't want to change anything? One scenario I can imagine is deep-copying a TreeNode for further analysis without changing the original one; since equals() would be consulted in the substitution, I am not sure what the easiest way to do that is.

@chenghao-intel
Contributor Author

It was good to talk with @liancheng offline. I reverted the change to TreeNode.fastEquals and replaced fastEquals with eq in the TreeNode transformation operations.
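
A minimal sketch (not the actual patch) of the guard change described here: the transformation keeps the old node only when the rule returns the exact same instance, i.e. reference equality via eq rather than fastEquals.

import org.apache.spark.sql.catalyst.expressions.Expression

def applyOnce(node: Expression, rule: PartialFunction[Expression, Expression]): Expression = {
  val afterRule = rule.applyOrElse(node, identity[Expression])
  if (node eq afterRule) node else afterRule   // eq: only the same reference counts as "no change"
}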

@SparkQA

SparkQA commented Jun 23, 2015

Test build #35550 has finished for PR 6587 at commit aa136c5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastHint(child: LogicalPlan) extends UnaryNode

@chenghao-intel
Contributor Author

Will rebase after #5780 merged.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35632 has finished for PR 6587 at commit c3952a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35679 has finished for PR 6587 at commit bd73a9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35681 has finished for PR 6587 at commit 71e05ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35686 has finished for PR 6587 at commit bd73a9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35690 has finished for PR 6587 at commit 562acf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

I'm going to again voice my objection here. At its core there is a fundamental problem: we care about two types of equality, structural equality (i.e. all of the fields of the two classes are the same) and reference equality (these two attributes are referring to the same spot in the input tuple).

I believe that it would be confusing to have equals and hashCode refer to anything other than structural equality. We cannot get rid of the name part of attribute references (or ignore it in equality) because we are case preserving even when we are case insensitive. So attributes that have different names are different.

I don't think that it is too big of a burden for developers to watch for these types of equality and make sure they are applied properly when doing code review. I do think that large refactorings like this are likely to introduce regressions.
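
A hedged illustration of the two notions, using the AttributeReference signature quoted later in this thread:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// Two attributes that look alike field-for-field but come from different relations:
val fromLeft  = AttributeReference("id", IntegerType)()   // gets its own exprId
val fromRight = AttributeReference("id", IntegerType)()   // gets a different exprId
fromLeft.exprId == fromRight.exprId   // false: they refer to different spots in the input tuple

// Conversely, a case-preserved rename keeps the exprId but changes the structure:
fromLeft.withName("ID") == fromLeft   // false under structural equality, because the names differ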

@chenghao-intel
Contributor Author

Thank you @marmbrus for your patience in explaining this again and again. :-)
My point is essentially about what an AttributeReference is. The name and qualifiers are most likely accessories used by the analyzer, while the exprId and metadata are used more widely in user code.

I am complaining about some unreasonable parts of the equals and hashCode implementations in AttributeReference: in the structural testing, should we consider the qualifiers as well, given that we already compare the name in the current code? And why not take the name into account in hashCode?

I am just wondering where we are most likely to write code like Set[Expression].contains(expr), and what developers assume in such a code snippet; that is my motivation for ignoring the name in the equality testing.

(Previously, I did change the argument lists as shown below; however, that impacted a lot of existing code, so I changed it back and kept overriding equals and hashCode; otherwise, we could remove them too.)

case class AttributeReference(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    override val metadata: Metadata = Metadata.empty)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifiers: Seq[String] = Nil) extends Attribute

// vs.

case class AttributeReference(
    val exprId: ExprId = NamedExpression.newExprId,
    override val metadata: Metadata = Metadata.empty)(
    val name: String,
    val dataType: DataType,
    val nullable: Boolean = true,
    val qualifiers: Seq[String] = Nil) extends Attribute

So it's not a big burden for developers (they can use semanticEquals), but it is probably very error-prone and inconvenient, particularly in the aggregation optimizations and even in Catalyst extensions (which involve lots of TreeNode object substitution cases).

@marmbrus
Contributor

I think it's useful to have two parameter lists here, as you often only want to match on a subset of the attributes. That equals and hashCode don't care about the qualifiers is a bug and should be fixed.
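
A hedged example of the matching convenience mentioned above: the generated extractor of a case class only covers the first parameter list, so rules can match on name, dataType, nullable and metadata without spelling out exprId and qualifiers. expr is a hypothetical Expression.

import org.apache.spark.sql.catalyst.expressions._

val description = expr match {
  case AttributeReference(name, dataType, _, _) =>   // exprId and qualifiers stay out of the pattern
    s"attribute $name of type $dataType"
  case _ =>
    "not an attribute reference"
}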

@chenghao-intel
Contributor Author

OK. Another potential issue is checking the identity of two logical plans (e.g. in CacheManager?); we would need to update LogicalPlan.sameResult as well.
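
A hedged sketch of the kind of lookup being referred to: CacheManager-style code finds cached data by comparing plans with sameResult rather than ==, so that path would also need to stay consistent with any equality change. lookupCached and cachedPlans are hypothetical names.

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def lookupCached(plan: LogicalPlan, cachedPlans: Seq[LogicalPlan]): Option[LogicalPlan] =
  cachedPlans.find(cached => plan.sameResult(cached))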

@chenghao-intel
Contributor Author

Closing it.
