[SPARK-7269] [SQL] [WIP] Refactor the class AttributeReference #6587
|
Test build #33988 has finished for PR 6587 at commit
|
|
Test build #33991 has finished for PR 6587 at commit
|
|
Test build #34043 has finished for PR 6587 at commit
|
|
I agree this change can make things simpler, and actually I have tried to do it before... But think about |
|
Test build #34056 has finished for PR 6587 at commit
|
|
Test build #34070 has finished for PR 6587 at commit
|
|
Test build #34083 has finished for PR 6587 at commit
|
|
@cloud-fan I've updated the |
|
Test build #34150 has finished for PR 6587 at commit
|
|
Test build #34155 has finished for PR 6587 at commit
|
I'm a little nervous about this change. Before it, a transformation could create an equal tree node (rather than returning the original one), which was treated as no change and eventually let the batch reach a fixed point. Now we can't do this, so a batch may exceed its max iterations, which will slow down batch execution dramatically.
cc @marmbrus
I thought about that too before making this change, but I don't think it is a strong enough reason to stop it.
In most of the code we already return the same reference by designing the rules for a TreeNode object carefully. If a rule keeps creating identical objects (.equals returns true) on every iteration, even when unnecessary, can't that be considered a bug in the user code?
I think it's the responsibility of user code to decide whether a TreeNode object should be substituted (by creating a new instance), since user code always knows when a substitution is needed, right? That also gives user code more freedom to define .equals() semantically for TreeNode objects.
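The trade-off raised in this exchange can be illustrated with a minimal sketch. It assumes a hypothetical miniature tree (`Expr`, `Lit`, `Add`) and a hypothetical fixed-point loop (`iterations`); none of these are Catalyst's actual classes. The rule `rebuild` always creates new but equal nodes, so the loop stops immediately when it compares with `==` and runs to the iteration limit when it compares references with `eq`.

```scala
object FixedPointSketch {
  // Hypothetical miniature expression tree, not Catalyst's TreeNode.
  sealed trait Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // A rule that always rebuilds every node, even when nothing changes.
  def rebuild(e: Expr): Expr = e match {
    case Add(l, r) => Add(rebuild(l), rebuild(r))
    case Lit(v)    => Lit(v)
  }

  // Applies `rule` until `same(before, after)` holds or maxIterations is hit.
  // Returns the number of rule applications performed.
  def iterations(start: Expr, rule: Expr => Expr, maxIterations: Int)(
      same: (Expr, Expr) => Boolean): Int = {
    var current = start
    var count = 0
    var done = false
    while (!done && count < maxIterations) {
      val next = rule(current)
      count += 1
      done = same(current, next)
      current = next
    }
    count
  }

  val plan: Expr = Add(Lit(1), Lit(2))

  // Structural comparison: the rebuilt tree equals the old one, so one pass suffices.
  val withEquals: Int = iterations(plan, rebuild, 100)(_ == _)

  // Reference comparison: every pass yields a new instance, so the limit is reached.
  val withReference: Int = iterations(plan, rebuild, 100)(_ eq _)
}
```

Under these assumptions, a rule that needlessly rebuilds nodes is harmless with structural comparison and costly with reference comparison, which is the concern voiced above.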
|
I am strongly opposed to these changes. In a previous version of catalyst we had the equality function for |
|
@marmbrus thanks for the explanation. As for the performance concern, I don't think that's an issue! If the user code returns a different instance reference from the transformation rule, why should we call its |
|
I was glad to talk with @liancheng offline. I reverted the change for |
|
Test build #35550 has finished for PR 6587 at commit
|
|
Will rebase after #5780 is merged. |
|
Test build #35632 has finished for PR 6587 at commit
|
|
Test build #35679 has finished for PR 6587 at commit
|
|
Test build #35681 has finished for PR 6587 at commit
|
|
Test build #35686 has finished for PR 6587 at commit
|
|
Test build #35690 has finished for PR 6587 at commit
|
|
I'm going to again voice my objection here. At the core there is a fundamental problem: we have two types of equality that we care about. Structural equality (i.e. all of the fields of the two classes are the same) and reference equality (these two attributes are referring to the same spot in the input tuple). I believe that it would be confusing to have equals and hash code refer to anything other than structural equality. We cannot get rid of the name part of attribute references (or ignore it in equality) because we are case preserving even when we are case insensitive. So attributes that have different names are different. I don't think that it is too big of a burden for developers to watch for these types of equality and make sure they are applied properly when doing code review. I do think that large refactorings like this are likely to introduce regressions. |
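The two notions of equality described in this comment can be sketched as follows. `Attr` and `sameRef` are hypothetical names chosen for illustration and are not Spark's `AttributeReference` API; the sketch only assumes that an attribute carries a case-preserved name and an expression id.

```scala
object TwoEqualities {
  // Hypothetical simplified attribute; not Spark's AttributeReference.
  case class Attr(name: String, exprId: Long) {
    // "Reference equality" in the comment's sense: both attributes
    // point at the same spot in the input tuple.
    def sameRef(other: Attr): Boolean = exprId == other.exprId
  }

  // Same underlying column, but the names differ in case.
  val lower = Attr("id", 1L)
  val upper = Attr("ID", 1L)

  // Structural equality compares every field, so the names make these differ.
  val structurallyEqual: Boolean = lower == upper

  // Reference equality ignores the name and compares only the id.
  val sameColumn: Boolean = lower.sameRef(upper)
}
```

Keeping `equals` structural and exposing the second check as a separately named method is one way to make the caller's intent explicit at each comparison site.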
|
Thank you @marmbrus for your patience in explaining this again and again. :-) I am complaining about some of the unreasonable implementation in the method I am just wondering where we will most likely write code like (Previously, I did change the argument lists as below; however, a lot of existing code was impacted, so I changed it back and am still overriding the method
case class AttributeReference(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    override val metadata: Metadata = Metadata.empty)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifiers: Seq[String] = Nil) extends Attribute
// V.S.
case class AttributeReference(
    val exprId: ExprId = NamedExpression.newExprId,
    override val metadata: Metadata = Metadata.empty)(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    val qualifiers: Seq[String] = Nil
) extends Attribute
Thus it's not a big burden for developers (by using the |
|
I think that it's useful to have two parameter lists here, as you often only want to match on a subset of the attributes. That |
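As background for the point about two parameter lists: in Scala, the compiler-generated `equals`, `hashCode`, and pattern-matching extractor of a case class consider only the first parameter list. The sketch below uses a hypothetical class `Ref` (not Spark's `AttributeReference`) to show both effects.

```scala
object ParamListSketch {
  // Hypothetical class with two parameter lists. Only `name` and `nullable`
  // take part in the generated equals, hashCode and pattern matching.
  case class Ref(name: String, nullable: Boolean = true)(val id: Long)

  val a = Ref("x")(1L)
  val b = Ref("x")(2L)

  // The ids differ, yet the generated equals ignores the second list.
  val equalDespiteIds: Boolean = a == b

  // Pattern matching binds only the first parameter list.
  def nameOf(r: Ref): String = r match {
    case Ref(n, _) => n
  }
}
```

A class that needs a second-list field such as `id` to count in comparisons therefore has to override `equals` and `hashCode` itself.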
|
OK, another potential bug is in checking the identity of two logical plans (e.g. in CacheManager?); we need to update the code for |
|
closing it. |
e.g.
Currently we may not be able to make TreeNode objects work with built-in collections as the code above shows, because the equals and hashCode methods of the case class AttributeReference compare objects literally, not semantically.
rule, as it has a fastEquals guard, we lose the guard by checking the instance reference equality only.