-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19602][SQL] Support column resolution of fully qualified column name ( 3 part name) #17185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @gatorsmile, @cloud-fan I'd really appreciate your review and comments. Thanks much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is a valid case. We should treat SELECT ... FROM ... as a 2-levels look up: Level 1, in FROM statement, we specify the table name with an optional qualifier(database name). Level 2, in SELECT statement, we specify the column name with an optional qualifier(table name).
I don't think it makes sense to specify the database name for column name, the table lookup should be done in the FROM statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @gatorsmile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but @skambha did a comparison between different DB. Basically, what she tried to do is to make Spark SQL consistent with the others. See the design doc:
https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gatorsmile and @cloud-fan.
Hi @cloud-fan,
The use-case is to support fully qualified column name in queries to remove ambiguity in the naming.
For e.g. select db2.t1.i1 from db1.t1, db2.t1 where db1.t1.i2 = db2.t1.i3
Today Spark does not support this. The workaround is to create alias for the table.
Yes, the lookup for the table is coming from the ‘from’ clause. At a high level, changes include keeping track of the qualifier(including the database name) when the lookup happens and enhancing the column resolution logic in the analyzer to resolve fully qualified column name from a logical plan node’s children.
There are more details in the design doc attached to the SPARK-19602
Thanks.
|
I agree it's a valid use case, do you wanna bring it up to date? sorry for the delay! |
|
sure. Let me look into it. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just use Seq[String]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the style that exists today. If we change here, then there are other places where it needs to change as well. For e.g, in AttributeReference we initialize it with None today and now it will be a Seq.empty. I am not sure if we want to create these objects or just leave it using None by keeping the Option[]
If you have a strong preference, I can update it. I do have the changes locally but have not pushed that version. Thanks.
dc69a85 to
8206dc3
Compare
|
I have rebased and pushed the changes. I ran the unit tests ( sql, catalyst and hive). Please take a look. Thanks. |
|
ok to test |
|
Test build #86899 has finished for PR 17185 at commit
|
|
Any plans of merging this PR? |
|
@mafernandez-stratio We plan to review and merge after 2.3 release. The target release will be 2.4 |
|
@gatorsmile Great, thanks for the feedback! |
|
ok to test |
|
Test build #93810 has finished for PR 17185 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about SELECT db1.t1.* FROM t1 while the current database is db1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (SELECT db1.t1.* FROM t1) would resolve correctly.
Also a similar scenario (details in Section 2 in Table A # 5 in the design doc)
select db1.t1.i1 from t1 will resolve correctly when the current database is db1. This is tested here
I will explicitly add a test case for the db1.t1.* from t1 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make SubqueryAlias take a qualifier: Seq[String] instead of alias: String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, SubqueryAlias takes in a AliasIdentifier and not a string.
Also want to mention that: we considered somewhat similar to your suggestion - to simply add a qualifier as a Seq[String] as a member of SubqueryAlias but that change would require touching more files.
I had implemented that first (long time back now) and then after some offline discussions we favored this as this change is more localized and encapsulated. @gatorsmile @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I feel using Seq[String] directly can simplify a lot of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took care of this change in the recent push. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you briefly explain the new resolution logic? I feel it's a little convoluted now as we are more likely to be ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resolution logic needs to account to compare the full qualifier. If you are trying to resolve db1.t1.i1, the match needs to happen against the entire qualifier of the attribute.
For e.g:
select db1.t1.i1 from t1 // current database is db1 and lets say t1 has columns i1, i2. When resolving db1.t1.i1, against the following set of Input attributes.
we will have the following attribute references:
- AttributeReference ( qualifier Seq(db1, t1) and name is i1)
- AttributeReference ( qualifier Seq(db1, t1) and name is i2)
So the new resolution logic will match the entire qualifier sequence and then match the column ( ie the attribute name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level the changes include:
- The column resolution needs to account for the qualifier that is now a sequence of string.
- Enhance the lookupRelation to add the qualifier appropriately. Right now, the qualifier used by the LogicalPlan nodes representing the relation leafnodes use only the table name as the qualifier when no alias is provided. This needs to be updated to add the database name and the table name for permanent views and tables and global temporary views.
Full details of the design and the resolution logic is in the doc here in Section 3.3 and 3.4.
So, Section 2 in the design doc has a lot of scenarios that were tested against different db’s and Spark and the expected behavior with the design here.
If there is a specific scenario you have in mind that would be valid but get flagged as ambiguous, can you please let me know. I’d be glad to look into it.
|
@gatorsmile , @cloud-fan, just a quick comment, I have been working on this and will respond soon. |
|
I rebased and found out that the resolution code in Logical plan has changed and it uses map lookup to do the matching. I have some ideas on how to incorporate the 3 part name with the map lookup logic. For now, I have synced up and bypassed the new map logic and pushed the code so it is up to the latest so I can get the test cycle from GitHub and added todo’s for myself to incorporate the optimized lookup logic. @cloud-fan, In the meantime, if you have any other comments please let me know. Thanks. |
|
Test build #93904 has finished for PR 17185 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we assume the attribute always carry the full qualifier? and i.e. the analyzer will always include the current database in the qualifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The database information gets carried over based on the lookupRelation logic. Details are in Section 3.3.3 in the design doc here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's weird to use pattern match like this, maybe better to write
nameParts.corresponds(qualifierList)(resolver) || {
if (nameParts.length == 1 && qualifierList.nonEmpty) {
resolver(nameParts.head, qualifierList.last)
} else {
false
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC users can do select structCol.* from tbl, does it mean the nameParts.head may not be table name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- a seq of 2 elements: database name and table name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
overall LGTM, my major concern is how to do O(1) lookup for the 3 part name |
|
Thanks @cloud-fan for the review. I am working on implementing an idea to get optimized lookup with 3part name. |
|
retest this please |
|
retest this please |
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to consider if qualifier is Nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will add a defensive check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we may need to check the SQL standard about how to resolve the ambiguity. For a.b.c, it's possible that a is db name, b is table name, c is column name. If there is no column named c, we should fail. It's also possible that a is table name, b is column name, c is nested field name, and it exists. What's the expected behavior here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, if a.b.c fails to resolve as db.table.column, then we check if there is a table and column that matches a.b and then see if c is a nested field name and if it exists, it will resolve to the nested field.
Tests with struct nested fields are here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this behavior defined by SQL standard? IIRC the current behavior without your patch is
col.innerField1.innerField2 can fail even we have a table named col and a column named innerField1 which has a innerField2 field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it won't fail without this patch too, the col.innerField1.innerField2 will resolve correctly to the struct field if there is a table named col and column named innerField1 which has a innerField2 field.
Please see an example of such a scenario in this test output on master spark:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/columnresolution.sql.out#L360
Let me look into the sql standard and get back on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm talking about is ambiguity. col.innerField1.innerField2 can fail if a table has a col column and the col column has a innerField1 field but the innerField2 doesn't exist. My question is: shall we try all the possible resolution paths and pick the valid one, or define a rule that we can decide the resolution path ahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan , Thank you for your suggestion and question.
Existing spark behavior follows precedence rules in the column resolution logic and in this patch we are following the same pattern/rule.
I am looking into the SQL standard to see if there are any column resolution rules but I have not found any yet. However when I researched existing databases, I observed different behaviors among them and it is listed in Section 2/Table A in the design doc here.
I agree, we can improve upon the checks in existing precedence to go all the way to ensure there is a nested field. Although, the user can always qualify the field to resolve the ambiguity. Shall we open another issue to discuss and improve upon the existing resolution logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of creating AliasIdentifier manually, we can also define a SubqueryAlias.apply which takes db name and table name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me do that if this style is preferred. Thanks.
|
Test build #94206 has finished for PR 17185 at commit
|
Changes to resolve fully qualified column attribute references from a logical plan node children
…rt name and remove my todos
|
I have addressed the review comments in this commit here @cloud-fan, please take a look. Thanks. |
|
Test build #94234 has finished for PR 17185 at commit
|
|
retest this please |
|
Test build #94244 has finished for PR 17185 at commit
|
|
|
||
| override def sql: String = { | ||
| val qualifierPrefix = qualifier.map(_ + ".").getOrElse("") | ||
| val qualifierPrefix = if (qualifier.nonEmpty) qualifier.mkString(".") + "." else "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: val qualifierPrefix = qualifier.mkString("", ".", ".")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work for the case when we have Seq.empty. The suffix "." gets returned even for a empty sequence.
For a non empty Seq, the above call will be fine.
Shall we leave the 'if' as is or is there an equivalent preferred style that would work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah my bad, I thought it would return empty string for empty seq. Let's leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, sounds good.
| val key = (qualifier.toLowerCase(Locale.ROOT), name.toLowerCase(Locale.ROOT)) | ||
| val attributes = collectMatches(name, qualified.get(key)).filter { a => | ||
| resolver(qualifier, a.qualifier.get) | ||
| matches = matches match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
if (matches._1.isEmpty) {
matches = ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
|
||
| def apply( | ||
| identifier: String, | ||
| database: Option[String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why it's Option? if no database name, we should call the apply method above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! I'll take care of this in the next push.
|
LGTM except some code style comment, thanks for working on it! |
|
Thanks for the review. I have addressed your comments and pushed the changes. |
|
Test build #94335 has finished for PR 17185 at commit
|
|
retest this please |
|
Test build #94353 has finished for PR 17185 at commit
|
|
thanks, merging to master! |
|
@skambha @dilipbiswal How about the hint resolution after supporting multi part names? |
|
Let us discuss it in the JIRA https://issues.apache.org/jira/browse/SPARK-25121 |
|
Good idea @gatorsmile . Thanks !! |
What changes were proposed in this pull request?
The design details is attached to the JIRA issue here
High level overview of the changes are:
How was this patch tested?