Conversation

@skambha
Contributor

@skambha skambha commented Mar 7, 2017

What changes were proposed in this pull request?

The design details are attached to the JIRA issue here.

A high-level overview of the changes:

  • Enhance the qualifier to hold more than one string
  • Add support to store the qualifier. Enhance lookupRelation to keep the qualifier appropriately.
  • Enhance the table-matching column resolution algorithm to account for the qualifier being more than a single string.
  • Enhance the table-matching algorithm in UnresolvedStar.expand
  • Ensure that we continue to support select t1.i1 from db1.t1
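The first bullet can be illustrated with a minimal sketch (hypothetical types, not the actual Spark internals): the qualifier goes from an optional single string to a sequence of name parts.

```scala
// Hypothetical before/after shapes for an attribute's qualifier.
case class AttrBefore(name: String, qualifier: Option[String]) // e.g. Some("t1")
case class AttrAfter(name: String, qualifier: Seq[String])     // e.g. Seq("db1", "t1")

val before = AttrBefore("i1", Some("t1"))
val after = AttrAfter("i1", Seq("db1", "t1"))

// A bare table-name reference like t1.i1 still works: it matches the last qualifier part.
assert(after.qualifier.lastOption == before.qualifier)
```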

How was this patch tested?

  • New tests are added.
  • Several test scenarios were added in a separate test PR 17067. The tests that were not supported earlier are marked with TODO markers; those are now supported with the code changes here.
  • Existing unit tests (hive, catalyst, and sql) were run successfully.

@skambha
Contributor Author

skambha commented Mar 7, 2017

cc @gatorsmile, @cloud-fan I'd really appreciate your review and comments. Thanks much.

Contributor

I'm not sure if this is a valid case. We should treat SELECT ... FROM ... as a two-level lookup. Level 1: in the FROM clause, we specify the table name with an optional qualifier (database name). Level 2: in the SELECT clause, we specify the column name with an optional qualifier (table name).

I don't think it makes sense to specify the database name for a column name; the table lookup should be done in the FROM clause.


Member

Yes, but @skambha did a comparison across different databases. Basically, what she tried to do is make Spark SQL consistent with the others. See the design doc:
https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf

Contributor Author

Thanks @gatorsmile and @cloud-fan.

Hi @cloud-fan,
The use case is to support fully qualified column names in queries to remove ambiguity in naming.

For example: select db2.t1.i1 from db1.t1, db2.t1 where db1.t1.i2 = db2.t1.i3

Today Spark does not support this. The workaround is to create an alias for the table.

Yes, the lookup for the table comes from the 'from' clause. At a high level, the changes include keeping track of the qualifier (including the database name) when the lookup happens, and enhancing the column resolution logic in the analyzer to resolve a fully qualified column name from a logical plan node's children.
There are more details in the design doc attached to SPARK-19602.

Thanks.
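The example above can be sketched with illustrative types (not the actual analyzer code): two tables named t1 in different databases, disambiguated by the full qualifier carried on each attribute.

```scala
case class Attr(name: String, qualifier: Seq[String])

// Input attributes from db1.t1 and db2.t1 (both tables have a column i1).
val attrs = Seq(Attr("i1", Seq("db1", "t1")), Attr("i1", Seq("db2", "t1")))

// Resolve a fully qualified name like db2.t1.i1 by matching qualifier and column name.
def resolve(parts: Seq[String], attrs: Seq[Attr]): Seq[Attr] =
  attrs.filter(a => a.qualifier == parts.init && a.name == parts.last)

assert(resolve(Seq("db2", "t1", "i1"), attrs) == Seq(Attr("i1", Seq("db2", "t1"))))
assert(resolve(Seq("t1", "i1"), attrs).isEmpty) // does not match the full qualifier in this sketch
```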

@cloud-fan
Contributor

I agree it's a valid use case. Do you wanna bring it up to date? Sorry for the delay!

@skambha
Contributor Author

skambha commented Jan 22, 2018

Sure. Let me look into it. Thanks.

Contributor

I think we can just use Seq[String]

Contributor Author

I followed the style that exists today. If we change it here, there are other places where it needs to change as well. For example, in AttributeReference we initialize it with None today, and it would become Seq.empty. I am not sure if we want to create these objects or just leave it as None by keeping the Option[].

If you have a strong preference, I can update it. I do have the changes locally but have not pushed that version. Thanks.

@skambha
Contributor Author

skambha commented Feb 1, 2018

I have rebased and pushed the changes. I ran the unit tests (sql, catalyst, and hive).
Earlier, I was having issues running the hive test suite locally, but that is resolved with the fix from SPARK-23275.

Please take a look. Thanks.

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Feb 1, 2018

Test build #86899 has finished for PR 17185 at commit 8206dc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mafernandez-stratio

Any plans of merging this PR?

@gatorsmile
Member

@mafernandez-stratio We plan to review and merge after 2.3 release. The target release will be 2.4

@mafernandez-stratio

@gatorsmile Great, thanks for the feedback!

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Jul 31, 2018

Test build #93810 has finished for PR 17185 at commit 8206dc3.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Contributor

what about SELECT db1.t1.* FROM t1 while the current database is db1?

Contributor Author

@skambha skambha Aug 1, 2018

This (SELECT db1.t1.* FROM t1) would resolve correctly.

A similar scenario (details in Section 2, Table A, #5 in the design doc):
select db1.t1.i1 from t1 will resolve correctly when the current database is db1. This is tested here.

I will explicitly add a test case for db1.t1.* from t1 as well.
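The star-expansion matching discussed here can be sketched as follows (illustrative names only; the real logic lives in UnresolvedStar.expand):

```scala
case class Attr(name: String, qualifier: Seq[String])

// Attributes of t1, with the current database db1 recorded in the qualifier.
val t1Attrs = Seq(Attr("i1", Seq("db1", "t1")), Attr("i2", Seq("db1", "t1")))

// db1.t1.* and t1.* should both expand to all of t1's attributes.
def expandStar(target: Seq[String], attrs: Seq[Attr]): Seq[Attr] =
  attrs.filter(a => a.qualifier == target || a.qualifier.endsWith(target))

assert(expandStar(Seq("db1", "t1"), t1Attrs).map(_.name) == Seq("i1", "i2"))
assert(expandStar(Seq("t1"), t1Attrs).map(_.name) == Seq("i1", "i2"))
```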

Contributor

I think we can make SubqueryAlias take a qualifier: Seq[String] instead of alias: String.

Contributor Author

Here, SubqueryAlias takes an AliasIdentifier and not a string.

Also, we considered something similar to your suggestion: simply adding a qualifier as a Seq[String] member of SubqueryAlias, but that change would require touching more files.

I had implemented that first (a long time back now), and after some offline discussions we favored this approach, as the change is more localized and encapsulated. @gatorsmile @cloud-fan

Contributor

Again, I feel using Seq[String] directly can simplify a lot of code.

Contributor Author

I'll look into this.

Contributor Author

I took care of this change in the recent push. Thanks.

Contributor

can you briefly explain the new resolution logic? I feel it's a little convoluted now, as we are more likely to hit ambiguity.

Contributor Author

The resolution logic needs to compare the full qualifier. If you are trying to resolve db1.t1.i1, the match needs to happen against the entire qualifier of the attribute.

For example:

 select db1.t1.i1 from t1   // current database is db1; let's say t1 has columns i1 and i2

When resolving db1.t1.i1 against the input attributes, we will have the following attribute references:

  • AttributeReference (qualifier Seq(db1, t1), name i1)
  • AttributeReference (qualifier Seq(db1, t1), name i2)

So the new resolution logic first matches the entire qualifier sequence and then matches the column (i.e., the attribute name).
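A minimal sketch of that two-step match, assuming a case-insensitive resolver (illustrative, not the actual analyzer code):

```scala
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

// Match db1.t1.i1: the leading parts must match the whole qualifier, the last part the name.
def matchesAttr(nameParts: Seq[String], qualifier: Seq[String], attrName: String): Boolean =
  nameParts.length == qualifier.length + 1 &&
    nameParts.init.corresponds(qualifier)(resolver) &&
    resolver(nameParts.last, attrName)

assert(matchesAttr(Seq("db1", "t1", "i1"), Seq("db1", "t1"), "i1"))
assert(matchesAttr(Seq("DB1", "T1", "I1"), Seq("db1", "t1"), "i1")) // case-insensitive
assert(!matchesAttr(Seq("db2", "t1", "i1"), Seq("db1", "t1"), "i1"))
```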

Contributor Author

At a high level, the changes include:

  • The column resolution needs to account for the qualifier now being a sequence of strings.
  • Enhance lookupRelation to add the qualifier appropriately. Right now, the LogicalPlan leaf nodes representing a relation use only the table name as the qualifier when no alias is provided. This needs to be updated to add the database name and the table name for permanent views, tables, and global temporary views.

Full details of the design and the resolution logic are in the doc, in Sections 3.3 and 3.4.

Section 2 in the design doc has a lot of scenarios that were tested against different databases and Spark, along with the expected behavior under this design.

If there is a specific scenario you have in mind that would be valid but get flagged as ambiguous, please let me know. I'd be glad to look into it.
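The lookupRelation change described above can be sketched as (hypothetical function, not the actual API):

```scala
// When no alias is given, the relation's qualifier carries the database name
// and the table name; an explicit alias replaces both.
def relationQualifier(db: String, table: String, alias: Option[String]): Seq[String] =
  alias.map(a => Seq(a)).getOrElse(Seq(db, table))

assert(relationQualifier("db1", "t1", None) == Seq("db1", "t1"))
assert(relationQualifier("db1", "t1", Some("x")) == Seq("x"))
```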

@skambha
Contributor Author

skambha commented Aug 1, 2018

@gatorsmile , @cloud-fan, just a quick comment, I have been working on this and will respond soon.

@skambha
Contributor Author

skambha commented Aug 1, 2018

I rebased and found that the resolution code in LogicalPlan has changed; it now uses a map lookup to do the matching. I have some ideas on how to incorporate the 3-part name with the map lookup logic.

For now, I have synced up, bypassed the new map logic, and pushed the code so it is up to date and I can get the test cycle from GitHub. I added TODOs for myself to incorporate the optimized lookup logic.

@cloud-fan, in the meantime, if you have any other comments please let me know. Thanks.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #93904 has finished for PR 17185 at commit 05d6c0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

do we assume the attribute always carries the full qualifier? i.e., will the analyzer always include the current database in the qualifier?

Contributor Author

The database information gets carried over based on the lookupRelation logic. Details are in Section 3.3.3 in the design doc here

Contributor

it's weird to use a pattern match like this; maybe better to write:

nameParts.corresponds(qualifierList)(resolver) || {
  if (nameParts.length == 1 && qualifierList.nonEmpty) {
    resolver(nameParts.head, qualifierList.last)
  } else {
    false
  }
}
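For illustration, the suggested check behaves as follows (resolver assumed case-insensitive; the wrapper name is hypothetical):

```scala
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

def matches(nameParts: Seq[String], qualifierList: Seq[String]): Boolean =
  nameParts.corresponds(qualifierList)(resolver) || {
    if (nameParts.length == 1 && qualifierList.nonEmpty) {
      resolver(nameParts.head, qualifierList.last)
    } else {
      false
    }
  }

assert(matches(Seq("db1", "t1"), Seq("db1", "t1"))) // full qualifier match
assert(matches(Seq("t1"), Seq("db1", "t1")))        // bare name matches the last part
assert(!matches(Seq("t2"), Seq("db1", "t1")))
```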

Contributor Author

done. Thanks.

Contributor

IIRC users can do select structCol.* from tbl; does it mean nameParts.head may not be a table name?

Contributor

  1. a seq of 2 elements: database name and table name.

Contributor Author

done.

@cloud-fan
Contributor

overall LGTM, my major concern is how to do an O(1) lookup for the 3-part name

@skambha
Contributor Author

skambha commented Aug 2, 2018

Thanks @cloud-fan for the review.

I am working on implementing an idea to get an optimized lookup with the 3-part name.
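One possible shape for such an optimized lookup (purely illustrative, assuming a map keyed by the lower-cased qualifier and column name, not the actual code):

```scala
import java.util.Locale

case class Attr(name: String, qualifier: Seq[String])
val attrs = Seq(Attr("i1", Seq("db1", "t1")), Attr("i2", Seq("db1", "t1")))

// Pre-group attributes by (lower-cased qualifier, lower-cased name) for O(1) access.
val qualified: Map[(String, String), Seq[Attr]] = attrs.groupBy { a =>
  (a.qualifier.map(_.toLowerCase(Locale.ROOT)).mkString("."), a.name.toLowerCase(Locale.ROOT))
}

def lookup(nameParts: Seq[String]): Seq[Attr] = {
  val key = (nameParts.init.map(_.toLowerCase(Locale.ROOT)).mkString("."),
    nameParts.last.toLowerCase(Locale.ROOT))
  qualified.getOrElse(key, Nil)
}

assert(lookup(Seq("db1", "t1", "i1")).map(_.name) == Seq("i1"))
assert(lookup(Seq("db2", "t1", "i1")).isEmpty)
```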

@skambha
Contributor Author

skambha commented Aug 3, 2018

retest this please

@gatorsmile
Member

retest this please

Contributor

@cloud-fan cloud-fan left a comment

mostly LGTM

Contributor

do we need to consider if qualifier is Nil?

Contributor Author

Sure. Will add a defensive check.

Contributor

we may need to check the SQL standard about how to resolve the ambiguity. For a.b.c, it's possible that a is a db name, b is a table name, and c is a column name. If there is no column named c, we should fail. It's also possible that a is a table name, b is a column name, and c is a nested field name that exists. What's the expected behavior here?

Contributor Author

In this case, if a.b.c fails to resolve as db.table.column, then we check if there is a table and column that matches a.b, and then see if c is a nested field name; if it exists, it resolves to the nested field.

Tests with struct nested fields are here
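The precedence described can be sketched with illustrative types (not the actual resolver): try db.table.column first, then fall back to table.column with the last part as a nested field.

```scala
case class Column(name: String, nestedFields: Set[String])
case class Table(db: String, name: String, columns: Seq[Column])

def resolve(parts: Seq[String], tables: Seq[Table]): Option[String] = parts match {
  case Seq(a, b, c) =>
    // First preference: a.b.c as database.table.column.
    tables.collectFirst {
      case t if t.db == a && t.name == b && t.columns.exists(_.name == c) =>
        s"column $c of $a.$b"
    }.orElse(
      // Fallback: a.b as table.column, with c as a nested field of that column.
      tables.collectFirst {
        case t if t.name == a && t.columns.exists(col => col.name == b && col.nestedFields(c)) =>
          s"nested field $c of column $b of table $a"
      })
  case _ => None
}

val tables = Seq(
  Table("db1", "t1", Seq(Column("i1", Set.empty))),
  Table("default", "col", Seq(Column("innerField1", Set("innerField2")))))

assert(resolve(Seq("db1", "t1", "i1"), tables) == Some("column i1 of db1.t1"))
assert(resolve(Seq("col", "innerField1", "innerField2"), tables)
  == Some("nested field innerField2 of column innerField1 of table col"))
```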

Contributor

Is this behavior defined by the SQL standard? IIRC the current behavior without your patch is that col.innerField1.innerField2 can fail even if we have a table named col and a column named innerField1 which has an innerField2 field.

Contributor Author

@skambha skambha Aug 5, 2018

Actually it won't fail without this patch either: col.innerField1.innerField2 will resolve correctly to the struct field if there is a table named col and a column named innerField1 which has an innerField2 field.

Please see an example of such a scenario in this test output on master spark:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/columnresolution.sql.out#L360

Let me look into the sql standard and get back on that.

Contributor

@cloud-fan cloud-fan Aug 6, 2018

What I'm talking about is ambiguity. col.innerField1.innerField2 can fail if a table has a col column, and the col column has an innerField1 field, but innerField2 doesn't exist. My question is: shall we try all the possible resolution paths and pick the valid one, or define a rule so we can decide the resolution path ahead of time?

Contributor Author

@cloud-fan , Thank you for your suggestion and question.

Existing Spark behavior follows precedence rules in the column resolution logic, and in this patch we are following the same pattern/rule.

I am looking into the SQL standard to see if there are any column resolution rules, but I have not found any yet. However, when I researched existing databases, I observed different behaviors among them; they are listed in Section 2, Table A in the design doc here.

I agree we can improve upon the checks in the existing precedence to go all the way and ensure there is a nested field, although the user can always qualify the field to resolve the ambiguity. Shall we open another issue to discuss and improve upon the existing resolution logic?

Contributor

instead of creating an AliasIdentifier manually, we can also define a SubqueryAlias.apply which takes a db name and a table name.

Contributor Author

Sure, let me do that if this style is preferred. Thanks.
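A sketch of such a convenience apply, using simplified stand-in types (the real SubqueryAlias also takes a child plan):

```scala
case class AliasIdentifier(identifier: String, database: Option[String] = None)
case class SubqueryAlias(name: AliasIdentifier)

object SubqueryAlias {
  // Convenience constructor taking the table name and database name directly.
  def apply(identifier: String, database: String): SubqueryAlias =
    SubqueryAlias(AliasIdentifier(identifier, Some(database)))
}

assert(SubqueryAlias("t1", "db1").name == AliasIdentifier("t1", Some("db1")))
```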

@SparkQA

SparkQA commented Aug 4, 2018

Test build #94206 has finished for PR 17185 at commit 90cd6d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skambha
Contributor Author

skambha commented Aug 5, 2018

I have addressed the review comments in this commit here

@cloud-fan, please take a look. Thanks.

@SparkQA

SparkQA commented Aug 5, 2018

Test build #94234 has finished for PR 17185 at commit 065687f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@SparkQA

SparkQA commented Aug 5, 2018

Test build #94244 has finished for PR 17185 at commit 065687f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def sql: String = {
-   val qualifierPrefix = qualifier.map(_ + ".").getOrElse("")
+   val qualifierPrefix = if (qualifier.nonEmpty) qualifier.mkString(".") + "." else ""
Contributor

nit: val qualifierPrefix = qualifier.mkString("", ".", ".")

Contributor Author

This won't work for the case when we have Seq.empty; the suffix "." gets returned even for an empty sequence.
For a non-empty Seq, the above call would be fine.

Shall we leave the 'if' as is, or is there an equivalent preferred style that would work?

Contributor

ah my bad, I thought it would return empty string for empty seq. Let's leave it.
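A quick check of the behavior discussed (plain Scala, no Spark required):

```scala
// mkString(start, sep, end) emits start and end even for an empty sequence,
// so the explicit if is indeed needed to avoid a stray trailing dot.
assert(Seq("db1", "t1").mkString("", ".", ".") == "db1.t1.")
assert(Seq.empty[String].mkString("", ".", ".") == ".")

def qualifierPrefix(qualifier: Seq[String]): String =
  if (qualifier.nonEmpty) qualifier.mkString(".") + "." else ""

assert(qualifierPrefix(Seq("db1", "t1")) == "db1.t1.")
assert(qualifierPrefix(Nil) == "")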

Contributor Author

ok, sounds good.

val key = (qualifier.toLowerCase(Locale.ROOT), name.toLowerCase(Locale.ROOT))
val attributes = collectMatches(name, qualified.get(key)).filter { a =>
  resolver(qualifier, a.qualifier.get)
}
matches = matches match {
Contributor

nit:

if (matches._1.isEmpty) {
  matches = ...
}

Contributor Author

done.


def apply(
    identifier: String,
    database: Option[String],
Contributor

why is it an Option? If there is no database name, we should call the apply method above.

Contributor Author

good point! I'll take care of this in the next push.

@cloud-fan
Contributor

LGTM except some code style comment, thanks for working on it!

@skambha
Contributor Author

skambha commented Aug 7, 2018

Thanks for the review. I have addressed your comments and pushed the changes.
@cloud-fan, Please take a look.

@SparkQA

SparkQA commented Aug 7, 2018

Test build #94335 has finished for PR 17185 at commit 5f7e5d7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Aug 7, 2018

Test build #94353 has finished for PR 17185 at commit 5f7e5d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in b4bf8be Aug 7, 2018
@gatorsmile
Member

@skambha @dilipbiswal How about the hint resolution after supporting multi-part names?

@gatorsmile
Member

Let us discuss it in the JIRA https://issues.apache.org/jira/browse/SPARK-25121

@dilipbiswal
Contributor

Good idea @gatorsmile . Thanks !!
