[SPARK-6145] [SQL] Fix the bug of nested data type resolving in ORDER BY #4892
|
Test build #28268 has started for PR 4892 at commit
|
|
cc @marmbrus @cloud-fan This is a quick fix; in the long term, we should resolve the nested attribute sequence in one pass. What do you think? |
val i1 = "a"
val i2 = "b"
val rest: Seq[String] = Nil
println(i1 + "." + i2 + rest.mkString(".", ".", ""))
This outputs a.b., but what we expect is a.b
It was my mistake... I didn't test the mkString method on Nil.
How about:
UnresolvedAttribute(i1 + "." + i2 + rest.mkString("."))
val i1 = "a"
val i2 = "b"
val rest: Seq[String] = "c" :: Nil
println(i1 + "." + i2 + rest.mkString("."))
This outputs a.bc
That makes sense, thanks!
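The exchange above comes down to how `mkString` behaves when `rest` is empty versus non-empty. A minimal sketch in plain Scala (no Spark dependency) that reproduces both variants from the thread and compares them with joining all parts in a single call. The object and method names (`DottedName`, `withPrefix`, `withoutPrefix`, `join`) are hypothetical and are not taken from the PR:

```scala
object DottedName {
  // Variant from the patch: the "." prefix is emitted even when rest is empty.
  def withPrefix(i1: String, i2: String, rest: Seq[String]): String =
    i1 + "." + i2 + rest.mkString(".", ".", "")

  // Variant suggested in review: no separator between i2 and the first element of rest.
  def withoutPrefix(i1: String, i2: String, rest: Seq[String]): String =
    i1 + "." + i2 + rest.mkString(".")

  // Hypothetical alternative: prepend both identifiers to rest and join once.
  def join(i1: String, i2: String, rest: Seq[String]): String =
    (i1 +: i2 +: rest).mkString(".")

  def main(args: Array[String]): Unit = {
    println(withPrefix("a", "b", Nil))         // a.b.
    println(withoutPrefix("a", "b", Seq("c"))) // a.bc
    println(join("a", "b", Nil))               // a.b
    println(join("a", "b", Seq("c", "d")))     // a.b.c.d
  }
}
```

Joining once avoids special-casing the empty `rest`, since `mkString` with a separator only places it between elements.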
|
Test build #28268 has finished for PR 4892 at commit
|
|
Test PASSed. |
|
Hmm, this solves some problems, but not all of them:
sqlContext.jsonRDD(sc.parallelize("""{"a": {"a": {"a": 1}}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")
sqlContext.sql("SELECT a.a.a FROM nestedOrder ORDER BY a.a.a")
org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type LongType;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.resolveGetField(Analyzer.scala:307)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:271)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) |
|
Test build #28273 has started for PR 4892 at commit
|
|
Test build #28273 has finished for PR 4892 at commit
|
|
Test FAILed. |
|
Test build #28276 has started for PR 4892 at commit
|
|
Thank you @marmbrus for the review. We do need to specify an alias for the nested attribute expression, but it should not be the last attribute name. The code is updated; let's see if it breaks anything. |
Moved this function into the LogicalPlan
|
Test build #28276 has finished for PR 4892 at commit
|
|
Test FAILed. |
|
Test build #28279 has started for PR 4892 at commit
|
|
Test build #28279 has finished for PR 4892 at commit
|
|
Test FAILed. |
|
Hi @marmbrus @chenghao-intel , I have a simpler fix here |
|
Test build #28284 has started for PR 4892 at commit
|
The Python test failure is caused by replacing aliasName with name here. Is that okay? SELECT a.b.c FROM table would now get an attribute named a.b.c, instead of c as before.
Thank you @viirya, I've updated the Python code.
I meant: should it be that? In Hive, shouldn't it be c instead of a.b.c?
I am not sure how Hive handles that, but it cannot be c; otherwise it may cause an ambiguous reference in the parent logical plan.
e.g.
Assume we have a table tbl with schema Struct<a: Struct<b: Int, c: Int>, b: Int>
SELECT b FROM (SELECT a.b, b FROM tbl)
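To make the concern about the outer `SELECT b` concrete, here is a toy model of name resolution against a subquery's output. It is only a sketch under stated assumptions: `Column`, `resolve`, and `AmbiguityDemo` are hypothetical names, and this is not Spark's analyzer or its `Attribute` class. It compares aliasing the nested field by its last name with aliasing it by its full path:

```scala
object AmbiguityDemo {
  // Toy stand-in for a named output column: its alias and the expression it came from.
  final case class Column(name: String, source: String)

  // Resolve a name against a child's output; exactly one match is required.
  def resolve(name: String, output: Seq[Column]): Either[String, Column] =
    output.filter(_.name == name) match {
      case Seq(only) => Right(only)
      case Seq()     => Left(s"cannot resolve '$name'")
      case many      =>
        Left(s"ambiguous reference '$name': " + many.map(_.source).mkString(", "))
    }

  def main(args: Array[String]): Unit = {
    // Output of the inner query: SELECT a.b, b FROM tbl
    val lastNameAlias = Seq(Column("b", "a.b"), Column("b", "b"))   // a.b aliased as "b"
    val fullPathAlias = Seq(Column("a.b", "a.b"), Column("b", "b")) // a.b aliased as "a.b"

    println(resolve("b", lastNameAlias)) // Left: two columns are named "b"
    println(resolve("b", fullPathAlias)) // Right: only the top-level b matches
  }
}
```

In this model the last-name alias leaves the outer query with two candidates for `b`, while the full-path alias leaves one; whether that justifies changing the default alias is the question debated in the comments that follow.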
I don't think we can change the default alias when extracting nested fields. I believe we match hive behaviors now, and this would break existing queries.
Yes, I agree we shouldn't break the existing logic, but I believe this is a bug in Hive.
hive> create table struct1 as select named_struct("a", key, "b", value) as a, key as b from src limit 1;
hive> select a.b, b from struct1; -- Works
hive> create table struct2 as select a.b, b from struct1;
FAILED: SemanticException [Error 10036]: Duplicate column name: b
I am wondering if we can break Hive's naming rule for nested data type references, which always causes ambiguity.
How is this a bug? These are pretty contrived examples. How often do you actually have nested structures where the outside name is the same as the inside name?
Yeah, I shouldn't have said "always" but "possibly"; it may happen quite often with joins.
|
Test build #28284 has finished for PR 4892 at commit
|
|
Test PASSed. |
The type of resolved is Seq[NamedExpression], so we don't need to call collect here.
Good point, I will update that.
|
Hi @chenghao-intel , have you considered about something like |
|
@cloud-fan I don't think we need to consider the I was thinking we probably can remove the class |
|
Test build #28292 has started for PR 4892 at commit
|
|
Hi @chenghao-intel , I just ran a quick test, it failed. Actually the |
|
Test build #28292 has finished for PR 4892 at commit
|
|
Test PASSed. |
|
I am closing this PR. |
No description provided.