@chenghao-intel
Contributor

No description provided.

@SparkQA

SparkQA commented Mar 4, 2015

Test build #28268 has started for PR 4892 at commit b82c8c5.

  • This patch merges cleanly.

@chenghao-intel
Contributor Author

cc @marmbrus @cloud-fan This is a quick fix; in the long term, we should resolve the whole nested attribute sequence in one pass.

What do you think?

Contributor Author

val i1 = "a"
val i2 = "b"
val rest: Seq[String] = Nil
println(i1 + "." + i2 + rest.mkString(".", ".", ""))

This outputs a.b., but what we expect is a.b
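
The trailing dot comes from how the three-argument mkString works: the start and end strings are emitted even when the sequence is empty. A minimal sketch for the Scala REPL (the helper name `joined` is only for illustration and is not part of the patch):

```scala
// mkString(start, sep, end) always emits start and end, even for an empty Seq.
val i1 = "a"
val i2 = "b"

// Hypothetical helper mirroring the expression quoted above.
def joined(rest: Seq[String]): String =
  i1 + "." + i2 + rest.mkString(".", ".", "")

println(joined(Nil))       // a.b.   (the start string "." is still emitted)
println(joined(Seq("c")))  // a.b.c
```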

Contributor

It was my mistake; I didn't test the mkString method on Nil.

Contributor

How about:
UnresolvedAttribute(i1 + "." + i2 + rest.mkString("."))

Contributor Author

val i1 = "a"
val i2 = "b"
val rest: Seq[String] = "c" :: Nil
println(i1 + "." + i2 + rest.mkString("."))

This outputs a.bc

Contributor

That makes sense, thanks!
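
One way to sidestep both edge cases discussed in this exchange is to put every name part into a single sequence and join it once. This is only a sketch of the idea, not necessarily what the patch does; `dottedName` is a hypothetical helper.

```scala
// Hypothetical helper: join all name parts with a single separator.
// Prepending i1 and i2 to rest means an empty rest needs no special case.
def dottedName(i1: String, i2: String, rest: Seq[String]): String =
  (i1 +: i2 +: rest).mkString(".")

println(dottedName("a", "b", Nil))       // a.b
println(dottedName("a", "b", Seq("c")))  // a.b.c
```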

@SparkQA

SparkQA commented Mar 4, 2015

Test build #28268 has finished for PR 4892 at commit b82c8c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28268/

@marmbrus
Contributor

marmbrus commented Mar 4, 2015

Hmm, this solves some problems, but not all of them:

sqlContext.jsonRDD(sc.parallelize("""{"a": {"a": {"a": 1}}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")

sqlContext.sql("SELECT a.a.a FROM nestedOrder ORDER BY a.a.a")


org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type LongType;
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.resolveGetField(Analyzer.scala:307)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:271)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
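
One plausible reading of this failure, offered as an assumption rather than a confirmed diagnosis: if the projected column `a.a.a` is aliased to its last field name `a`, then `ORDER BY a.a.a` first matches that Long-typed output column and then tries to read a field from it. The toy model below illustrates that reading. All types and the `resolve` function are hypothetical stand-ins, not Spark's Catalyst classes.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst types.
sealed trait DataType
case object LongType extends DataType
final case class StructType(fields: Map[String, DataType]) extends DataType

// Resolve a dotted path against named output columns:
// the first part picks a column, every further part is a field access.
def resolve(path: String, output: Map[String, DataType]): Either[String, DataType] = {
  val parts = path.split('.').toList
  output.get(parts.head) match {
    case None => Left(s"cannot resolve '${parts.head}'")
    case Some(start) =>
      parts.tail.foldLeft[Either[String, DataType]](Right(start)) {
        case (Right(StructType(fields)), name) =>
          fields.get(name).toRight(s"no such field '$name'")
        case (Right(other), _) =>
          Left(s"GetField is not valid on fields of type $other")
        case (err, _) => err
      }
  }
}

// Table schema from the example: a: struct<a: struct<a: long>>, c: long
val table: Map[String, DataType] = Map(
  "a" -> StructType(Map("a" -> StructType(Map("a" -> LongType)))),
  "c" -> LongType
)

// Assumed output of SELECT a.a.a when the alias is the last field name.
val projected: Map[String, DataType] = Map("a" -> LongType)

println(resolve("a.a.a", table))      // Right(LongType)
println(resolve("a.a.a", projected))  // Left(GetField is not valid on fields of type LongType)
```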

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28273 has started for PR 4892 at commit 44ac17d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28273 has finished for PR 4892 at commit 44ac17d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28273/

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28276 has started for PR 4892 at commit 47db754.

  • This patch merges cleanly.

@chenghao-intel
Contributor Author

Thank you @marmbrus for the review. We do need to specify an alias for the nested attribute expression, but it should not be the last attribute name. The code is updated; let's see if it breaks anything.

Contributor Author

Moved this function into LogicalPlan.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28276 has finished for PR 4892 at commit 47db754.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28276/

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28279 has started for PR 4892 at commit 5de3b9e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28279 has finished for PR 4892 at commit 5de3b9e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28279/

@cloud-fan
Contributor

Hi @marmbrus @chenghao-intel, I have a simpler fix here

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28284 has started for PR 4892 at commit 73eb346.

  • This patch merges cleanly.

Member

The Python test failure is caused by replacing aliasName with name here. Is that okay? SELECT a.b.c FROM table would now get an attribute named a.b.c instead of c, as it was before.

Contributor Author

Thank you @viirya, I've updated the Python code.

Member

I meant: should it be that? In Hive, shouldn't it be c instead of a.b.c?

Contributor Author

I am not sure how Hive handles that, but it cannot be c; otherwise it may cause an ambiguous reference in the parent logical plan.

e.g.
Assume we have a table tbl with schema Struct<a: Struct<b: Int, c: Int>, b: Int>

SELECT b FROM (SELECT a.b, b FROM tbl)
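
The naming clash in this example can be sketched with a toy comparison of two default-alias rules. Both helpers are hypothetical and are not Spark or Hive APIs; they only model "alias is the last field name" versus "alias is the full dotted path".

```scala
// Hypothetical alias rules, for illustration only.
def lastPartAlias(path: String): String = path.split('.').last
def fullPathAlias(path: String): String = path

// Names that occur more than once in a projection's output.
def duplicates(names: Seq[String]): Set[String] =
  names.groupBy(identity).collect { case (n, ns) if ns.size > 1 => n }.toSet

// Select list of the inner query: SELECT a.b, b FROM tbl
val selectList = Seq("a.b", "b")

println(duplicates(selectList.map(lastPartAlias)))  // Set(b)
println(duplicates(selectList.map(fullPathAlias)))  // Set()
```

With the last-part rule, the outer `SELECT b` has two candidate columns named b; with the full-path rule, the output names stay distinct.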

Contributor

I don't think we can change the default alias when extracting nested fields. I believe we match Hive's behavior now, and this would break existing queries.

Contributor Author

Yes, I agree we shouldn't break the existing logic, but I believe this is a bug in Hive.

hive> create table struct1 as select named_struct("a", key, "b", value) as a, key as b from src limit 1;
hive> select a.b, b from struct1; -- Works
hive> create table struct2 as select a.b, b from struct1;
FAILED: SemanticException [Error 10036]: Duplicate column name: b

I am wondering if we can break Hive's naming rule for nested data type references, which always causes ambiguity.

Contributor

How is this a bug? These are pretty contrived examples. How often do you actually have nested structures where the outside name is the same as the inside name?

Contributor Author

Yeah, I shouldn't have said "always", but rather "possibly"; it may happen quite often with joins.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28284 has finished for PR 4892 at commit 73eb346.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28284/

Contributor

The type of resolved is Seq[NamedExpression], so we don't need to call collect here.
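
A small sketch of why the collect is redundant. The traits and the `Attr` class are simplified stand-ins invented for this example, not the Catalyst definitions.

```scala
// Hypothetical, simplified stand-ins for the Catalyst types named above.
trait Expression
trait NamedExpression extends Expression { def name: String }
final case class Attr(name: String) extends NamedExpression

val resolved: Seq[NamedExpression] = Seq(Attr("a"), Attr("b"))

// Every element already has the static type NamedExpression,
// so this type-pattern collect keeps every element unchanged.
val collected: Seq[NamedExpression] = resolved.collect { case n: NamedExpression => n }

println(collected == resolved)  // true
```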

Contributor Author

Good point, I will update that.

@cloud-fan
Contributor

Hi @chenghao-intel, have you considered something like a.b[0].c? It will be parsed into UnresolvedGetField(GetItem(UnresolvedAttribute("a.b"), 0), "c"), so you can't resolve the whole GetField chain in one pass in LogicalPlan#resolve.
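
The shape of that parsed expression can be sketched with simplified case classes. These definitions are hypothetical stand-ins that only reuse the class names from the comment (for instance, the ordinal is modelled as a plain Int here), and `isPureFieldChain` is an invented helper.

```scala
// Hypothetical, simplified stand-ins for the expression classes named above.
sealed trait Expr
final case class UnresolvedAttribute(name: String) extends Expr
final case class GetItem(child: Expr, ordinal: Int) extends Expr
final case class UnresolvedGetField(child: Expr, fieldName: String) extends Expr

// a.b[0].c, as described in the comment above
val parsed: Expr = UnresolvedGetField(GetItem(UnresolvedAttribute("a.b"), 0), "c")

// True when the expression is nothing but a dotted name,
// i.e. something a single name-resolution step could handle.
def isPureFieldChain(e: Expr): Boolean = e match {
  case UnresolvedAttribute(_)       => true
  case UnresolvedGetField(child, _) => isPureFieldChain(child)
  case GetItem(_, _)                => false
}

println(isPureFieldChain(parsed))  // false
```

The GetItem node sits between the attribute and the outer field access, so the dotted name alone does not describe the whole chain.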

@chenghao-intel
Contributor Author

@cloud-fan I don't think we need to consider the GetItem here; see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala#L374

I was thinking we could probably also remove the UnresolvedGetField class, but let's keep that as a future improvement, since it would be a big change.

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28292 has started for PR 4892 at commit 9209ac1.

  • This patch merges cleanly.

@cloud-fan
Contributor

Hi @chenghao-intel, I just ran a quick test, and it failed.

  test("quick") {
    jsonRDD(sparkContext.makeRDD(
      """{"a": [{"b": 1}], "c0": {"a": 1}}""" :: Nil)).registerTempTable("t")
    sql("SELECT a[0].b FROM t ORDER BY c0.a").queryExecution.analyzed
  }

Actually, the GetItem matters: it makes the alias name of a.a[0].a become the auto-generated "c0". Otherwise, with your fix, the alias name of a.a.a will be "a.a.a".

@SparkQA

SparkQA commented Mar 5, 2015

Test build #28292 has finished for PR 4892 at commit 9209ac1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28292/

@chenghao-intel
Contributor Author

I am closing this PR.

@chenghao-intel chenghao-intel deleted the orderby branch July 2, 2015 08:38