@viirya
Member

@viirya viirya commented Oct 4, 2020

What changes were proposed in this pull request?

This proposes to simplify the named_struct + get struct field + from_json expression chain from struct(from_json.col1, from_json.col2, from_json.col3...) to struct(from_json).
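
A minimal sketch of the rewrite on a toy expression tree may help make the pattern concrete. The case classes `Column`, `FromJson`, `GetField` and `NamedStruct` below are illustrative stand-ins, not Spark's Catalyst classes, and the sketch leaves out null handling and everything else the real rule has to deal with.

```scala
// Toy expression tree; all names here are hypothetical stand-ins.
object StructFromJsonSketch {
  sealed trait Expr
  case class Column(name: String) extends Expr
  // Parses `child` as JSON into a struct with the given field names.
  case class FromJson(fields: Seq[String], child: Expr) extends Expr
  // Extracts the field at `ordinal` from a struct-valued child.
  case class GetField(child: Expr, ordinal: Int) extends Expr
  // Builds a struct from (name, value) pairs.
  case class NamedStruct(fields: Seq[(String, Expr)]) extends Expr

  def simplify(e: Expr): Expr = e match {
    case NamedStruct(fields) if fields.nonEmpty =>
      // Every value must be a field access on a FromJson expression.
      val sources = fields.collect { case (_, GetField(f: FromJson, _)) => f }
      val allFromSameJson =
        sources.length == fields.length && sources.forall(_ == sources.head)
      // The struct must keep the original field names (no aliasing).
      val namesUnchanged = fields.forall {
        case (name, GetField(f: FromJson, i)) => f.fields.lift(i).contains(name)
        case _ => false
      }
      val names = fields.map(_._1)
      val noDuplicates = names.distinct.length == names.length
      if (allFromSameJson && namesUnchanged && noDuplicates) {
        // One parse that yields exactly the fields the struct asked for.
        FromJson(names, sources.head.child)
      } else e
    case other => other
  }
}
```

In this model, a struct built from `col1`, `col2` and `col3` of the same `FromJson` collapses to that single `FromJson`, while a struct that renames a field or mixes two different JSON sources is left untouched.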

Why are the changes needed?

Simplify complex expression trees that could be produced by query optimization or written by a user.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

@viirya
Member Author

viirya commented Oct 4, 2020

@SparkQA

SparkQA commented Oct 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34001/

@dongjoon-hyun
Member

Thank you for pinging me, @viirya.

@dongjoon-hyun
Member

cc @sunchao

@SparkQA

SparkQA commented Oct 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34001/

* The optimization includes:
* 1. JsonToStructs(StructsToJson(child)) => child.
* 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
* 3 struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
Member

3 -> 3.?

Member Author

fixed. thanks.

@SparkQA

SparkQA commented Oct 5, 2020

Test build #129394 has finished for PR 29942 at commit 3eb2947.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case c: CreateNamedStruct
    if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
      v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
  val jsonToStructs = c.valExprs.map(_.children(0))
Member

nit: _.children(0) -> _.children.head my IDE suggested.
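
As a small illustration of this nit, both forms return the first element of a sequence, and `head` states the intent more directly. The list below is an arbitrary stand-in for an expression's children, and the `headOption` line is an extra assumption that is not part of the suggestion.

```scala
object HeadVsIndexSketch {
  // Stand-in for an expression's children; the element type is arbitrary here.
  val children: List[String] = List("jsonToStructs", "ordinal")

  val byIndex: String = children(0)   // positional access
  val byHead: String = children.head  // same element, reads as "the first child"

  // Assumption beyond the thread: headOption returns None instead of throwing on empty input.
  val safe: Option[String] = List.empty[String].headOption
}
```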

// alias field names.
if (semanticEqual && sameFieldName) {
  val fromJson = jsonToStructs.head.asInstanceOf[JsonToStructs].copy(schema = c.dataType)
  val nullFields = c.children.grouped(2).map {
Member

map -> flatMap
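
A minimal sketch of why `flatMap` fits here, assuming the children alternate name, value, name, value as in the snippet above. Plain strings stand in for expressions and the literal "null" stands in for a null value expression; both are simplifications for illustration.

```scala
object GroupedFlatMapSketch {
  // Children alternate name, value, name, value, ... (plain strings here).
  val children: Seq[String] = Seq("a", "valueA", "b", "valueB")

  // map keeps one inner Seq per (name, value) pair.
  val nested: Seq[Seq[String]] =
    children.grouped(2).map { case Seq(name, _) => Seq(name, "null") }.toSeq

  // flatMap concatenates the rewritten pairs back into a flat children list.
  val flat: Seq[String] =
    children.grouped(2).flatMap { case Seq(name, _) => Seq(name, "null") }.toSeq
}
```

With `map`, a separate flattening step would still be needed before the result could be used as a flat list of children.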

    if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
      v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
  val jsonToStructs = c.valExprs.map(_.children(0))
  val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))
Member

Member Author

We can but L39 condition will look ugly.

Member

Hm, I see. I noticed that it loops over c.valExprs {3 x len(c.valExprs)} times to check the condition. It's a minor optimization, but I thought it would be nice if it could stop early when the condition is not met.

Member Author

Ok, let me change it and see how it looks.

Member Author

Moved the condition to top.
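
The early-exit point in this thread can be sketched with plain collections: `forall` stops at the first element that fails, so folding the checks into one predicate avoids extra passes. The predicates and the counting helper `inspected` are hypothetical and only serve to make the number of inspected elements visible.

```scala
object EarlyExitSketch {
  // Counts how many elements `forall` actually inspects before it returns.
  def inspected(values: Seq[Int])(p: Int => Boolean): Int = {
    var n = 0
    values.forall { v => n += 1; p(v) }
    n
  }

  val values: Seq[Int] = Seq(-1, 2, 3, 4)
  val isSmall: Int => Boolean = _ < 10
  val isPositive: Int => Boolean = _ > 0

  // Separate passes: the first check walks the whole sequence before the second one fails.
  val separate: Int = inspected(values)(isSmall) + inspected(values)(isPositive)

  // Combined pass: stops at the first element that fails either check.
  val combined: Int = inspected(values)(v => isSmall(v) && isPositive(v))
}
```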

@viirya
Member Author

viirya commented Oct 5, 2020

Thanks for quick response. Addressed the comments.

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34009/

@SparkQA

SparkQA commented Oct 5, 2020

Test build #129402 has finished for PR 29942 at commit 849fc50.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 5, 2020

Test build #129404 has finished for PR 29942 at commit 430d915.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34009/

@viirya
Member Author

viirya commented Oct 5, 2020

retest this please

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34011/

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34012/

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34011/

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34012/

@HyukjinKwon
Member

@viirya just to clarify, is it to avoid calling the same from_json multiple times? How does it relate to SPARK-32939 and SPARK-32943?

@SparkQA

SparkQA commented Oct 5, 2020

Test build #129405 has finished for PR 29942 at commit 430d915.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
* 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
* CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
* fields of CreateNamedStruct.
Member

For a fresh eye with no context this is still a bit confusing - does the list col1, col2 etc have to represent all columns in the json struct?

Member Author

No, it could be part of the json struct. In that case, we will prune unnecessary columns in JsonToStructs.
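
The pruning mentioned in this answer can be sketched on a toy schema. Here a schema is just an ordered list of (field name, type name) pairs, which is an assumption made for illustration rather than Spark's StructType, and `prune` is a hypothetical helper.

```scala
object PruneSchemaSketch {
  // A schema as ordered (fieldName, typeName) pairs; a stand-in for a real struct type.
  type Schema = Seq[(String, String)]

  // Keeps only the fields at the requested ordinals, in the requested order.
  def prune(full: Schema, ordinals: Seq[Int]): Schema = ordinals.map(i => full(i))
}
```

For example, requesting ordinals 0 and 1 from a three-field schema leaves a two-field schema, so the third field would no longer need to be parsed.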

.select(namedStruct(
  "a1", GetStructField(JsonToStructs(schema, options, 'json), 0),
  "b", GetStructField(JsonToStructs(schema, options, 'json), 1)).as("struct"))
val optimized2 = Optimizer.execute(query2.analyze)
Member

seems this is a bit repetitive - perhaps we can create a util method for the comparison? we can test evaluation in the method too.

Member Author

Ok.
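
One possible shape for such a utility, sketched without any Spark dependency: a generic helper that compares the rewritten value with the expected one and also checks that evaluation is unchanged. `checkRewrite`, `dropZeros` and `sum` are hypothetical names; the helper actually added in the patch may look quite different.

```scala
object RewriteCheckSketch {
  // Hypothetical helper: applies `rule`, compares the result with `expected`,
  // and checks that evaluating the rewritten value matches evaluating the input.
  def checkRewrite[A, R](rule: A => A, eval: A => R)(input: A, expected: A): Unit = {
    val actual = rule(input)
    assert(actual == expected, s"expected $expected but got $actual")
    assert(eval(actual) == eval(input), "rewrite changed the evaluated result")
  }

  // Toy rule and evaluator used only to exercise the helper.
  val dropZeros: List[Int] => List[Int] = _.filter(_ != 0)
  val sum: List[Int] => Int = _.sum
}
```

Each test case then becomes a single call such as `checkRewrite(dropZeros, sum)(List(1, 0, 2), List(1, 2))`, which keeps the repeated optimize-and-compare steps in one place.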

val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length

// If we create struct from various fields of the same `JsonToStructs` and we don't
// alias field names and there is not duplicated fields in the struct.
Member

nit: "there is not duplicated fields" -> "there is no duplicated field"

Member Author

fixed.
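
The duplicate check in the snippet above compares the number of distinct names with the total number of names. A minimal sketch on plain strings, where `hasDuplicates` is a hypothetical helper name:

```scala
object DuplicateFieldsSketch {
  // True when at least one field name occurs more than once (case-sensitive comparison).
  def hasDuplicates(names: Seq[String]): Boolean =
    names.distinct.length != names.length
}
```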

@sunchao
Member

sunchao commented Oct 5, 2020

Thanks @dongjoon-hyun for pinging. I left some comments, @viirya (sorry, some comments are stale so pls ignore them).

@viirya
Member Author

viirya commented Oct 5, 2020

@viirya just to clarify, is it to avoid calling the same from_json multiple times? How does it relate to SPARK-32939 and SPARK-32943?

This patch specifically targets a special pattern: CreateNamedStruct + multiple GetStructField of the same JsonToStructs. It could be produced by the optimizer or written by users manually.

Sometimes the query optimizer can optimize a query so that it contains many duplicated expressions, e.g. JsonToStructs. This is what SPARK-32943 wants to fix. It targets a broader problem.

For SPARK-32939, because it was not reported by me, I might be missing some details from its description. We don't de-duplicate expressions in whole-stage codegen overall (only in specific operators). If we disable whole-stage codegen, interpreted Project will de-duplicate expressions in some cases (GenerateUnsafeProjection), but not always (we could also fall back to InterpretedUnsafeProjection). For specific expressions like CaseWhen, we have a chance to de-duplicate the condition expressions, if we want.
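
To illustrate the cost of duplicated expressions in general terms, the sketch below counts how often a parse runs when each field access carries its own copy of the parse, compared with a single shared parse. The `parse` function and its "a=1,b=2" input format are hypothetical stand-ins for from_json, and the counter exists only for the demonstration; how often Spark actually evaluates a duplicated expression depends on the execution path, as described above.

```scala
object SharedParseSketch {
  var parseCalls = 0

  // Hypothetical stand-in for from_json: "a=1,b=2" becomes Map("a" -> "1", "b" -> "2").
  def parse(input: String): Map[String, String] = {
    parseCalls += 1
    input.split(",").map { kv =>
      val parts = kv.split("=")
      parts(0) -> parts(1)
    }.toMap
  }

  // Shape of struct(from_json.a, from_json.b): one parse per extracted field.
  def perField(input: String): (String, String) =
    (parse(input)("a"), parse(input)("b"))

  // Shape of struct(from_json): parse once and read both fields from the result.
  def shared(input: String): (String, String) = {
    val parsed = parse(input)
    (parsed("a"), parsed("b"))
  }
}
```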

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34027/

@SparkQA

SparkQA commented Oct 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34027/

@SparkQA

SparkQA commented Oct 5, 2020

Test build #129420 has finished for PR 29942 at commit e40118a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Oct 6, 2020

No more comment and it looks okay.

@viirya
Member Author

viirya commented Oct 6, 2020

Thanks @maropu

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34037/

@HyukjinKwon
Member

No more comments from me too. I am okay with this given that we have a plan for related tickets (#29942 (comment)).

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34037/

@viirya
Member Author

viirya commented Oct 6, 2020

Thanks @HyukjinKwon

@SparkQA

SparkQA commented Oct 6, 2020

Test build #129430 has finished for PR 29942 at commit a1b464f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34076/

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34076/

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @viirya and all.
Merged to master for Apache Spark 3.1.0 on December 2020.

@viirya
Member Author

viirya commented Oct 6, 2020

Thanks!

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34084/

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34084/

@SparkQA

SparkQA commented Oct 7, 2020

Test build #129469 has finished for PR 29942 at commit 2c76a91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 7, 2020

Test build #129477 has finished for PR 29942 at commit 73320e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…on expression chain

Closes apache#29942 from viirya/SPARK-33007.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya viirya deleted the SPARK-33007 branch December 27, 2023 18:28
6 participants