[SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain #29942

viirya · 2020-10-04T20:28:54Z

What changes were proposed in this pull request?

This PR proposes to simplify the named_struct + get struct field + from_json expression chain from struct(from_json.col1, from_json.col2, from_json.col3...) to struct(from_json).
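As a rough illustration of the rewrite, here is a toy model in plain Scala. The case classes `Column`, `FromJson`, `GetField`, and `NamedStruct` are hypothetical stand-ins invented for this sketch, not Catalyst's `JsonToStructs`, `GetStructField`, or `CreateNamedStruct`, and null handling is ignored. The conditions it checks (every field reads from the same from_json call, no field is aliased, no field name is duplicated) follow the review discussion in this thread.

```scala
// Toy expression tree. These types are illustrative only and are NOT Spark's Catalyst classes.
sealed trait Expr
case class Column(name: String) extends Expr
case class FromJson(fields: Seq[String], json: Expr) extends Expr
case class GetField(child: Expr, name: String) extends Expr
case class NamedStruct(fields: Seq[(String, Expr)]) extends Expr

object SimplifyStruct {
  // Collapses struct(from_json.a, from_json.b, ...) into one from_json call
  // whose field list is narrowed to the requested fields.
  def simplify(e: Expr): Expr = e match {
    case NamedStruct(fields) if fields.nonEmpty =>
      // Every value must read a field out of a FromJson call.
      val sources = fields.collect { case (_, GetField(fj: FromJson, _)) => fj }
      val names = fields.map(_._1)
      // The struct field name must match the extracted field name (no alias).
      val notAliased = fields.forall {
        case (name, GetField(_, field)) => name == field
        case _ => false
      }
      val allFromJson = sources.length == fields.length
      val sameSource = sources.distinct.length == 1
      val noDuplicates = names.distinct.length == names.length
      if (allFromJson && sameSource && notAliased && noDuplicates) {
        sources.head.copy(fields = names)
      } else {
        e
      }
    case other => other
  }
}
```

In this sketch, a struct built from fields `a` and `b` of the same `FromJson` collapses to a single `FromJson` that parses only `a` and `b`, while a struct that renames `a` to `a1` is left unchanged.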

Why are the changes needed?

This simplifies a complex expression tree that could be produced by the query optimizer or written by the user.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

viirya · 2020-10-04T20:36:48Z

cc @HyukjinKwon @maropu @dongjoon-hyun

SparkQA · 2020-10-04T21:15:12Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34001/

dongjoon-hyun · 2020-10-04T21:28:53Z

Thank you for pinging me, @viirya.

dongjoon-hyun · 2020-10-04T21:29:04Z

cc @sunchao

SparkQA · 2020-10-04T21:39:15Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34001/

dongjoon-hyun · 2020-10-05T00:59:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

 * The optimization includes:
 * 1. JsonToStructs(StructsToJson(child)) => child.
 * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)


3 -> 3.?

fixed. thanks.

SparkQA · 2020-10-05T01:12:48Z

Test build #129394 has finished for PR 29942 at commit 3eb2947.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-10-05T00:57:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))


nit: _.children(0) -> _.children.head, as my IDE suggested.

maropu · 2020-10-05T00:57:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

+        // alias field names.
+        if (semanticEqual && sameFieldName) {
+          val fromJson = jsonToStructs.head.asInstanceOf[JsonToStructs].copy(schema = c.dataType)
+          val nullFields = c.children.grouped(2).map {


map -> flatMap
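For readers unfamiliar with the difference, the nit can be illustrated with plain Scala collections. This is a minimal sketch, assuming the children alternate between a name and a value as the `grouped(2)` call in the diff suggests; the strings and the `"NULL"` placeholder are made up for the example and do not come from the patch.

```scala
object GroupedDemo {
  // Hypothetical flattened children: name, value, name, value.
  val children: Seq[String] = Seq("a", "valueA", "b", "valueB")

  // map keeps one inner Seq per (name, value) pair, so the result is nested.
  val nested: Seq[Seq[String]] =
    children.grouped(2).map { case Seq(name, _) => Seq(name, "NULL") }.toSeq

  // flatMap concatenates the pairs back into one flat name/value sequence.
  val flat: Seq[String] =
    children.grouped(2).flatMap { case Seq(name, _) => Seq(name, "NULL") }.toSeq
}
```

With `map`, an extra flattening step would be needed before the pairs could be used as a flat child list again; `flatMap` produces that shape directly.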

maropu · 2020-10-05T01:00:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))


Can this check be merged with L39? https://github.com/apache/spark/pull/29942/files#diff-f9d27e3c9c32aaf07bb038c779309414R39

We can, but the L39 condition will look ugly.

Hm, I see. I noticed that it loops over c.valExprs about 3 x len(c.valExprs) times to check the condition. It is a minor optimization, but I thought it would be nice if it could stop early when the condition is not met.

OK, let me change it and see how it looks.

Moved the condition to top.
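The early-exit point raised in this thread can be sketched with plain Scala collections. This is only an illustration of the general idea, not the code from the patch: `isEven` and `isSmall` are hypothetical placeholder predicates, and the counter exists purely to show how many checks each strategy performs.

```scala
object EarlyExitDemo {
  // Returns the overall result together with the number of predicate evaluations.
  def run(values: Seq[Int], combined: Boolean): (Boolean, Int) = {
    var checks = 0
    def isEven(v: Int): Boolean = { checks += 1; v % 2 == 0 }
    def isSmall(v: Int): Boolean = { checks += 1; v < 100 }
    val result =
      if (combined) {
        // One pass: forall stops at the first element that fails either test.
        values.forall(v => isEven(v) && isSmall(v))
      } else {
        // Two independent passes: the second pass runs even if the first already failed.
        val allEven = values.forall(isEven)
        val allSmall = values.forall(isSmall)
        allEven && allSmall
      }
    (result, checks)
  }
}
```

For the input `Seq(1, 2, 4, 6)`, the combined form gives up after a single check because the first element is odd, while the two-pass form still walks the whole sequence for the second predicate.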

viirya · 2020-10-05T05:58:02Z

Thanks for the quick response. Addressed the comments.

SparkQA · 2020-10-05T06:44:22Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34009/

SparkQA · 2020-10-05T07:05:01Z

Test build #129402 has finished for PR 29942 at commit 849fc50.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-05T07:05:01Z

Test build #129404 has finished for PR 29942 at commit 430d915.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-05T07:09:34Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34009/

viirya · 2020-10-05T07:11:51Z

retest this please

SparkQA · 2020-10-05T07:31:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34011/

SparkQA · 2020-10-05T07:52:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34012/

SparkQA · 2020-10-05T07:52:57Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34011/

SparkQA · 2020-10-05T08:08:59Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34012/

HyukjinKwon · 2020-10-05T10:45:46Z

@viirya just to clarify, is it to avoid calling the same from_json multiple times? How does it relate to SPARK-32939 and SPARK-32943?

SparkQA · 2020-10-05T11:28:26Z

Test build #129405 has finished for PR 29942 at commit 430d915.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sunchao · 2020-10-05T19:42:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

 * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.


For a fresh eye with no context this is still a bit confusing - does the list col1, col2 etc have to represent all columns in the json struct?

No, it can be a subset of the columns in the JSON struct. In that case, we will prune the unnecessary columns in JsonToStructs.

sunchao · 2020-10-05T19:43:54Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprsSuite.scala

+      .select(namedStruct(
+        "a1", GetStructField(JsonToStructs(schema, options, 'json), 0),
+        "b", GetStructField(JsonToStructs(schema, options, 'json), 1)).as("struct"))
+    val optimized2 = Optimizer.execute(query2.analyze)


This seems a bit repetitive. Perhaps we can create a util method for the comparison? We can test evaluation in the method too.

sunchao · 2020-10-05T19:44:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names and there is not duplicated fields in the struct.


nit: "there is not duplicated fields" -> "there is no duplicated field"

sunchao · 2020-10-05T19:46:17Z

Thanks @dongjoon-hyun for pinging. I left some comments, @viirya (sorry, some of them are stale, so please ignore those).

viirya · 2020-10-05T19:49:41Z

> @viirya just to clarify, is it to avoid calling the same from_json multiple times? How does it relate to SPARK-32939 and SPARK-32943?

This patch specifically targets one pattern: CreateNamedStruct + multiple GetStructField of the same JsonToStructs. It could be produced by the optimizer or written manually by users.

Sometimes the query optimizer can rewrite a query so that it contains many duplicated expressions, e.g. JsonToStructs. This is what SPARK-32943 wants to fix. It targets a broader problem.

As for SPARK-32939, I did not report it, so I might be missing some details from its description. We don't de-duplicate expressions across whole-stage codegen in general (only within specific operators). If we disable whole-stage codegen, the interpreted Project will de-duplicate expressions in some cases (GenerateUnsafeProjection), but not always (it could also fall back to InterpretedUnsafeProjection). For specific expressions like CaseWhen, we have a chance to de-duplicate the condition expressions, if we want.

SparkQA · 2020-10-05T20:11:23Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34027/

SparkQA · 2020-10-05T20:28:06Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34027/

SparkQA · 2020-10-05T23:42:47Z

Test build #129420 has finished for PR 29942 at commit e40118a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-10-06T00:27:29Z

No more comments, and it looks okay.

viirya · 2020-10-06T00:39:31Z

Thanks @maropu

SparkQA · 2020-10-06T01:29:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34037/

HyukjinKwon · 2020-10-06T01:43:13Z

No more comments from me either. I am okay with this given that we have a plan for related tickets (#29942 (comment)).

SparkQA · 2020-10-06T01:48:01Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34037/

viirya · 2020-10-06T01:48:46Z

Thanks @HyukjinKwon

SparkQA · 2020-10-06T05:18:56Z

Test build #129430 has finished for PR 29942 at commit a1b464f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-06T22:32:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34076/

SparkQA · 2020-10-06T22:50:27Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34076/

dongjoon-hyun

+1, LGTM. Thank you, @viirya and all.
Merged to master for Apache Spark 3.1.0 on December 2020.

viirya · 2020-10-06T23:59:47Z

Thanks!

SparkQA · 2020-10-07T00:10:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34084/

SparkQA · 2020-10-07T00:34:35Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34084/

SparkQA · 2020-10-07T03:00:43Z

Test build #129469 has finished for PR 29942 at commit 2c76a91.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-07T04:57:08Z

Test build #129477 has finished for PR 29942 at commit 73320e8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…on expression chain

### What changes were proposed in this pull request?

This proposes to simplify named_struct + get struct field + from_json expression chain from `struct(from_json.col1, from_json.col2, from_json.col3...)` to `struct(from_json)`.

### Why are the changes needed?

Simplify complex expression tree that could be produced by query optimization or user.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#29942 from viirya/SPARK-33007.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Simplify named_struct + from_json.

3eb2947

dongjoon-hyun reviewed Oct 5, 2020

maropu reviewed Oct 5, 2020

Address comments.

849fc50

Move condition to pattern condition.

430d915

HyukjinKwon reviewed Oct 5, 2020

HyukjinKwon reviewed Oct 5, 2020

Comment style.

e40118a

sunchao reviewed Oct 5, 2020

Address comments.

a1b464f

dongjoon-hyun reviewed Oct 6, 2020

Fix comment.

2c76a91

dongjoon-hyun reviewed Oct 6, 2020

dongjoon-hyun reviewed Oct 6, 2020

Improve comment.

73320e8

dongjoon-hyun approved these changes Oct 6, 2020

dongjoon-hyun closed this in 57ed5a8 Oct 6, 2020

viirya deleted the SPARK-33007 branch December 27, 2023 18:28
