[SPARK-13135][SQL] Don't print expressions recursively in generated code #13192

dongjoon-hyun · 2016-05-19T08:12:04Z

What changes were proposed in this pull request?

This PR is an updated and slightly improved version of @rxin's #11019. Its main purpose is

(1) preventing recursive printing of expressions in generated code.

Since that is the major function of this PR, @rxin should be credited for the work he did. In addition to #11019, this PR improves the following aspects of code generation.

(2) Improve multiline comment indentation.
(3) Reduce the number of empty lines (mainly consecutive empty lines).
(4) Remove all space characters on empty lines.
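Items (3) and (4) can be illustrated with a small standalone sketch. This is not the code in this PR or in Spark's CodeFormatter; the object and method names are hypothetical, and it only shows one plausible way to empty whitespace-only lines and collapse runs of blank lines.

```scala
// Hypothetical helper, not part of Spark: illustrates items (3) and (4) only.
object BlankLineCleanup {
  /** Empties whitespace-only lines, then keeps at most one blank line in a row. */
  def normalize(code: String): String = {
    // Item (4): a line holding only spaces or tabs becomes a truly empty line.
    val lines = code.split("\n").toIndexedSeq.map(l => if (l.trim.isEmpty) "" else l)
    // Item (3): drop a blank line when the line before it is also blank.
    val kept = lines.zipWithIndex.collect {
      case (l, i) if l.nonEmpty || i == 0 || lines(i - 1).nonEmpty => l
    }
    kept.mkString("\n")
  }
}
```

For instance, `BlankLineCleanup.normalize("a\n   \n\n  b")` keeps a single empty line between `a` and the still-indented `  b`.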

Example

spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6)

Before

Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */ * Codegend pipeline for
/* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */ * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */ */
...
/* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */     
/* 077 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 078 */     
/* 079 */     // initialize Range
...
/* 092 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 093 */       
/* 094 */       // CONSUME: WholeStageCodegen
/* 095 */       
/* 096 */       // (((input[0, bigint, false] + 1) + 2) + 3)
/* 097 */       // ((input[0, bigint, false] + 1) + 2)
/* 098 */       // (input[0, bigint, false] + 1)
...
/* 107 */       // (((input[0, bigint, false] + 4) + 5) + 6)
/* 108 */       // ((input[0, bigint, false] + 4) + 5)
/* 109 */       // (input[0, bigint, false] + 4)
...
/* 126 */ }

After

Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */  * Codegend pipeline for
/* 007 */  * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */  * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */  */
...
/* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 077 */     // initialize Range
...
/* 090 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 091 */       // CONSUME: WholeStageCodegen
/* 092 */       // (((input[0, bigint, false] + 1) + 2) + 3)
...
/* 101 */       // (((input[0, bigint, false] + 4) + 5) + 6)
...
/* 118 */ }

How was this patch tested?

Pass the Jenkins tests and see the result of the following command manually.

scala> spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6).queryExecution.debug.codegen()

Author: Dongjoon Hyun
Author: Reynold Xin

dongjoon-hyun · 2016-05-19T08:16:51Z

Hi, @rxin.
This is a first attempt, as you requested.
I removed some obsolete code from #11019 so that the tests pass.
Please let me know if I missed anything.

cc @cloud-fan @nongli

rxin · 2016-05-19T09:22:58Z

cc @sameeragarwal / @davies

SparkQA · 2016-05-19T09:34:22Z

Test build #58855 has finished for PR 13192 at commit ea7de3b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-19T16:41:41Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala

We can't remove the empty lines here, or the line numbers (LINENO) of the compiled code will differ from those of the formatted code.

Thank you for the review, @davies.
Oh, I thought CodeFormatter.format was also called before Janino and the Guava loading cache.
I'll make that consistent this afternoon; then it should be okay.
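The line-number concern can be made concrete with a small sketch. It assumes, purely for illustration, that the compiler sees one version of the source while the reader sees a version with blank lines removed; the object and method names are hypothetical and not part of Spark.

```scala
// Hypothetical illustration of line-number drift; not Spark code.
object LineNumberDrift {
  /** 1-based number of the first line containing `needle`, if any. */
  def lineOf(code: String, needle: String): Option[Int] = {
    val idx = code.split("\n").indexWhere(_.contains(needle))
    if (idx >= 0) Some(idx + 1) else None
  }

  /** Removes every blank or whitespace-only line. */
  def dropBlankLines(code: String): String =
    code.split("\n").filter(_.trim.nonEmpty).mkString("\n")
}
```

With `val compiled = "int a = 1;\n\n\nint b = a / 0;"`, the statement `a / 0` sits on line 4 of the compiled text but on line 2 of `LineNumberDrift.dropBlankLines(compiled)`, so an error reported against the compiled source would point at the wrong line of the displayed code unless both go through the same formatting.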

dongjoon-hyun · 2016-05-20T07:02:52Z

As @rxin said, what was really needed is removing overlapping comments.
So I rethought this and reverted the change to Expression.gen that removed the code field.
The code fields have their own meaning and are still valuable.
Instead, I can achieve the goal simply by adding CodeFormatter.stripOverlappingComments.
I also updated the description of this PR.

SparkQA · 2016-05-20T08:20:50Z

Test build #58960 has finished for PR 13192 at commit 5ffb249.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-05-20T08:35:13Z

Retest this please

SparkQA · 2016-05-20T09:48:43Z

Test build #58971 has finished for PR 13192 at commit 5ffb249.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-05-20T15:46:42Z

The PySpark failure was fixed by a HOTFIX.

SparkQA · 2016-05-20T17:22:21Z

Test build #59005 has finished for PR 13192 at commit 2018d9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-05-20T17:25:18Z

Hi, @davies .
It's ready for review, again!

davies · 2016-05-20T18:05:28Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala

Now that #12979 has been merged, this may no longer work.

davies · 2016-05-20T18:07:22Z

@dongjoon-hyun Maybe we could have a method Expression.genCodeWithComment() that is used by generated projections and operators, while Expression.genCode(), called by other Expressions, would not include a comment. This requires changes in more places, so I'm not sure whether it's a good idea.
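A toy model may help picture this suggestion. The expression types below are hypothetical and far simpler than Spark's Expression; the sketch only assumes that children are generated through the comment-free method, so a comment appears once, at the level where the top-level caller asks for it.

```scala
// Toy expression tree, not Spark's Expression API; all names are illustrative.
object CommentedCodegenSketch {
  sealed trait Expr {
    def sql: String
    /** Code only. Parents call this on their children, so no nested comments appear. */
    def genCode: String
    /** Code preceded by one comment; meant for top-level callers such as projections. */
    def genCodeWithComment: String = s"// $sql\n$genCode"
  }

  final case class Input(name: String) extends Expr {
    def sql: String = name
    def genCode: String = name
  }

  final case class Lit(value: Long) extends Expr {
    def sql: String = value.toString
    def genCode: String = s"${value}L"
  }

  final case class Add(left: Expr, right: Expr) extends Expr {
    def sql: String = s"(${left.sql} + ${right.sql})"
    def genCode: String = s"(${left.genCode} + ${right.genCode})"
  }
}
```

Calling `genCodeWithComment` on `Add(Add(Input("id"), Lit(1)), Lit(2))` yields a single `// ((id + 1) + 2)` comment followed by the code, rather than one comment per nested sub-expression.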

dongjoon-hyun · 2016-05-20T18:24:25Z

Yes, there were huge changes. I had seen that PR before, but I didn't take it into account in this PR.
My bad. Let me think about how to achieve the original goal on the new master branch.
Thank you, @davies .

dongjoon-hyun · 2016-05-20T23:21:53Z

Hi, @davies and @rxin .
I updated the code and description again according to the current master.

SparkQA · 2016-05-21T00:15:59Z

Test build #59039 has finished for PR 13192 at commit 8257e78.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-05-22T19:40:43Z

Hi, @davies and @rxin .
Could you review this PR again when you have some time?

davies · 2016-05-24T05:58:47Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala

+      val line = l.trim()
+      val skip = lastLine.startsWith("/*") && lastLine.endsWith("*/") &&
+        line.startsWith("/*") && line.endsWith("*/") &&
+        map(lastLine).substring(3).contains(map(line).substring(3))


Have you checked that this actually works? I think we have placeholders here, so it will not find any duplicated comments to skip.

Oh, it should work; I missed the map. Will it have a performance issue?

I think the performance is okay.

This function is called once for each CodeAndComment creation.

It scans codeAndComment.body once.

A map lookup occurs at most twice per line, and it does not cost much in this case.

Also, the skip condition only checks consecutive comment lines.
If there is anything more to do, please let me know, @davies.
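The single pass described above can be sketched as follows. This is a simplified stand-in rather than the merged implementation: `CodeWithComments` and `stripOverlapping` are hypothetical names, and the sketch assumes each placeholder such as `/*c1*/` occupies a whole line and maps to a comment that starts with `// `.

```scala
// Simplified stand-in for the discussed logic; names and types are assumptions.
object OverlapSketch {
  // `body` holds placeholder lines; `comments` maps each placeholder to its text.
  final case class CodeWithComments(body: String, comments: Map[String, String])

  def stripOverlapping(c: CodeWithComments): CodeWithComments = {
    def isPlaceholder(s: String): Boolean = s.startsWith("/*") && s.endsWith("*/")

    val kept = scala.collection.mutable.ArrayBuffer.empty[String]
    var lastLine = ""
    // One scan over the body; each line costs at most two map lookups.
    c.body.split("\n").foreach { raw =>
      val line = raw.trim
      // Skip a comment whose text is already contained in the previous comment.
      val skip = isPlaceholder(lastLine) && isPlaceholder(line) && (for {
        prev <- c.comments.get(lastLine)
        cur <- c.comments.get(line)
      } yield prev.stripPrefix("// ").contains(cur.stripPrefix("// "))).getOrElse(false)
      if (!skip) kept += raw
      // Updated even when skipping, so a chain of nested comments collapses to the first.
      lastLine = line
    }
    c.copy(body = kept.mkString("\n"))
  }
}
```

Given three consecutive placeholders for `(((input + 1) + 2) + 3)`, `((input + 1) + 2)`, and `(input + 1)`, only the first survives. A placeholder that shares a line with code would not be recognized by this check, since the test looks at whole trimmed lines.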

davies · 2016-05-24T17:06:46Z

LGTM,
Merging this into master and 2.0, thanks!

Closes #13192 from dongjoon-hyun/SPARK-13135.

dongjoon-hyun · 2016-05-24T17:26:46Z

Thank you, @davies !

cloud-fan · 2016-05-24T17:47:12Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala

+    var lastLine: String = "dummy"
+    codeAndComment.body.split('\n').foreach { l =>
+      val line = l.trim()
+      val skip = lastLine.startsWith("/*") && lastLine.endsWith("*/") &&


Are we assuming the comment holder will always take an entire line?

davies reviewed May 19, 2016

davies reviewed May 20, 2016

[SPARK-13135][SQL] Don't print expressions recursively in generated code

8257e78

davies reviewed May 24, 2016

asfgit closed this in f8763b8 May 24, 2016

cloud-fan reviewed May 24, 2016

dongjoon-hyun deleted the SPARK-13135 branch July 20, 2016 07:35
