[SPARK-14554][SQL] disable whole stage codegen if there are too many input columns #12322
Conversation
LGTM, could you update the description?

ok updated.

Test build #55584 has finished for PR 12322 at commit

Test build #55586 has finished for PR 12322 at commit

Thanks. Merging to master.
test("SPARK-14554: Dataset.map may generate wrong java code for wide table") {
  val wideDF = sqlContext.range(10).select(Seq.tabulate(1000) {i => ('id + i).as(s"c$i")} : _*)
  // Make sure the generated code for this plan can compile and execute.
  wideDF.map(_.getLong(0)).collect()
We should still use checkAnswer here because it provides extra debugging info when there is an exception.
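A minimal sketch of the suggested change, assuming the test lives in a suite that mixes in Spark's QueryTest (which provides checkAnswer and prints the plan plus a row diff on failure) and that the suite's SQL implicits are imported as in the original test:

```scala
import org.apache.spark.sql.Row

test("SPARK-14554: Dataset.map may generate wrong java code for wide table") {
  val wideDF = sqlContext.range(10).select(Seq.tabulate(1000) { i => ('id + i).as(s"c$i") } : _*)
  // c0 is just `id`, so mapping out the first column of each row should give 0L..9L;
  // checkAnswer will report the plan and a row-level diff if compilation or execution fails.
  checkAnswer(wideDF.map(_.getLong(0)).toDF(), (0 until 10).map(i => Row(i.toLong)))
}
```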
Do you know why this test case is super slow? It took more than 5 minutes to finish it. Is this expected?
- SPARK-14554: Dataset.map may generate wrong java code for wide table (5 minutes, 20 seconds)
See the link: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59079/consoleFull
It's fixed in #13273.
## What changes were proposed in this pull request?

Address this comment: #12322 (comment)

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #12346 from cloud-fan/tmp.
What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/12047/files#diff-94a1f59bcc9b6758c4ca874652437634R529, we may split the field expression code in `CreateExternalRow` to support wide tables. However, the whole stage codegen framework doesn't support this, because the input for expressions is not always the input row but can be `CodeGenContext.currentVars`, which doesn't work well with `CodeGenContext.splitExpressions` (see the toy sketch below). We do have a check to guard against this case, but it's incomplete: it only checks output fields.
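To make the scoping problem concrete, here is a small toy model in Scala; it does not use Spark's codegen API at all. Expression code that reads only from the input row can be split into helper methods freely, while code that reads caller-local variables, which is roughly what `CodeGenContext.currentVars` amounts to in the generated Java, cannot be moved into a helper without passing every local along:

```scala
// Toy model of the scoping problem, not Spark's actual generated code.
object SplitScopingDemo {
  final case class InputRow(fields: Array[Long])

  // Row-based evaluation: each helper only needs the row, so evaluating a wide
  // schema can be split into many small methods (the splitExpressions style).
  def evalSplit(row: InputRow): Long = part1(row) + part2(row)
  private def part1(row: InputRow): Long = row.fields(0) + 1
  private def part2(row: InputRow): Long = row.fields(1) * 2

  // Variable-based evaluation: the inputs are locals produced by the parent
  // operator (the "currentVars" case). A split-out helper could not see c0 and
  // c1, so this code has to stay inline, and for very wide tables the single
  // generated method can grow past the JVM's per-method bytecode limit.
  def evalInline(): Long = {
    val c0 = 10L
    val c1 = 20L
    (c0 + 1) + (c1 * 2)
  }

  def main(args: Array[String]): Unit = {
    println(evalSplit(InputRow(Array(10L, 20L)))) // 51
    println(evalInline())                         // 51
  }
}
```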
This PR improves the whole stage codegen support check to disable whole stage codegen when there are too many input fields, so that we avoid splitting the field expression code in `CreateExternalRow` under whole stage codegen.

TODO: would it be a better solution to make `CodeGenContext.currentVars` work well with `CodeGenContext.splitExpressions`?

How was this patch tested?
A new test in `DatasetSuite`.
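A minimal sketch of the strengthened check described above, assuming a hypothetical helper name and a hard-coded threshold; the real rule lives in Spark's whole stage codegen planning, reads its limit from a SQL config, and also accounts for nested fields:

```scala
import org.apache.spark.sql.execution.SparkPlan

object CodegenSupportCheck {
  // Hypothetical threshold for illustration; the real limit is configurable.
  private val maxFields = 200

  // Before this PR only the plan's own output width was checked; a plan with a
  // narrow output but a very wide child output (e.g. mapping the first column of
  // a 1000-column table) could still be whole-stage compiled and hit the
  // splitExpressions problem sketched above.
  def supportsWholeStageCodegen(plan: SparkPlan): Boolean = {
    val outputNotTooWide = plan.schema.size <= maxFields
    val inputsNotTooWide = plan.children.forall(_.schema.size <= maxFields)
    outputNotTooWide && inputsNotTooWide
  }
}
```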