[SPARK-13404] [SQL] Create variables for input row when it's actually used #11274

davies · 2016-02-19T18:01:46Z

What changes were proposed in this pull request?

This PR changes how we generate the code for the output variables passed from a plan to its parent.

Right now, they are generated before calling consume() of its parent. This is inefficient: if the parent is a Filter or Join, which could filter out most of the rows, the time spent accessing columns that the Filter or Join does not use is wasted.

This PR tries to improve this by deferring the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate its output variables; it just passes the ExprCode to its parent via consume(). parent.consumeChild() then checks the child's output against usedInputs and generates the code for those columns that are part of usedInputs before calling doConsume().
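To make the effect of deferring column access concrete, here is a minimal Java sketch. It is not the code Spark generates: `readColumn`, the `int[][]` row layout, and the `key == 28` predicate are hypothetical stand-ins chosen for illustration. The eager variant reads every output column before the filter runs, while the deferred variant reads a column only once something actually uses it.

```java
final class DeferredColumnAccess {
    // Counts simulated column reads so the two strategies can be compared.
    static int columnReads = 0;

    // Hypothetical stand-in for reading one column of one row.
    static int readColumn(int[][] rows, int row, int col) {
        columnReads++;
        return rows[row][col];
    }

    // Eager: materialize all output columns, then let the parent filter decide.
    static long eagerSum(int[][] rows) {
        long sum = 0;
        for (int r = 0; r < rows.length; r++) {
            int key = readColumn(rows, r, 0);
            int price = readColumn(rows, r, 1); // wasted when the row is filtered out
            if (key != 28) continue;
            sum += price;
        }
        return sum;
    }

    // Deferred: read only what the filter uses, and the rest for surviving rows.
    static long deferredSum(int[][] rows) {
        long sum = 0;
        for (int r = 0; r < rows.length; r++) {
            int key = readColumn(rows, r, 0);
            if (key != 28) continue;
            int price = readColumn(rows, r, 1); // only reached for rows that pass
            sum += price;
        }
        return sum;
    }
}
```

With four rows of which two pass the filter, the eager variant performs 8 reads and the deferred one 6, and both return the same sum.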

This PR also changes the if from

if (cond) {
  xxx
}

to

if (!cond) continue;
xxx

The new form helps reduce the nesting depth when there are multiple levels of Filter and BroadcastHashJoin.

It also adds some comments for operators.
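As an illustration of why the second form nests less, the following Java sketch applies two predicates in both styles. The class and method names are hypothetical, and the code generated for Filter and BroadcastHashJoin is more involved than this.

```java
import java.util.ArrayList;
import java.util.List;

final class GuardContinue {
    // Nested form: every predicate adds one level of indentation.
    static List<Integer> nested(int[] values) {
        List<Integer> out = new ArrayList<>();
        for (int v : values) {
            if (v > 0) {
                if (v % 2 == 0) {
                    out.add(v);
                }
            }
        }
        return out;
    }

    // Flat form: each predicate becomes a guard, and the body stays at one level.
    static List<Integer> flat(int[] values) {
        List<Integer> out = new ArrayList<>();
        for (int v : values) {
            if (!(v > 0)) continue;
            if (!(v % 2 == 0)) continue;
            out.add(v);
        }
        return out;
    }
}
```

Both methods select the same elements; only the shape of the loop body differs.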

How was this patch tested?

Unit tests. Manually ran TPCDS Q55; this PR improves performance by about 30% (scale=10, from 2.56s to 1.96s).

davies · 2016-02-19T18:44:21Z

Here is the query for Q55, followed by a gist of the generated code:

select i_brand_id brand_id, i_brand brand,
    sum(ss_ext_sales_price) ext_price
 from date_dim, store_sales, item
 where d_date_sk = ss_sold_date_sk
    and ss_item_sk = i_item_sk
    and i_manager_id=28
    and d_moy=11
    and d_year=1999
 group by i_brand, i_brand_id
 order by ext_price desc, brand_id
 limit 100

https://gist.github.com/davies/b68bf530eefa9d15b43e

SparkQA · 2016-02-19T19:24:33Z

Test build #51564 has finished for PR 11274 at commit 8fd2ceb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-19T20:18:48Z

Test build #51569 has finished for PR 11274 at commit c92e457.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-02-22T22:07:18Z

This PR makes codegen harder to understand (and easier to get wrong), and the performance improvement is not that large currently (maybe because the generated part does not account for the majority of the time). I'd like to hold this PR for a while.

SparkQA · 2016-02-22T22:26:15Z

Test build #51668 has finished for PR 11274 at commit 0cf8ad0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-24T00:39:32Z

Test build #51814 has finished for PR 11274 at commit 4faf5f9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StorageStatusListener(conf: SparkConf) extends SparkListener
- public class JavaDocument implements Serializable
- public class JavaEstimatorTransformerParamExample
- public class JavaLabeledDocument extends JavaDocument implements Serializable
- public class JavaModelSelectionViaCrossValidationExample
- public class JavaModelSelectionViaTrainValidationSplitExample
- public class JavaPipelineExample
- public class JavaPCAExample
- public class JavaSVDExample
- public abstract class BufferedRowIterator
- class QuantileSummaries(
- case class Stats(value: Double, g: Int, delta: Int)

SparkQA · 2016-02-25T01:49:38Z

Test build #51912 has finished for PR 11274 at commit ef9e8f3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-25T10:19:00Z

Test build #51948 has finished for PR 11274 at commit ca8fe0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T08:31:03Z

Test build #2595 has finished for PR 11274 at commit ca8fe0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T08:38:10Z

Test build #52224 has finished for PR 11274 at commit 1a1452e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T08:43:04Z

Test build #2594 has finished for PR 11274 at commit ca8fe0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

This PR defers the resolution of a dictionary id to its value until the column is actually accessed (inside getInt/getLong). This is very useful for columns and rows that are filtered out. It's also useful for the binary type, since we will not need to copy all the byte arrays.

This PR also changes the underlying type for small decimals that fit within an Int, in order to use getInt() to look up the value from IntDictionary.

## How was this patch tested?

Manually tested TPCDS Q7 with scale factor 10 and saw about a 30% improvement (after PR #11274).

Author: Davies Liu

Closes #11437 from davies/decode_dict.
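The idea of decoding a dictionary id only on access can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not the Parquet reader's code: `LazyDictionaryColumn`, its fields, and `decodeAll` are hypothetical names, and the dictionary is a plain `int[]`.

```java
final class LazyDictionaryColumn {
    private final int[] dictionary; // dictionary id -> decoded value
    private final int[] ids;        // one dictionary id per row
    int lookups = 0;                // counts dictionary lookups

    LazyDictionaryColumn(int[] dictionary, int[] ids) {
        this.dictionary = dictionary;
        this.ids = ids;
    }

    // Deferred decoding: the lookup happens only when a row is actually read.
    int getInt(int rowId) {
        lookups++;
        return dictionary[ids[rowId]];
    }

    // Eager alternative for contrast: decodes every row up front.
    int[] decodeAll() {
        int[] out = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            lookups++;
            out[i] = dictionary[ids[i]];
        }
        return out;
    }
}
```

Rows that a filter discards never call `getInt`, so in this model they cost no lookup at all.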

davies · 2016-03-01T22:26:27Z

@nongli This one seems important for us now, could you start to review it?

nongli · 2016-03-04T19:49:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala

+   */
+  def usedInputs: AttributeSet = references
+
  /**


Can you comment what the semantics are if row == null vs not null?

SparkQA · 2016-03-04T23:55:06Z

Test build #52495 has finished for PR 11274 at commit 76ca6c6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBase<Object>

SparkQA · 2016-03-05T07:07:58Z

Test build #52507 has finished for PR 11274 at commit ffc9d8c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-03-05T07:12:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala

       | private void $scanRows(InternalRow $row) throws java.io.IOException {
-       |   while (true) {
+       |   boolean firstRow = true;
+       |   while (!shouldStop() && (firstRow || $input.hasNext())) {


@nongli Since we changed to use continue for predicates, it's tricky to get this right.
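The loop header in the diff above can be read alongside a small Java sketch. Only the header shape comes from the patch; the loop body, the `limit` parameter standing in for shouldStop(), and the even-number predicate are assumptions made for illustration. The point is that the next row is fetched at the top of each iteration, so a `continue` issued by a predicate cannot skip the advance and spin on the same row.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ScanLoopSketch {
    // `first` is a row that was already fetched before the loop starts.
    static List<Integer> scanRows(Integer first, Iterator<Integer> input, int limit) {
        List<Integer> out = new ArrayList<>();
        Integer row = first;
        boolean firstRow = true;
        // out.size() >= limit plays the role of shouldStop() in this sketch.
        while (out.size() < limit && (firstRow || input.hasNext())) {
            if (firstRow) {
                firstRow = false;   // consume the pre-fetched row
            } else {
                row = input.next(); // advance before any predicate runs
            }
            if (row % 2 != 0) continue; // predicate: skip odd rows
            out.add(row);
        }
        return out;
    }
}
```

If the advance sat at the end of the body instead, every rejected row would jump past it, which is the kind of mistake the comment warns about.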

nongli · 2016-03-07T23:07:16Z

@davies can you update the gist with the new output?

nongli · 2016-03-07T23:28:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala

+    * Returns source code to evaluate all the variables, and clear the code of them, to prevent
+    * them to be evaluated twice.
+    */
+  protected def evaluateVariables(variables: Seq[ExprCode]): String = {


Can you update the comment for ExprCode.code to specify what it means when it is empty?
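A minimal Java sketch of the behavior described in the doc comment above: return the pending code once, then clear it so a later call emits nothing. The `ExprCode` class here is a simplified stand-in with only two fields, and reading an empty `code` as "already evaluated" is an assumption, which is the meaning this review comment asks to have documented.

```java
import java.util.ArrayList;
import java.util.List;

final class EvaluateVariablesSketch {
    // Simplified stand-in: `code` computes the variable, `value` names it.
    static final class ExprCode {
        String code;
        final String value;

        ExprCode(String code, String value) {
            this.code = code;
            this.value = value;
        }
    }

    // Returns the pending code of all variables and clears it, so that a
    // second call emits nothing and no variable is evaluated twice.
    static String evaluateVariables(List<ExprCode> variables) {
        List<String> pending = new ArrayList<>();
        for (ExprCode ev : variables) {
            if (!ev.code.isEmpty()) {
                pending.add(ev.code.trim());
                ev.code = ""; // empty code is read here as "already evaluated"
            }
        }
        return String.join("\n", pending);
    }
}
```

The `value` field stays usable after the code is cleared, which lets a parent refer to a variable that was evaluated earlier.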

nongli · 2016-03-07T23:37:17Z

Patch looks good. Just a few documentation suggestions.

davies · 2016-03-08T00:23:22Z

@nongli I have updated the doc and commit message (PR description). Once it passes the tests, I will merge this into master.

nongli · 2016-03-08T01:00:20Z

sounds good.

SparkQA · 2016-03-08T01:35:55Z

Test build #52612 has finished for PR 11274 at commit f431170.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-03-08T04:08:48Z

Merging this into master, thanks!

## What changes were proposed in this pull request?

This PR rolls back some changes in apache#11274, which introduced a performance regression when doing a simple aggregation on a parquet scan with one integer column.

I do not really understand how this change introduces such a huge impact; it may be related to how the JIT compiler inlines functions (I saw very different stats from profiling).

## How was this patch tested?

Manually ran the parquet reader benchmark. Before this change:

```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
Int and String Scan:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized               2391 / 3107         43.9          22.8       1.0X
```

After this change:

```
Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
Int and String Scan:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized               2032 / 2626         51.6          19.4       1.0X
```

Author: Davies Liu

Closes apache#11912 from davies/fix_regression.

srowen · 2018-08-02T01:38:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala

+      required: AttributeSet): String = {
+    var evaluateVars = ""
+    variables.zipWithIndex.foreach { case (ev, i) =>
+      if (ev.code != "" && required.contains(attributes(i))) {


@davies I was just reviewing build warnings, and it flags this line. ev.code is a Block rather than String. Should it be ev.code.nonEmpty && ... instead?
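The warning can be reproduced in miniature. In this Java sketch, `Block` is a hypothetical wrapper rather than Spark's class, and Java's `equals` plays the role of Scala's `!=`: a wrapper object never equals a String literal, so the original guard is always true even when the wrapped code is empty.

```java
final class BlockComparisonSketch {
    // Hypothetical wrapper around generated code text.
    static final class Block {
        private final String text;

        Block(String text) {
            this.text = text;
        }

        boolean isEmpty() {
            return text.isEmpty();
        }
    }

    // Mirrors `ev.code != ""`: compares a Block with a String, so it is always true.
    static boolean buggyHasCode(Block code) {
        return !code.equals("");
    }

    // Mirrors the suggested `ev.code.nonEmpty`: asks the Block itself.
    static boolean fixedHasCode(Block code) {
        return !code.isEmpty();
    }
}
```

With the buggy guard, an already-cleared variable would still be treated as pending; the emptiness check restores the intended skip.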

davies force-pushed the gen_defer branch from 37e5560 to 8fd2ceb Compare February 19, 2016 18:03

davies changed the title ~~[SPARK-13404] improve codegen~~ [SPARK-13404] [SQL] Create variables for input row when it's actually used Feb 19, 2016

improve codegen

c92e457

davies force-pushed the gen_defer branch from 8fd2ceb to c92e457 Compare February 19, 2016 18:42

Davies Liu added 2 commits February 22, 2016 12:40

Merge branch 'master' of github.com:apache/spark into gen_defer

e58c8a6

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala

Defer the evaluation of expresssions in Project

f6139e6

davies force-pushed the gen_defer branch from 0cf8ad0 to f6139e6 Compare February 22, 2016 21:57

davies mentioned this pull request Feb 23, 2016

[SPARK-13123][SQL] Implement whole state codegen for sort. #11008

Closed

Davies Liu added 2 commits February 23, 2016 15:22

Merge branch 'master' of github.com:apache/spark into gen_defer

4faf5f9

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala

fix broadcast hash join

4fb0bc8

fix bug

ef9e8f3

Davies Liu added 2 commits February 25, 2016 00:15

Merge branch 'master' of github.com:apache/spark into gen_defer

705da3f

fix aggregate

ca8fe0f

Davies Liu added 2 commits February 26, 2016 15:31

Merge branch 'master' of github.com:apache/spark into gen_defer

853502e

Merge branch 'master' of github.com:apache/spark into gen_defer

1a1452e

davies mentioned this pull request Feb 29, 2016

[SPARK-13582] [SQL] defer dictionary decoding in parquet reader #11437

Closed

davies mentioned this pull request Mar 4, 2016

[SPARK-13636][SQL] Directly consume UnsafeRow in wholestage codegen plans #11484

Closed

nongli reviewed Mar 4, 2016

Davies Liu added 2 commits March 4, 2016 15:01

Merge branch 'master' of github.com:apache/spark into gen_defer

62682b2

Merge branch 'master' of github.com:apache/spark into gen_defer

76ca6c6

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala

fix tests

ffc9d8c

davies reviewed Mar 5, 2016

nongli reviewed Mar 7, 2016

improve docs

f431170

asfgit closed this in 25bba58 Mar 8, 2016

davies mentioned this pull request Mar 23, 2016

[SPARK-14092] [SQL] move shouldStop() to end of while loop #11912

Closed

srowen reviewed Aug 2, 2018
