[SPARK-16196][SQL] Codegen in-memory scan with ColumnarBatches #13899
Conversation
Note, this doesn't work: spark.table("tab1").collect(), because we're trying to cast ColumnarBatch.Row into UnsafeRow. This works, however: spark.table("tab1").groupBy("i").sum("j").collect().
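A minimal sketch of the two calls being contrasted (the table setup below is hypothetical, assuming a cached table tab1 with integer columns i and j):

```scala
// Hypothetical setup: a small table with two integer columns.
spark.range(10).selectExpr("id AS i", "id * 2 AS j").write.saveAsTable("tab1")
spark.table("tab1").cache()

// Fails at this point in the patch: the scan yields ColumnarBatch.Row,
// which cannot be cast to UnsafeRow when rows are collected directly.
spark.table("tab1").collect()

// Works: the aggregate consumes the columnar rows before collect().
spark.table("tab1").groupBy("i").sum("j").collect()
```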
Previously we could only support schemas where all columns are Longs because we hardcoded putLong and getLong calls in the write path. This led to unfathomable NPEs when trying to cache something with other types. This commit fixes that by generalizing the code that builds column batches.
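To illustrate the problem and the fix, here is a hand-written sketch in the spirit of the generated write path; the real generated output is in the gists linked at the bottom, and the helper names here are illustrative only:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.vectorized.ColumnarBatch

// Illustrative only: the old hardcoded-long writer, which breaks on any
// schema whose columns are not all longs.
def writeRowOld(batch: ColumnarBatch, rowId: Int, row: InternalRow): Unit = {
  batch.column(0).putLong(rowId, row.getLong(0)) // wrong unless col 0 is a long
  batch.column(1).putLong(rowId, row.getLong(1)) // wrong unless col 1 is a long
}

// Illustrative only: the generalized writer dispatches on each column's
// actual type, mirroring what the generated code now does for an
// (int, double) schema.
def writeRowNew(batch: ColumnarBatch, rowId: Int, row: InternalRow): Unit = {
  batch.column(0).putInt(rowId, row.getInt(0))       // IntegerType column
  batch.column(1).putDouble(rowId, row.getDouble(1)) // DoubleType column
}
```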
Test build #61203 has finished for PR 13899 at commit

Test build #61206 has finished for PR 13899 at commit
@andrewor14 Looks interesting. I created two PRs (#11956, #12894) that generate code similar to yours. My PRs use the current … I am waiting for a committer's review of my two PRs.
jenkins retest this please
Test build #64644 has finished for PR 13899 at commit
Test build #3237 has finished for PR 13899 at commit
Closing for now; too many conflicts.
What changes were proposed in this pull request?
This patch makes InMemoryRelation faster by generating code to store the input rows as ColumnarBatches. This code path is enabled by default but only supports primitive types, falling back to the old, slower code path if there are unsupported types (e.g. strings, arrays, UDTs) in the schema.

The old code path reads the input rows into ColumnBuilders, which is slow because these builders are backed by ByteBuffers and involve many virtual function calls, especially when compression is enabled.

The following numbers are derived from the read path (i.e. returning rows from cached batches in memory). The baseline is the first row. The second and third rows describe caching performance before this patch. The last row describes caching performance after this patch.
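As described above, the columnar path only covers primitive schemas. A minimal sketch of the kind of schema check this fallback implies (the helper name and exact type list are assumptions, not the patch's actual code):

```scala
import org.apache.spark.sql.types._

// Hypothetical predicate: use the codegen'd ColumnarBatch path only when
// every column is a primitive type; otherwise fall back to ColumnBuilders.
def supportsColumnarCache(schema: StructType): Boolean =
  schema.fields.forall { field =>
    field.dataType match {
      case BooleanType | ByteType | ShortType | IntegerType |
           LongType | FloatType | DoubleType => true
      case _ => false // e.g. strings, arrays, UDTs take the old path
    }
  }
```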
Future TODOs (outside the scope of this issue):

- Return ColumnarBatch.Rows instead of UnsafeRows so operators downstream can further benefit from the columnar representation

How was this patch tested?
CacheBenchmark, InMemoryColumnarQuerySuite, existing tests
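For context, a rough sketch of how one might time the read path the way CacheBenchmark does (this is not the benchmark's actual code; the data size and query are made up):

```scala
// Hypothetical micro-benchmark: materialize a cached table of primitives,
// then time a scan that reads rows back out of the cached batches.
val df = spark.range(1 << 24).selectExpr("id AS i", "id * 3 AS j")
df.cache()
df.count() // force the cache to be built (write path)

val start = System.nanoTime()
df.selectExpr("sum(i)", "sum(j)").collect() // read path over cached batches
println(s"Read path took ${(System.nanoTime() - start) / 1e6} ms")
```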
Generated code

Write path: https://gist.github.com/andrewor14/a9ed9d942029457a0f953e809ac26ee9
Read path: https://gist.github.com/andrewor14/7ce4c37a3c6bcd5cc2b6b16c861859e9