@andrewor14
Contributor

@andrewor14 andrewor14 commented Jun 24, 2016

What changes were proposed in this pull request?

This patch makes InMemoryRelation faster by generating code to store the input rows as ColumnarBatches. This code path is enabled by default but only supports primitive types, falling back to the old, slower code path if there are unsupported types (e.g. strings, arrays, UDTs) in the schema.

The old code path reads the input rows into ColumnBuilders, which is slow because these builders are backed by ByteBuffers and there are a lot of virtual function calls involved, especially when compression is enabled.
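To make the contrast concrete, here is a minimal sketch of the two storage styles. None of these classes are Spark's: `LongColumnBuilder`, `ByteBufferLongBuilder` and `LongColumn` are hypothetical stand-ins, assuming a single non-null long column and no compression. The first style appends through an interface into a ByteBuffer; the second writes straight into a primitive array, which is roughly the shape of access the generated code aims for.

```java
import java.nio.ByteBuffer;

public class ColumnStoreSketch {
    // Hypothetical builder-style column: every append is an interface call
    // that writes into a ByteBuffer.
    interface LongColumnBuilder {
        void append(long value);
        ByteBuffer build();
    }

    static final class ByteBufferLongBuilder implements LongColumnBuilder {
        private ByteBuffer buffer = ByteBuffer.allocate(64);

        @Override
        public void append(long value) {
            if (buffer.remaining() < Long.BYTES) {
                // Grow by doubling and copy the bytes written so far.
                ByteBuffer bigger = ByteBuffer.allocate(buffer.capacity() * 2);
                buffer.flip();
                bigger.put(buffer);
                buffer = bigger;
            }
            buffer.putLong(value);
        }

        @Override
        public ByteBuffer build() {
            // Return a read-ready view over the written bytes.
            ByteBuffer result = buffer.duplicate();
            result.flip();
            return result;
        }
    }

    // Hypothetical column-vector-style storage: a plain primitive array
    // addressed by row id, with no buffer bookkeeping per value.
    static final class LongColumn {
        private final long[] values;

        LongColumn(int capacity) {
            values = new long[capacity];
        }

        void putLong(int rowId, long value) {
            values[rowId] = value;
        }

        long getLong(int rowId) {
            return values[rowId];
        }
    }

    // Write the input through the builder, then read it back and sum it.
    static long sumViaBuilder(long[] input) {
        LongColumnBuilder builder = new ByteBufferLongBuilder();
        for (long v : input) {
            builder.append(v);
        }
        ByteBuffer data = builder.build();
        long sum = 0;
        while (data.hasRemaining()) {
            sum += data.getLong();
        }
        return sum;
    }

    // Write the input into the array-backed column, then read it back and sum it.
    static long sumViaColumn(long[] input) {
        LongColumn column = new LongColumn(input.length);
        for (int i = 0; i < input.length; i++) {
            column.putLong(i, input[i]);
        }
        long sum = 0;
        for (int i = 0; i < input.length; i++) {
            sum += column.getLong(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] input = new long[1000];
        for (int i = 0; i < input.length; i++) {
            input[i] = i;
        }
        System.out.println(sumViaBuilder(input) == sumViaColumn(input));
    }
}
```

Both paths store and return the same values; the sketch only illustrates the difference in access pattern and makes no claim about the actual speedup.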

The following numbers are derived from the read path (i.e. returning rows from cached batches in memory). The baseline is the first row. The second and third rows describe caching performance before this patch. The last row describes caching performance after this patch.

Cache random keys:                       Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------
cache = F                                      890 /  920        47.1          21.2       1.0X
cache = T columnar_batches = F compress = F   1950 / 1978        21.5          46.5       0.5X
cache = T columnar_batches = F compress = T   1893 / 1927        22.2          45.1       0.5X
cache = T columnar_batches = T                 540 /  544        77.7          12.9       1.6X
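
As a sanity check on the table, the Relative column is consistent with dividing the baseline's best time by each row's best time. The small sketch below recomputes it from the best times listed above; the class and method names are made up for illustration, and the formula is an assumption inferred from the numbers rather than taken from the benchmark source.

```java
import java.util.Locale;

public class RelativeSpeedup {
    // Assumed formula: baseline best time divided by this row's best time.
    static double relative(double baselineBestMs, double bestMs) {
        return baselineBestMs / bestMs;
    }

    // Format with one decimal place, matching the table's "1.6X" style.
    static String fmt(double ratio) {
        return String.format(Locale.ROOT, "%.1fX", ratio);
    }

    public static void main(String[] args) {
        double baseline = 890;          // cache = F
        double oldUncompressed = 1950;  // columnar_batches = F, compress = F
        double oldCompressed = 1893;    // columnar_batches = F, compress = T
        double columnarBatches = 540;   // columnar_batches = T

        System.out.println(fmt(relative(baseline, oldUncompressed))); // 0.5X
        System.out.println(fmt(relative(baseline, oldCompressed)));   // 0.5X
        System.out.println(fmt(relative(baseline, columnarBatches))); // 1.6X

        // New cached path compared with the old cached paths.
        System.out.println(fmt(relative(oldUncompressed, columnarBatches))); // 3.6X
        System.out.println(fmt(relative(oldCompressed, columnarBatches)));   // 3.5X
    }
}
```

By the same arithmetic, reading from the new cached path is roughly 3.5X to 3.6X faster than reading from the old cached paths in this benchmark.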

Future TODOs (outside the scope of this issue):

  • support compression in the new code path
  • scan should return rows as ColumnarBatch.Rows instead of UnsafeRows so operators downstream can further benefit from the columnar representation

How was this patch tested?

CacheBenchmark, InMemoryColumnarQuerySuite, existing tests

Generated code

Write path: https://gist.github.com/andrewor14/a9ed9d942029457a0f953e809ac26ee9
Read path: https://gist.github.com/andrewor14/7ce4c37a3c6bcd5cc2b6b16c861859e9

Andrew Or added 21 commits June 17, 2016 13:44
Note: spark.table("tab1").collect() doesn't work yet, because
we're trying to cast ColumnarBatch.Row into UnsafeRow. However,
spark.table("tab1").groupBy("i").sum("j").collect() does work.

Previously we could only support schemas where all columns are
Longs because we hardcoded putLong and getLong calls in the write
path. This led to unfathomable NPEs when we tried to cache something
with other types.

This commit fixes this by generalizing the code to build column
batches.
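
One way to picture the generalization described in this commit message: instead of always emitting putLong/getLong, the code generator picks the accessor pair from the column's type. The sketch below is not the patch's actual generator. `PrimitiveType`, `genWrite` and the variable names in the emitted string are hypothetical, and only four types are covered for brevity.

```java
public class AccessorCodegenSketch {
    // Hypothetical subset of supported primitive types; the suffix selects
    // the put*/get* accessor pair to emit.
    enum PrimitiveType {
        BOOLEAN("Boolean"), INT("Int"), LONG("Long"), DOUBLE("Double");

        final String suffix;

        PrimitiveType(String suffix) {
            this.suffix = suffix;
        }
    }

    // Emit one line of generated code that copies field `ordinal` of the
    // input row into the column variable at the current row id.
    static String genWrite(String columnVar, String rowVar, int ordinal, PrimitiveType type) {
        return String.format("%s.put%s(rowId, %s.get%s(%d));",
            columnVar, type.suffix, rowVar, type.suffix, ordinal);
    }

    public static void main(String[] args) {
        PrimitiveType[] schema = {PrimitiveType.INT, PrimitiveType.LONG, PrimitiveType.DOUBLE};
        for (int i = 0; i < schema.length; i++) {
            System.out.println(genWrite("col" + i, "row", i, schema[i]));
        }
    }
}
```

With a type-driven generator like this, a schema containing an unsupported type can be detected up front and routed to the fallback path instead of failing at runtime.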
@andrewor14 andrewor14 changed the title from "[SPARK-16196][SQL] Codegen caching + store rows as ColumnarBatches" to "[SPARK-16196][SQL] Codegen in-memory scan with ColumnarBatches" Jun 24, 2016
@andrewor14
Contributor Author

@rxin @sameeragarwal @ooq

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61203 has finished for PR 13899 at commit c72c085.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 25, 2016

Test build #61206 has finished for PR 13899 at commit 0125aa2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Jun 25, 2016

@andrewor14 Looks interesting.

I created two PRs (#11956, #12894) that generate code similar to yours. My PRs use the current ByteBuffer and support compression for primitive types. Would these PRs help you?

I am waiting for a committer's review of my two PRs.

@kiszk
Member

kiszk commented Aug 30, 2016

jenkins retest this please

@SparkQA

SparkQA commented Aug 30, 2016

Test build #64644 has finished for PR 13899 at commit 0125aa2.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2016

Test build #3237 has finished for PR 13899 at commit 0125aa2.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

Closing for now; too many conflicts.

@andrewor14 andrewor14 closed this Feb 16, 2017