
Conversation

@liancheng
Contributor

When reading Parquet string and binary-backed decimal values, Parquet's Binary.getBytes always returns a copied byte array, which is unnecessary. Since the underlying implementation of these Binary values is guaranteed to be ByteArraySliceBackedBinary, and Parquet itself never reuses the underlying byte arrays, we can use Binary.toByteBuffer.array() to steal the underlying byte arrays without copying them.

This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that this trick doesn't cover binary-backed decimals with precision greater than 18.
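To make the trick concrete, here's a minimal Scala sketch (binaryToUTF8String is a hypothetical helper, not code from this PR), assuming the Binary is backed by a heap byte array so that toByteBuffer wraps the original array instead of copying it:

```scala
import java.nio.ByteBuffer

import org.apache.parquet.io.api.Binary
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper: read a Parquet Binary into a UTF8String without
// copying. `getBytes` would copy; `toByteBuffer` just wraps the backing
// array for heap-backed binaries, so `array()` exposes it directly.
def binaryToUTF8String(value: Binary): UTF8String = {
  val buffer: ByteBuffer = value.toByteBuffer
  // The buffer may be a slice of a larger page buffer, so honor its
  // position and remaining length instead of using the whole array.
  UTF8String.fromBytes(buffer.array(), buffer.position(), buffer.remaining())
}
```

This is only safe because, as noted above, Parquet never reuses the backing arrays; otherwise the UTF8String would observe later mutations.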

In my micro-benchmark, this brings a ~15% performance boost when scanning the TPC-DS store_sales table (scale factor 15).

Another minor optimization in this PR: Decimal.toJavaBigDecimal now constructs a Java BigDecimal directly instead of constructing a Scala BigDecimal first. This brings another ~5% performance gain.
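A minimal sketch of the idea (the standalone parameters unscaled and scale stand in for Decimal's internal compact representation, which holds decimals with precision up to 18 as an unscaled Long):

```scala
import java.math.{BigDecimal => JavaBigDecimal}

// Sketch: convert the compact (unscaled Long, scale) representation to a
// Java BigDecimal. The old path built a Scala BigDecimal wrapper first
// and then unwrapped it via `underlying()`; valueOf skips that extra
// allocation entirely.
def toJavaBigDecimal(unscaled: Long, scale: Int): JavaBigDecimal =
  JavaBigDecimal.valueOf(unscaled, scale)
```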

@liancheng
Contributor Author

For decimals whose precision is greater than 18, we still need to copy the byte array to construct Java BigInteger instances.
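For illustration, a sketch of this path (binaryToDecimal is hypothetical, not the PR's exact code): the unscaled value of such a decimal doesn't fit in a Long, and BigInteger's byte-array constructor defensively copies its argument, so the copy made by getBytes can't be avoided here.

```scala
import java.math.{BigDecimal => JavaBigDecimal, BigInteger}

import org.apache.parquet.io.api.Binary

// Hypothetical sketch for precision > 18: Parquet stores the unscaled
// value as big-endian two's-complement bytes, which is exactly what
// BigInteger's byte-array constructor expects. BigInteger copies the
// array internally anyway, so stealing the backing array gains nothing.
def binaryToDecimal(value: Binary, scale: Int): JavaBigDecimal =
  new JavaBigDecimal(new BigInteger(value.getBytes), scale)
```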

@liancheng
Contributor Author

Hm, after some profiling, I found that stealing byte arrays via ByteBuffer is actually a little faster than the ByteArrayThief hack I employed in the 2nd commit, probably because of the virtual dispatch cost of OutputStream.write. Will revert my last commit.
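For context, a hypothetical reconstruction of the reverted hack (the actual ByteArrayThief from the 2nd commit may have differed): an OutputStream that records the array reference Binary.writeTo hands it, instead of copying the bytes.

```scala
import java.io.OutputStream

import org.apache.parquet.io.api.Binary

// Hypothetical reconstruction: an OutputStream that "steals" the array
// reference passed to it. Every call still goes through a virtual
// dispatch on OutputStream.write, which profiling suggested makes this
// slower than the toByteBuffer approach.
class ByteArrayThief extends OutputStream {
  var stolen: Array[Byte] = _

  override def write(bytes: Array[Byte]): Unit = stolen = bytes
  // A real version would also record `off`/`len` for sliced binaries.
  override def write(bytes: Array[Byte], off: Int, len: Int): Unit = stolen = bytes
  override def write(b: Int): Unit =
    throw new UnsupportedOperationException("single-byte writes not expected")
}

def stealBytes(value: Binary): Array[Byte] = {
  val thief = new ByteArrayThief
  value.writeTo(thief)
  thief.stolen
}
```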

@SparkQA

SparkQA commented Sep 24, 2015

Test build #42979 has finished for PR 8907 at commit b99158e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2015

Test build #42984 has finished for PR 8907 at commit d59daf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

The last commit brings another ~5% performance boost. We should probably just use Java BigDecimal directly within Catalyst Decimal. I guess this would make decimals with large precisions (> 18) a little bit faster. But that can be done in a separate PR.

@SparkQA

SparkQA commented Sep 25, 2015

Test build #42985 has finished for PR 8907 at commit 6d85e69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 25, 2015

Test build #42992 has finished for PR 8907 at commit 851f91f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Originally, we always constructed a Scala BigDecimal first and then retrieved the underlying Java BigDecimal. Here we just create the Java one directly.

@davies
Contributor

davies commented Sep 29, 2015

LGTM

@cloud-fan
Contributor

LGTM

@liancheng
Contributor Author

@davies @cloud-fan Thanks for the review! I'm merging this to master.

@asfgit asfgit closed this in 4d5a005 Sep 30, 2015
@liancheng liancheng deleted the spark-10811/eliminate-array-copying branch September 30, 2015 06:36
