
Conversation

@liancheng
Contributor

When reading Parquet string and binary-backed decimal values, Parquet's Binary.getBytes always returns a copied byte array, which is unnecessary. Since the underlying implementation of these Binary values is guaranteed to be ByteArraySliceBackedBinary, and Parquet itself never reuses the underlying byte arrays, we can use Binary.toByteBuffer.array() to steal the underlying byte arrays without copying them.

This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that this trick doesn't cover binary-backed decimals with precision greater than 18.
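To make the trick concrete, here's a minimal Scala sketch (binaryToUTF8String is a hypothetical helper, not code from this PR), assuming the Binary is backed by a heap byte array so that toByteBuffer wraps the original array instead of copying it:

```scala
import java.nio.ByteBuffer

import org.apache.parquet.io.api.Binary
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper: read a Parquet Binary into a UTF8String without
// copying. `getBytes` would copy; `toByteBuffer` just wraps the backing
// array for heap-backed binaries, so `array()` exposes it directly.
def binaryToUTF8String(value: Binary): UTF8String = {
  val buffer: ByteBuffer = value.toByteBuffer
  // The buffer may be a slice of a larger page buffer, so honor its
  // position and remaining length instead of using the whole array.
  UTF8String.fromBytes(buffer.array(), buffer.position(), buffer.remaining())
}
```

This is only safe because, as noted above, Parquet never reuses the backing arrays; otherwise the UTF8String would observe later mutations.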

In my micro-benchmark, this brings a ~15% performance boost when scanning the TPC-DS store_sales table (scale factor 15).

Another minor optimization in this PR: Decimal.toJavaBigDecimal now constructs a Java BigDecimal directly instead of constructing a Scala BigDecimal first. This brings another ~5% performance gain.
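A minimal sketch of the idea (the standalone parameters unscaled and scale stand in for Decimal's internal compact representation, which holds decimals with precision up to 18 as an unscaled Long):

```scala
import java.math.{BigDecimal => JavaBigDecimal}

// Sketch: convert the compact (unscaled Long, scale) representation to a
// Java BigDecimal. The old path built a Scala BigDecimal wrapper first
// and then unwrapped it via `underlying()`; valueOf skips that extra
// allocation entirely.
def toJavaBigDecimal(unscaled: Long, scale: Int): JavaBigDecimal =
  JavaBigDecimal.valueOf(unscaled, scale)
```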

@liancheng
Contributor Author

For decimals whose precision is greater than 18, we still need to copy the byte array to construct Java BigInteger instances.
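For illustration, a sketch of this path (binaryToDecimal is hypothetical, not the PR's exact code): the unscaled value of such a decimal doesn't fit in a Long, and BigInteger's byte-array constructor defensively copies its argument, so the copy made by getBytes can't be avoided here.

```scala
import java.math.{BigDecimal => JavaBigDecimal, BigInteger}

import org.apache.parquet.io.api.Binary

// Hypothetical sketch for precision > 18: Parquet stores the unscaled
// value as big-endian two's-complement bytes, which is exactly what
// BigInteger's byte-array constructor expects. BigInteger copies the
// array internally anyway, so stealing the backing array gains nothing.
def binaryToDecimal(value: Binary, scale: Int): JavaBigDecimal =
  new JavaBigDecimal(new BigInteger(value.getBytes), scale)
```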

@liancheng
Contributor Author

Hm, after some profiling, I found that stealing byte arrays via ByteBuffer is actually a little faster than the ByteArrayThief hack I employed in the 2nd commit, probably because of the virtual dispatch cost of OutputStream.write. Will revert my last commit.
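For context, a hypothetical reconstruction of the reverted hack (the actual ByteArrayThief from the 2nd commit may have differed): an OutputStream that records the array reference Binary.writeTo hands it, instead of copying the bytes.

```scala
import java.io.OutputStream

import org.apache.parquet.io.api.Binary

// Hypothetical reconstruction: an OutputStream that "steals" the array
// reference passed to it. Every call still goes through a virtual
// dispatch on OutputStream.write, which profiling suggested makes this
// slower than the toByteBuffer approach.
class ByteArrayThief extends OutputStream {
  var stolen: Array[Byte] = _

  override def write(bytes: Array[Byte]): Unit = stolen = bytes
  // A real version would also record `off`/`len` for sliced binaries.
  override def write(bytes: Array[Byte], off: Int, len: Int): Unit = stolen = bytes
  override def write(b: Int): Unit =
    throw new UnsupportedOperationException("single-byte writes not expected")
}

def stealBytes(value: Binary): Array[Byte] = {
  val thief = new ByteArrayThief
  value.writeTo(thief)
  thief.stolen
}
```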

@SparkQA

SparkQA commented Sep 24, 2015

Test build #42979 has finished for PR 8907 at commit b99158e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2015

Test build #42984 has finished for PR 8907 at commit d59daf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

The last commit brings another ~5% performance boost. We should probably just use Java BigDecimal directly within Catalyst Decimal. I guess this would make decimals with large precisions (> 18) a little bit faster. But that can be done in a separate PR.

@SparkQA

SparkQA commented Sep 25, 2015

Test build #42985 has finished for PR 8907 at commit 6d85e69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 25, 2015

Test build #42992 has finished for PR 8907 at commit 851f91f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Originally, we always constructed a Scala BigDecimal first and then retrieved the underlying Java BigDecimal. Here we just create the Java one directly.

@davies
Contributor

davies commented Sep 29, 2015

LGTM

@cloud-fan
Contributor

LGTM

@liancheng
Contributor Author

@davies @cloud-fan Thanks for the review! I'm merging this to master.

@asfgit asfgit closed this in 4d5a005 Sep 30, 2015
@liancheng liancheng deleted the spark-10811/eliminate-array-copying branch September 30, 2015 06:36
