[SPARK-10811] [SQL] Eliminates unnecessary byte array copying #8907
For decimals whose precision is greater than 18, we still need to copy the byte array anyway to construct Java BigInteger instances.
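To illustrate the distinction drawn in the comment above, here is a minimal sketch, not the PR's actual code. It assumes a big-endian two's-complement encoding of the unscaled value stored in a slice of a larger array; the class and method names (`BinaryDecimalSketch`, `unscaledLong`, `decode`) are hypothetical. For precision up to 18 the unscaled value fits in a `long` and can be read in place; above that, the slice is copied so it can be handed to the `BigInteger(byte[])` constructor.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical illustration; names are not taken from Spark or Parquet.
final class BinaryDecimalSketch {
    // Reads a big-endian two's-complement slice of 1 to 8 bytes as a long,
    // without copying the slice out of the backing array.
    static long unscaledLong(byte[] buf, int offset, int length) {
        long unscaled = 0L;
        for (int i = 0; i < length; i++) {
            unscaled = (unscaled << 8) | (buf[offset + i] & 0xFF);
        }
        // Sign-extend from the top bit of the slice.
        int shift = 64 - 8 * length;
        return (unscaled << shift) >> shift;
    }

    static BigDecimal decode(byte[] buf, int offset, int length, int precision, int scale) {
        if (precision <= 18) {
            // Any value with at most 18 decimal digits fits in a long: no copy needed.
            return BigDecimal.valueOf(unscaledLong(buf, offset, length), scale);
        }
        // BigInteger(byte[]) consumes a whole array, so the slice is copied first.
        byte[] copy = Arrays.copyOfRange(buf, offset, offset + length);
        return new BigDecimal(new BigInteger(copy), scale);
    }
}
```

For example, under these assumptions the three bytes `0x00 0x30 0x39` with scale 2 decode to `123.45` on the copy-free path.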
|
Hm, after some profiling, I found that stealing byte arrays using |
This reverts commit d59daf6.
|
Test build #42979 has finished for PR 8907 at commit
|
|
Test build #42984 has finished for PR 8907 at commit
|
|
The last commit brings another ~5% performance boost. We should probably just use Java |
|
Test build #42985 has finished for PR 8907 at commit
|
|
Test build #42992 has finished for PR 8907 at commit
|
Originally, we always constructed a Scala BigDecimal first, and then retrieved the underlying Java BigDecimal. Here we just create the Java one directly.
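The direct construction described above can be sketched in plain Java as follows. This is an illustration under assumptions, not Spark's `Decimal` class: `CompactDecimalSketch` is a hypothetical holder that stores either an unscaled `long` plus a scale, or an already-built `java.math.BigDecimal` for values that do not fit in a `long`.

```java
import java.math.BigDecimal;

// Hypothetical stand-in for a decimal that keeps a compact long form when possible.
final class CompactDecimalSketch {
    private final BigDecimal big;   // non-null only for values too wide for a long
    private final long unscaled;    // used when big is null
    private final int scale;

    CompactDecimalSketch(long unscaled, int scale) {
        this.big = null;
        this.unscaled = unscaled;
        this.scale = scale;
    }

    CompactDecimalSketch(BigDecimal big) {
        this.big = big;
        this.unscaled = 0L;
        this.scale = big.scale();
    }

    // Builds the Java BigDecimal straight from (unscaled, scale),
    // with no intermediate wrapper object in the compact case.
    BigDecimal toJavaBigDecimal() {
        return big != null ? big : BigDecimal.valueOf(unscaled, scale);
    }
}
```

`BigDecimal.valueOf(long, int)` is the standard JDK factory for this; the saving in the compact case comes from skipping the allocation of a wrapper that is immediately unwrapped.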
|
LGTM |
|
LGTM |
|
@davies @cloud-fan Thanks for the review! I'm merging this to master. |
When reading Parquet string and binary-backed decimal values, Parquet `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying implementation of `Binary` values there is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them.

This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that this trick doesn't cover binary-backed decimals with precision greater than 18.
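The difference between copying and "stealing" can be shown with `java.nio.ByteBuffer` alone. The sketch below does not use the Parquet API; it assumes a heap buffer wrapping a slice of a larger array, which is roughly what a slice-backed binary value exposes, and the names `ByteStealingSketch`, `copyBytes`, and `stealBytes` are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration using only the JDK; not Parquet's Binary class.
final class ByteStealingSketch {
    // Copying accessor: allocates a new array and copies the slice into it.
    static byte[] copyBytes(ByteBuffer slice) {
        byte[] out = new byte[slice.remaining()];
        slice.duplicate().get(out);   // duplicate() leaves the caller's position untouched
        return out;
    }

    // Stealing accessor: returns the backing array itself, with no allocation.
    // The caller must honor sliceOffset(slice) and slice.remaining(),
    // because array() exposes the whole backing array, not just the slice.
    static byte[] stealBytes(ByteBuffer slice) {
        return slice.array();
    }

    // Index of the slice's first byte within the backing array.
    static int sliceOffset(ByteBuffer slice) {
        return slice.arrayOffset() + slice.position();
    }
}
```

Stealing is only safe when nothing else mutates or reuses the backing array afterwards, which is the guarantee the description relies on. `array()` also throws for direct or read-only buffers, so the approach depends on the buffer being heap-backed.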
In my micro-benchmark, this brings a ~15% performance boost for scanning the TPC-DS `store_sales` table (scale factor 15).

Another minor optimization in this PR is that we now construct a Java `BigDecimal` directly in `Decimal.toJavaBigDecimal` without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain.