[SPARK-13582] [SQL] defer dictionary decoding in parquet reader #11437
|
cc @nongli |
|
Test build #52202 has finished for PR 11437 at commit
|
|
Test build #52205 has finished for PR 11437 at commit
|
|
Test build #52206 has finished for PR 11437 at commit
|
|
Test build #52207 has finished for PR 11437 at commit
|
|
Test build #2593 has finished for PR 11437 at commit
|
int num = Math.min(total, leftInPage);
if (useDictionary) {
  // Data is dictionary encoded. We will vector decode the ids and then resolve the values.
  if (dictionaryIds == null) {
Remove dictionaryIds from this class.
|
Can you run the ColumnarBatch/ParquetRead benchmark? Does this have perf problems if there is no dictionary or there is no filter? |
|
@nongli There is no visible difference on any of the existing benchmarks (ColumnarBatch and ParquetRead), because they don't use dictionary encoding. After changing intStringScan to use dictionary encoding (a small number of unique values), here is the result: Before this patch After the patch We can see a 10% improvement on SQL Parquet Vectorized, but no difference on ParquetReader; I don't know why. (I didn't include #11274.) |
|
Cool. Lgtm |
|
Test build #52251 has finished for PR 11437 at commit
|
|
Merging this into master. |
## What changes were proposed in this pull request?

This PR defers resolving a dictionary id to its value until the column is actually accessed (inside getInt/getLong). This is very useful for columns and rows that are filtered out, since their values never need to be decoded. It is also useful for the binary type, because we no longer need to copy all the byte arrays.

This PR also changes the underlying type for small decimals that fit within an Int, so that getInt() can be used to look up the value from IntDictionary.

## How was this patch tested?

Manually tested TPCDS Q7 with scale factor 10 and saw about a 30% improvement (after PR apache#11274).

Author: Davies Liu <[email protected]>

Closes apache#11437 from davies/decode_dict.
What changes were proposed in this pull request?
This PR defers resolving a dictionary id to its value until the column is actually accessed (inside getInt/getLong). This is very useful for columns and rows that are filtered out, since their values never need to be decoded. It is also useful for the binary type, because we no longer need to copy all the byte arrays.
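The idea of deferring the lookup can be illustrated with a minimal sketch. The class names `LazyIntColumn` and `LazyDictionaryDemo`, the `lookups` counter, and the `sumWhere` helper are all hypothetical and exist only for this illustration; they are not the classes changed by this PR. The sketch assumes a column that stores only the encoded ids and resolves a value the first time a row is read.

```java
import java.util.function.IntPredicate;

/** Hypothetical, simplified column; not the actual vectorized reader classes. */
final class LazyIntColumn {
    private final int[] dictionary;  // distinct values, indexed by dictionary id
    private int[] dictionaryIds;     // per-row ids, kept encoded
    private int lookups = 0;         // counts dictionary resolutions, for illustration only

    LazyIntColumn(int[] dictionary) {
        this.dictionary = dictionary.clone();
    }

    /** Stores the encoded ids for a batch without resolving any value. */
    void putDictionaryIds(int[] ids) {
        this.dictionaryIds = ids.clone();
        this.lookups = 0;
    }

    /** Resolves the value of a single row only when it is actually read. */
    int getInt(int rowId) {
        lookups++;
        return dictionary[dictionaryIds[rowId]];
    }

    int numRows() {
        return dictionaryIds.length;
    }

    int lookups() {
        return lookups;
    }
}

final class LazyDictionaryDemo {
    /** Sums `values` over the rows whose `filter` value passes the predicate. */
    static long sumWhere(LazyIntColumn filter, LazyIntColumn values, IntPredicate keep) {
        long sum = 0;
        for (int row = 0; row < filter.numRows(); row++) {
            if (keep.test(filter.getInt(row))) {
                sum += values.getInt(row);  // only surviving rows are decoded
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        LazyIntColumn filter = new LazyIntColumn(new int[] {1, 2});
        filter.putDictionaryIds(new int[] {0, 1, 0, 1});
        LazyIntColumn values = new LazyIntColumn(new int[] {10, 20, 30});
        values.putDictionaryIds(new int[] {0, 1, 2, 1});

        long sum = sumWhere(filter, values, v -> v == 2);
        System.out.println("sum = " + sum + ", value lookups = " + values.lookups());
    }
}
```

In this sketch the `values` column is resolved only for the rows that pass the filter, whereas an eager decoder would resolve every row of the batch up front.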
This PR also changes the underlying type for small decimals that fit within an Int, so that getInt() can be used to look up the value from IntDictionary.
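As a rough illustration of why a small decimal can live in an int-keyed dictionary: any decimal with at most 9 digits of precision has an unscaled value no larger than 999,999,999, which fits in a 32-bit int. The class `SmallDecimalDictionary`, its constant, and its `decode` method below are hypothetical names used only for this sketch, not the reader's actual API.

```java
import java.math.BigDecimal;

/** Hypothetical sketch: a dictionary of small decimals stored as unscaled ints. */
final class SmallDecimalDictionary {
    // 10^9 - 1 = 999,999,999 is below Integer.MAX_VALUE (2,147,483,647),
    // so any decimal with precision <= 9 has an unscaled value that fits in an int.
    static final int MAX_INT_DIGITS = 9;

    private final int[] unscaledValues;  // indexed by dictionary id
    private final int scale;

    SmallDecimalDictionary(int[] unscaledValues, int precision, int scale) {
        if (precision > MAX_INT_DIGITS) {
            throw new IllegalArgumentException(
                "precision " + precision + " does not fit in an int");
        }
        this.unscaledValues = unscaledValues.clone();
        this.scale = scale;
    }

    /** Resolves a dictionary id to its decimal value via a plain int lookup. */
    BigDecimal decode(int id) {
        return BigDecimal.valueOf(unscaledValues[id], scale);
    }
}
```

With a scale of 2, the unscaled int 12345 decodes to 123.45, so the per-row lookup stays an int array access and the decimal is only materialized when it is read.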
How was this patch tested?
Manually tested TPCDS Q7 with scale factor 10 and saw about a 30% improvement (after PR #11274).