[SPARK-3294][SQL] Eliminates boxing costs from in-memory columnar storage #2327
This change is submitted separately in #2325, as this PR may take longer to finish.
QA tests have started for PR 2327 at commit
QA tests have finished for PR 2327 at commit
Out of curiosity, does this also eliminate boxing for nested data types?
No, unlike Parquet, currently our in-memory columnar format doesn't support complex nested objects well. They are just serialized by Kryo and stored as opaque byte arrays.
@aarondav to expand on that, as soon as there is any nesting all of our clever tricks for eliminating allocations go out the window. We can probably improve this in future releases.
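To make the point about nesting concrete, here is a minimal sketch of the difference between a primitive column and a column whose cells are opaque serialized blobs. It is not Spark's code: `OpaqueColumnSketch` and its members are hypothetical names, and plain Java serialization stands in for Kryo only to keep the example dependency-free.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical illustration only; Java serialization is a stand-in for Kryo.
object OpaqueColumnSketch {
  // Turn a nested value into an opaque byte array.
  def serialize(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Reading a nested cell allocates streams and a fresh array on every access.
  def deserialize(bytes: Array[Byte]): Array[Int] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[Array[Int]] finally in.close()
  }

  // Primitive column: values live unboxed in a primitive array.
  val intColumn: Array[Int] = Array(10, 20, 30)

  // Nested (array-typed) column: each cell is an opaque blob.
  private val firstCell: Array[Int] = Array(1, 2)
  private val secondCell: Array[Int] = Array(3)
  val arrayColumn: Array[Array[Byte]] = Array(serialize(firstCell), serialize(secondCell))

  // No allocation: a plain primitive read.
  def readInt(row: Int): Int = intColumn(row)

  // Allocation on every read: the blob must be deserialized first.
  def readArray(row: Int): Array[Int] = deserialize(arrayColumn(row))
}
```

Every call to `readArray` rebuilds the nested value from bytes, which is why the allocation-free tricks that work for flat primitive columns do not carry over.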
…d implementations
…rs a head of time
5cacd9a to 97bbc4e
ok to test
test this please
Tests timed out after a configured wait of
@marmbrus Please help review this one.
I need to look this over still, but want to remove WIP? |
This style is going to go away in 2.12 or 2.13, I think. Should be : Unit =
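The code under review is not shown here, so the following is an assumption: the comment most likely refers to Scala's procedure syntax (a method body with no `=` and no result type). Procedure syntax was eventually deprecated in Scala 2.13 and dropped in Scala 3. A small sketch with a hypothetical `UnitStyle` object shows the suggested form:

```scala
// Hypothetical example; not taken from the PR diff.
object UnitStyle {
  private var count = 0

  // Procedure syntax, shown only as a comment because newer compilers reject or warn on it:
  //   def increment() { count += 1 }

  // Suggested form: explicit `: Unit =` result type.
  def increment(): Unit = {
    count += 1
  }

  def current: Int = count
}
```

Both forms behave the same at runtime; the explicit result type only makes the intent visible and keeps the code compiling on later Scala versions.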
Nice speed ups. I think they might be even more pronounced when there are multiple threads fighting for the GC. Minor comments only. Will merge after they are addressed.
Thanks! I've merged this to master.
This is a major refactoring of the in-memory columnar storage implementation that aims to eliminate boxing costs from critical paths (building/accessing column buffers) as much as possible. The basic idea is to refactor all major interfaces into a row-based form and use them together with SpecificMutableRow. The difficult part is how to adapt all compression schemes, esp. RunLengthEncoding and DictionaryEncoding, to this design. Since in-memory compression is disabled by default for now, and this PR should be strictly better than before whether or not in-memory compression is enabled, maybe I'll finish that part in another PR.
UPDATE
This PR also took the chance to optimize HiveTableScan by using SpecificMutableRow to avoid boxing cost, and by building Writable unwrapper functions ahead of time to avoid per-row pattern matching and branching costs.
TODO
Eliminate boxing costs in RunLengthEncoding (left to future PRs)
Eliminate boxing costs in DictionaryEncoding (left to future PRs; seems not easy to do without specializing DictionaryEncoding for every supported column type)
Micro benchmark
The benchmark uses a 10-million-line CSV table consisting of bytes, shorts, integers, longs, floats and doubles. It measures the time to build the in-memory version of this table and the time to scan the whole in-memory table.
Benchmark code can be found here. Script used to generate the input table can be found here.
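For readers unfamiliar with the cost being measured, the sketch below contrasts a generic row that stores values as `Any` (so primitives are boxed) with a row built from reusable primitive holders, which is the general idea behind using a specialized mutable row. All class names here are hypothetical simplifications, not Spark's actual SpecificMutableRow implementation.

```scala
// Hypothetical, simplified types for illustration; not Spark's actual classes.

// Generic row: cells are typed Any, so an Int must be boxed to be stored.
final class GenericRowSketch(size: Int) {
  private val values = new Array[Any](size)
  def update(ordinal: Int, value: Any): Unit = {
    values(ordinal) = value
  }
  def apply(ordinal: Int): Any = values(ordinal)
}

// Reusable holders with a primitive field, one per column.
sealed trait MutableValueSketch
final class MutableIntSketch extends MutableValueSketch { var value: Int = 0 }
final class MutableDoubleSketch extends MutableValueSketch { var value: Double = 0.0 }

final class SpecificRowSketch(holders: Array[MutableValueSketch]) {
  def setInt(ordinal: Int, v: Int): Unit = {
    holders(ordinal).asInstanceOf[MutableIntSketch].value = v
  }
  def getInt(ordinal: Int): Int =
    holders(ordinal).asInstanceOf[MutableIntSketch].value
}

object BoxingSketch {
  // Boxed path: each write may allocate a java.lang.Integer, each read unboxes it.
  def sumIntsBoxed(n: Int): Long = {
    val row = new GenericRowSketch(2)
    var sum = 0L
    var i = 0
    while (i < n) {
      row(0) = i
      sum += row(0).asInstanceOf[Int]
      i += 1
    }
    sum
  }

  // Specialized path: the same row object is reused and only primitive fields change.
  def sumInts(n: Int): Long = {
    val row = new SpecificRowSketch(
      Array[MutableValueSketch](new MutableIntSketch, new MutableDoubleSketch))
    var sum = 0L
    var i = 0
    while (i < n) {
      row.setInt(0, i)
      sum += row.getInt(0)
      i += 1
    }
    sum
  }
}
```

Both methods compute the same sum; the difference is only in how many short-lived objects are created along the way, which is what puts pressure on the GC.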
Speedup:
Hive table scanning + column buffer building: 18.74%
The original benchmark uses 1K as the in-memory batch size; when the batch size is increased to 10K, it can be 28.32% faster.
In-memory table scanning: 7.95%
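Part of the Hive table scanning improvement is attributed in the description to choosing unwrapper functions ahead of time instead of pattern matching on every field of every row. A minimal sketch of that idea follows; the `Wrapped` and `ColumnKind` types are hypothetical stand-ins for Hive's Writable wrappers and the table schema, and the sketch still returns `Any`, so it illustrates only the per-row branching, not the boxing.

```scala
// Hypothetical stand-ins; not Hive's or Spark's real types.
object UnwrapperSketch {
  sealed trait Wrapped
  final case class WrappedInt(value: Int) extends Wrapped
  final case class WrappedLong(value: Long) extends Wrapped
  final case class WrappedText(value: String) extends Wrapped

  sealed trait ColumnKind
  case object IntColumn extends ColumnKind
  case object LongColumn extends ColumnKind
  case object TextColumn extends ColumnKind

  // Per-row approach: a pattern match runs for every field of every row.
  def unwrapPerRow(w: Wrapped): Any = w match {
    case WrappedInt(v)  => v
    case WrappedLong(v) => v
    case WrappedText(v) => v
  }

  // Ahead-of-time approach: branch once per column, based on the schema.
  def unwrapperFor(kind: ColumnKind): Wrapped => Any = kind match {
    case IntColumn  => (w: Wrapped) => w.asInstanceOf[WrappedInt].value
    case LongColumn => (w: Wrapped) => w.asInstanceOf[WrappedLong].value
    case TextColumn => (w: Wrapped) => w.asInstanceOf[WrappedText].value
  }

  def scan(schema: Array[ColumnKind], rows: Iterator[Array[Wrapped]]): Iterator[Array[Any]] = {
    // Built once, before any row is read.
    val unwrappers: Array[Wrapped => Any] = schema.map(unwrapperFor)
    rows.map { raw =>
      val out = new Array[Any](raw.length)
      var i = 0
      while (i < raw.length) {
        out(i) = unwrappers(i)(raw(i)) // direct function call, no match per field
        i += 1
      }
      out
    }
  }
}
```

The schema is fixed for the whole scan, so the decision about how to unwrap each column only needs to be made once.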
Before:
After: