[SPARK-14098][SQL] Generate Java code that gets a float/double value in each column of CachedBatch when DataFrame.cache() is called #11956
Conversation
Jenkins, retest this please
Jenkins, retest this please
Hi @davies, could you please take a look at this, now that Spark 2.0.0 has been successfully released?
@rxin, could you please review this?
@davies, I hope you have some bandwidth to review PRs. Could you please review this, too?
@davies, could you please review this?
@kiszk I'm sorry that I do not have the bandwidth to review this. https://github.com/apache/spark/pull/13899/files looks like an easier approach (I have not looked into the details); what do you think of these two?
@davies, thank you for your comment. I hope that you will have some bandwidth soon, now that Spark 2.0 has been released. In particular, the generated code for reading a column is almost the same in both approaches. I listed how the two approaches handle three features; this PR does the same thing.
I would like to simplify my PR by using the idea in that PR. For example, I could throw away some of the new files. I have the following questions:
What do you think?
@davies Would it be possible to share your opinions on these design questions with us?
@hvanhovell @marmbrus @srowen I see this PR has been open since the 25th of March and, as mentioned above, provides substantial performance improvements without introducing functional regressions. As leading SQL/community members, what do you think?
I'm not qualified to comment, as I tend to ignore SQL unless it's a simple and easy-to-evaluate change.
@davies Could you please share your opinions on these design questions with the community? We know you are busy.
@kiszk The current implementation uses ByteBuffer and smart compression algorithms; it is too slow at building the in-memory cache, which makes it useless. So we'd like to improve the performance of the building phase as well. This PR is built for the current representation, which may be thrown away in the future, so I'd rather not rush to merge this PR, or spend too much time reviewing the details. PR #13899 could be in the right direction, but we need to double-check that with more benchmarks. In order to have better memory efficiency, we could use the MEMORY_AND_SER storage level and compress the underlying array with LZ4 when serializing the ColumnVector.
cc @rxin |
@davies Thank you for sharing your valuable thoughts. I understand the future direction, and I will implement it in this PR or another PR. While the future roadmap may consume more memory, the performance will be improved. The compression will be applied only when … What benchmark programs do you want to use to double-check this roadmap?
We could compress them in memory with MEMORY_AND_DISK_SER; this could be controlled by a flag.
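As a rough illustration of the compress-on-serialize idea discussed above, here is a minimal, hypothetical sketch that compresses the backing byte array of a double column when serializing and restores it on read. It uses the JDK's Deflater/Inflater as a stand-in for LZ4 (which requires an external library); none of the class or method names below come from Spark.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ColumnCompressionSketch {
    // Compress the backing bytes of a column when it is serialized.
    static byte[] compress(byte[] input) throws Exception {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restore the original bytes; the uncompressed length must be known.
    static byte[] decompress(byte[] input, int originalLength) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        byte[] result = new byte[originalLength];
        int off = 0;
        while (!inflater.finished()) {
            off += inflater.inflate(result, off, result.length - off);
        }
        inflater.end();
        return result;
    }

    public static void main(String[] args) throws Exception {
        // A hypothetical double column: 1024 doubles backed by a byte array.
        byte[] raw = new byte[1024 * 8];
        ByteBuffer column = ByteBuffer.wrap(raw);
        for (int i = 0; i < 1024; i++) column.putDouble(i * 8, 0.5 * i);

        byte[] packed = compress(raw);
        byte[] restored = decompress(packed, raw.length);
        System.out.println("compressed " + raw.length + " -> " + packed.length
            + " bytes, row 1 = " + ByteBuffer.wrap(restored).getDouble(8));
    }
}
```

A real implementation would also need to record the uncompressed length alongside the compressed bytes, and the flag davies mentions would decide whether compression happens at all.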
Thank you for your clarification. Here is a good summary for me.
I saw PR #13899. I understand there are two design points for now; hopefully, no more :)
For point 1, the current ColumnarBatch is not serializable.
The current PR #13899 does not support the case where an element is null.
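To illustrate what null support involves, here is a minimal, hypothetical sketch of a nullable double column that keeps a null-tracking array next to the value buffer. The names are invented for illustration and are not Spark's ColumnVector API.

```java
import java.nio.ByteBuffer;

public class NullableDoubleColumn {
    private final ByteBuffer values;
    private final byte[] nulls;  // 1 = the row holds null

    public NullableDoubleColumn(int capacity) {
        values = ByteBuffer.allocate(capacity * 8);
        nulls = new byte[capacity];
    }

    public void putDouble(int row, double v) {
        values.putDouble(row * 8, v);
        nulls[row] = 0;
    }

    public void putNull(int row) { nulls[row] = 1; }

    public boolean isNullAt(int row) { return nulls[row] == 1; }

    public double getDouble(int row) { return values.getDouble(row * 8); }

    // A sum that skips nulls, the way generated aggregate code would have to.
    public static double sumNonNull(NullableDoubleColumn col, int numRows) {
        double sum = 0;
        for (int i = 0; i < numRows; i++) {
            if (!col.isNullAt(i)) sum += col.getDouble(i);
        }
        return sum;
    }
}
```

A bitmap (one bit per row) rather than one byte per row would be more compact, but the access pattern is the same: every read of a nullable column must consult the null tracking first.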
@kiszk Is this still an issue? I know you are working on the related part now.
Thank you for pointing it out. #18747 implemented this feature. |
What changes were proposed in this pull request?
This PR generates Java code that gets a float/double value of each column from a CachedBatch when DataFrame.cache() is called. This is done as part of whole-stage code generation.
When DataFrame.cache() is called, data is stored in column-oriented storage (a columnar cache) in CachedBatch. This PR avoids the conversion from column-oriented storage to row-oriented storage.
This PR handles only float and double values that are stored in a column without compression. Another PR will handle the other primitive types, which may be stored in a column in a compressed format. This keeps the PR small for ease of review.
This PR consists of two parts:
1. Generate Java code that calls the decompress() method. A CachedBatch consists of multiple ByteBuffer arrays, and a ByteBuffer is passed directly to the generated code.
2. Generate Java code for the columnar cache only if the types of all columns accessed in the operations are primitive.
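As a rough sketch of part 1, this is approximately what a hand-written equivalent of the generated access code could look like once a CachedBatch's ByteBuffer is handed to generated code: plain byte-offset arithmetic instead of materializing a row. All names here are hypothetical, not Spark's actual generated code.

```java
import java.nio.ByteBuffer;

public class ColumnarAccessSketch {
    // Hypothetical equivalent of generated code: read row `rowId` of a
    // double column directly from the cached ByteBuffer (8 bytes per value).
    static double getDouble(ByteBuffer column, int rowId) {
        return column.getDouble(rowId * 8);
    }

    // Same idea for a float column (4 bytes per value).
    static float getFloat(ByteBuffer column, int rowId) {
        return column.getFloat(rowId * 4);
    }

    public static void main(String[] args) {
        ByteBuffer doubles = ByteBuffer.allocate(4 * 8);
        for (int i = 0; i < 4; i++) doubles.putDouble(i * 8, i + 0.25);
        System.out.println(getDouble(doubles, 2));  // prints 2.25
    }
}
```

The point of generating this per-query is that the offsets and types are known at code-generation time, so no per-row type dispatch or row object is needed.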
This PR improves the performance of aggregate sum by 3.8x - 5.2x. The benchmark is available here
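The 3.8x - 5.2x figure comes from the linked benchmark. As a rough, standalone stand-in, the following micro-benchmark compares summing doubles through a columnar ByteBuffer with summing through per-row objects (a simplified proxy for row-oriented access). Absolute timings will vary with the JVM and JIT, so only the equality of the two sums is meaningful.

```java
import java.nio.ByteBuffer;

public class CacheSumBenchmarkSketch {
    static final int N = 1_000_000;

    // Row-oriented stand-in: each row is a separate object holding the value.
    static final class Row {
        final double value;
        Row(double v) { value = v; }
    }

    static double sumRows(Row[] rows) {
        double s = 0;
        for (Row r : rows) s += r.value;
        return s;
    }

    static double sumColumn(ByteBuffer col) {
        double s = 0;
        for (int i = 0; i < N; i++) s += col.getDouble(i * 8);
        return s;
    }

    public static void main(String[] args) {
        Row[] rows = new Row[N];
        ByteBuffer col = ByteBuffer.allocate(N * 8);
        for (int i = 0; i < N; i++) {
            rows[i] = new Row(i * 0.001);
            col.putDouble(i * 8, i * 0.001);
        }
        long t0 = System.nanoTime();
        double a = sumRows(rows);
        long t1 = System.nanoTime();
        double b = sumColumn(col);
        long t2 = System.nanoTime();
        System.out.printf("row: %d ms, column: %d ms, sums equal: %b%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

Both loops add the same values in the same order, so the two sums are bitwise identical; only the memory-access pattern differs.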
Performance results:
Motivating example:
Generated code
How was this patch tested?
Tested with existing test suites.
Added test suites for operations on a DataFrame generated by df.cache().