[SPARK-22383][SQL] Generate code to directly get value of primitive type array from ColumnVector for table cache #19601
|
Test build #83179 has finished for PR 19601 at commit
|
|
@ueshin @cloud-fan could you please review this? |
We should not change the return type. ColumnVector will be public eventually, and ArrayData is not a public type.
I see.
One question. ColumnVector.Array has some public fields such as length. I think that it would be good to use an accessor numElements or getLength. What do you think?
|
My feeling is that, we should change the cache format of array type to make it compatible with |
|
Current |
|
So for primitive types, we encode and compress them to binary. When reading cached data, they are decoded to primitive array and can be put in For primitive type array, we treat it as binary. So when decoding it, we get a byte[] and need more effort to convert it to primitive type and put in Can we change how we encode array type like Arrow did? |
|
There are two approaches to support a primitive array that is treated as binary. One is to add new I can add a new To use |
|
I'd also like to improve the write path. I think the current way of caching the array type is not efficient; an Arrow-like format, which puts all elements (including those of nested arrays) together, is better for encoding and compression. |
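For readers unfamiliar with the Arrow-like layout mentioned here, the following is a minimal sketch of the general idea, not code from this PR or from Spark. It assumes a non-nested int array column without nulls; the class name `FlatIntArrayColumn` and its methods are hypothetical. All elements live in one contiguous `values` buffer, and an `offsets` buffer records where each row's elements start and end.

```java
import java.util.Arrays;

// Hypothetical illustration of an Arrow-like layout for a column of int arrays.
// Null handling and nesting are deliberately left out of this sketch.
final class FlatIntArrayColumn {
    // Row i owns values[offsets[i] .. offsets[i + 1]).
    final int[] offsets;
    final int[] values;

    FlatIntArrayColumn(int[][] rows) {
        offsets = new int[rows.length + 1];
        int total = 0;
        for (int i = 0; i < rows.length; i++) {
            total += rows[i].length;
            offsets[i + 1] = total;
        }
        // Pack every row's elements into one contiguous buffer.
        values = new int[total];
        for (int i = 0; i < rows.length; i++) {
            System.arraycopy(rows[i], 0, values, offsets[i], rows[i].length);
        }
    }

    // Number of elements in the given row.
    int length(int row) {
        return offsets[row + 1] - offsets[row];
    }

    // Element i of the given row, read straight from the shared buffer.
    int get(int row, int i) {
        return values[offsets[row] + i];
    }

    public static void main(String[] args) {
        FlatIntArrayColumn col =
            new FlatIntArrayColumn(new int[][] {{1, 2, 3}, {}, {4, 5}});
        System.out.println(Arrays.toString(col.offsets)); // [0, 3, 3, 5]
        System.out.println(Arrays.toString(col.values));  // [1, 2, 3, 4, 5]
        System.out.println(col.get(2, 1));                // 5
    }
}
```

Because the elements are stored as one primitive buffer rather than as per-row binary blobs, a primitive encoder or compressor could in principle be applied to `values` as a whole, which is the benefit the comment points to.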
|
I agree that we need to improve the write path. As you suggested before, that will be addressed in a follow-up PR, after the frequently executed read path has been improved. I think there are two parts: 1) change the data format and 2) generate specialized code for each table cache. For improving the read path, which approach is better? To add new |
|
Both ways work; just pick the simpler one. I'm concerned about how to access nested arrays. You can try both approaches and see which one solves the problem more easily. |
|
For now, this implementation is limited to non-nested arrays, for ease of review. I will try to support nested arrays. |
|
After thinking about the choice for a while, I concluded that it is better to add the new |
|
Jenkins, retest this please |
|
Test build #83208 has finished for PR 19601 at commit
|
|
Test build #83210 has finished for PR 19601 at commit
|
|
Test build #83223 has started for PR 19601 at commit |
|
Jenkins, retest this please |
|
Test build #83246 has finished for PR 19601 at commit
|
|
Jenkins, retest this please |
|
Test build #83250 has finished for PR 19601 at commit
|
|
@cloud-fan could you please review this PR? For ease of review, I would like to ask you to review it for the simple case (non-nested primitive arrays) first. |
|
Can we hold it for a while? I'm thinking about the ColumnVector refactoring and how to deal with nested data uniformly. |
|
My prototype can handle nested arrays by changing |
|
Test build #83460 has finished for PR 19601 at commit
|
|
Test build #83463 has started for PR 19601 at commit |
|
Jenkins, retest this please |
|
Test build #83465 has finished for PR 19601 at commit
|
|
@cloud-fan could you please review this again, since this version avoids overriding |
|
Test build #83689 has finished for PR 19601 at commit
|
|
There are some parts that rely on the format of |
|
We'd need to change the |
|
I see. Let us revisit this design later. I would appreciate it if you would review this columnar cache reader with simple primitive-type (non-nested) array. |
|
Test build #84080 has finished for PR 19601 at commit
|
|
Test build #84082 has finished for PR 19601 at commit
|
|
@cloud-fan could you please review this? |
|
Jenkins, retest this please |
|
Test build #84215 has finished for PR 19601 at commit
|
|
Test build #84253 has finished for PR 19601 at commit
|
|
Hi, @kiszk . Is this still valid for 3.0.0? |
|
Hi, @kiszk . Can we close this for now? You can make another PR later if you want. |
|
Sure, let me close this |
|
Thanks! |
What changes were proposed in this pull request?
This PR generates the Java code to directly get a value for a primitive type array in ColumnVector without using an iterator for table cache (e.g. dataframe.cache). This PR improves runtime performance by eliminating data copy from column-oriented storage to InternalRow in a SpecificColumnarIterator iterator for primitive type. This is a follow-up PR of #18747.
The idea of this implementation is to add ColumnVector.UnsafeArray to keep UnsafeArrayData for an array, in addition to ColumnVector.Array, which keeps a ColumnVector for a Java primitive array for an array.
Benchmark result: 21.4x
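To illustrate why reading array elements directly from column storage can avoid work, here is a small self-contained sketch. It is not the code generated by this PR and does not use Spark's ColumnVector API; the class `DirectArrayAccessSketch`, its fields, and both methods are hypothetical. One method copies each row's array out of the column before reading it, loosely mirroring an iterator that copies column data into a row object, and the other indexes into the column buffer directly.

```java
import java.util.Arrays;

// Hypothetical contrast between a copy-per-row read path and a direct read path.
final class DirectArrayAccessSketch {
    // Assumed column layout: row r owns VALUES[STARTS[r] .. STARTS[r + 1]).
    static final int[] STARTS = {0, 3, 5};
    static final int[] VALUES = {1, 2, 3, 10, 20};

    // Copy-based path: materialize each row's array, then read the copy.
    static long sumViaCopy() {
        long sum = 0;
        for (int r = 0; r + 1 < STARTS.length; r++) {
            int[] row = Arrays.copyOfRange(VALUES, STARTS[r], STARTS[r + 1]);
            for (int v : row) {
                sum += v;
            }
        }
        return sum;
    }

    // Direct path: index straight into the column buffer, with no per-row copy.
    static long sumDirect() {
        long sum = 0;
        for (int r = 0; r + 1 < STARTS.length; r++) {
            for (int i = STARTS[r]; i < STARTS[r + 1]; i++) {
                sum += VALUES[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Both paths compute the same result; only the amount of copying differs.
        System.out.println(sumViaCopy() + " " + sumDirect()); // 36 36
    }
}
```

The sketch only shows the shape of the two access patterns; it says nothing about the magnitude of the speedup, which depends on the actual generated code and workload.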
Benchmark program
How was this patch tested?
Added test cases into ColumnVectorSuite, DataFrameTungstenSuite, and WholeStageCodegenSuite.