-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-9404][SPARK-9542][SQL] unsafe array data and map data #7752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #38855 has finished for PR 7752 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will fail on all places that use toArray in UnsafeArrayData right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll never call toArray in codegen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK the thing I'm worried about is fallback mode, and also I want to make it so unsaferow can be passed onto normal non-codegen/unsafe operators.
|
Test build #38969 has finished for PR 7752 at commit
|
|
Test build #38984 has finished for PR 7752 at commit
|
|
I've merged the other ArrayData patch. Can you bring this one up to date? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a more abstract version of UnsafeRowWriters, we can make it work for both row and array in future refactor.
|
Test build #39356 has finished for PR 7752 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should definitely document the binary format of UnsafeArrayData in the javadoc.
|
Test build #39400 has finished for PR 7752 at commit
|
|
Test build #39402 has finished for PR 7752 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may could just use UnsafeRow for ArrayType, it's not the most efficient one, but easy to catch the deadline, we could optimize it later, until we have more clear mind on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He's already done with it hasn't he?
|
Test build #39410 has finished for PR 7752 at commit
|
|
Test build #39411 has finished for PR 7752 at commit
|
|
Test build #39430 has finished for PR 7752 at commit
|
|
Test build #39436 has finished for PR 7752 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ghe -> the
feda148 to
345c064
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
label baseObject as @nullable
|
Test build #39486 has finished for PR 7752 at commit
|
|
Test build #39489 timed out for PR 7752 at commit |
|
Test build #1303 has finished for PR 7752 at commit
|
|
OK going to merge this. |
…enerateSafe In #7752 we added `FromUnsafe` to convert nexted unsafe data like array/map/struct to safe versions. It's a quick solution and we already have `GenerateSafe` to do the conversion which is codegened. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. Author: Wenchen Fan <[email protected]> Closes #8029 from cloud-fan/from-unsafe and squashes the following commits: ed40d8f [Wenchen Fan] add the copy back a93fd4b [Wenchen Fan] cogengen FromUnsafe (cherry picked from commit 106c078) Signed-off-by: Davies Liu <[email protected]>
…enerateSafe In #7752 we added `FromUnsafe` to convert nexted unsafe data like array/map/struct to safe versions. It's a quick solution and we already have `GenerateSafe` to do the conversion which is codegened. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. Author: Wenchen Fan <[email protected]> Closes #8029 from cloud-fan/from-unsafe and squashes the following commits: ed40d8f [Wenchen Fan] add the copy back a93fd4b [Wenchen Fan] cogengen FromUnsafe
…enerateSafe In apache#7752 we added `FromUnsafe` to convert nexted unsafe data like array/map/struct to safe versions. It's a quick solution and we already have `GenerateSafe` to do the conversion which is codegened. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. Author: Wenchen Fan <[email protected]> Closes apache#8029 from cloud-fan/from-unsafe and squashes the following commits: ed40d8f [Wenchen Fan] add the copy back a93fd4b [Wenchen Fan] cogengen FromUnsafe
…enerateSafe In apache/spark#7752 we added `FromUnsafe` to convert nexted unsafe data like array/map/struct to safe versions. It's a quick solution and we already have `GenerateSafe` to do the conversion which is codegened. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. Author: Wenchen Fan <[email protected]> Closes #8029 from cloud-fan/from-unsafe and squashes the following commits: ed40d8f [Wenchen Fan] add the copy back a93fd4b [Wenchen Fan] cogengen FromUnsafe
This PR adds a UnsafeArrayData, current we encode it in this way:
first 4 bytes is the # elements
then each 4 byte is the start offset of the element, unless it is negative, in which case the element is null.
followed by the elements themselves
an example: [10, 11, 12, 13, null, 14] will be encoded as:
5, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14
Note that, when we read a UnsafeArrayData from bytes, we can read the first 4 bytes as numElements and take the rest(first 4 bytes skipped) as value region.
unsafe map data just use 2 unsafe array data, first 4 bytes is # of elements, second 4 bytes is numBytes of key array, the follows key array data and value array data.