[SPARK-12854][SQL] Implement complex types support in ColumnarBatch #10820
Conversation
…atch.

WIP: this patch adds some random row generation. The test code needs to be cleaned up as it duplicates functionality from elsewhere. The non-test code should be good to review.

This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays. There is a simple mapping from the richer Catalyst types to these two; strings are treated as arrays of bytes.

ColumnarBatch contains a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs are internal nodes with one child for each field; arrays are internal nodes with one child. Structs contain just nullability, while arrays contain offsets and lengths into the child array. This structure can handle arbitrary nesting. It has the key property that the data stays columnar throughout and that primitive values are stored only in the leaf nodes, contiguous across rows. For example, if the schema is array<array<int>>, all of the int data is stored consecutively.

As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowid, v) vs. appendLong(v)). These APIs are necessary when the batch contains variable-length elements: the vectors are not fixed length and will grow as necessary. This should make usage a lot simpler for the writer.
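To make the layout above concrete, here is a minimal, self-contained sketch in plain Java. The names and data structures are illustrative only, not Spark's actual ColumnVector API: it shows how a schema of array<array<int>> decomposes into three columns, with offsets and lengths at each internal node and all int data contiguous in the leaf.

```java
// Hypothetical illustration of the columnar layout described above; the
// names here are illustrative, not Spark's actual API.
public class NestedArrayLayoutDemo {
    // Reconstruct one row of an array<array<int>> value from three columns:
    // the outer array node (offsets/lengths into the inner column), the
    // inner array node (offsets/lengths into the leaf), and the leaf ints,
    // which are stored contiguously across rows.
    static String renderRow(int row,
                            int[] outerOffsets, int[] outerLengths,
                            int[] innerOffsets, int[] innerLengths,
                            int[] leafData) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < outerLengths[row]; i++) {
            int inner = outerOffsets[row] + i;
            if (i > 0) sb.append(",");
            sb.append("[");
            for (int j = 0; j < innerLengths[inner]; j++) {
                if (j > 0) sb.append(",");
                sb.append(leafData[innerOffsets[inner] + j]);
            }
            sb.append("]");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // Row 0 holds the value [[1,2],[3]]; the leaf ints {1,2,3} sit in
        // one contiguous array even though they belong to two inner arrays.
        System.out.println(renderRow(0,
            new int[]{0}, new int[]{2},       // outer: 2 inner arrays, starting at inner index 0
            new int[]{0, 2}, new int[]{2, 1}, // inner: [1,2] at leaf 0..1, [3] at leaf 2
            new int[]{1, 2, 3}));             // prints [[1,2],[3]]
    }
}
```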
Test build #49636 has finished for PR 10820 at commit
if the schema is array<array<int>>, is this a single array of int?
Can you explain how lengths and offsets are stored? Also, is there a single "parent" column that encodes nullability, length, and offset?
I'll update the comment when I rev the PR but the answer is:
I'm not sure what single "parent" means. The array is a column that stores nullability, lengths and offsets. The child column stores the values, including their nullability. Either one can be independently nullable or not.
Lengths and offsets are encoded as plain ints.
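The answer above can be sketched as a small data-layout example. All names here are hypothetical, not Spark's API: the array column holds nullability plus int offsets and lengths, while the child column holds the values with its own, independent nullability.

```java
// Hypothetical sketch of the encoding described above (illustrative names,
// not Spark's API): the array column stores nullability, offsets, and
// lengths as plain ints; the child column stores the values and is
// independently nullable.
public class ArrayColumnSketch {
    final boolean[] arrayIsNull;  // nullability of each array row
    final int[] offsets;          // start index into the child column (plain ints)
    final int[] lengths;          // element counts (plain ints)
    final boolean[] childIsNull;  // child values can be null independently
    final int[] childData;        // child values, contiguous across rows

    ArrayColumnSketch(boolean[] arrayIsNull, int[] offsets, int[] lengths,
                      boolean[] childIsNull, int[] childData) {
        this.arrayIsNull = arrayIsNull;
        this.offsets = offsets;
        this.lengths = lengths;
        this.childIsNull = childIsNull;
        this.childData = childData;
    }

    // Sum the non-null elements of one array row; a null array sums to 0.
    int sumRow(int row) {
        if (arrayIsNull[row]) return 0;
        int sum = 0;
        for (int i = 0; i < lengths[row]; i++) {
            int idx = offsets[row] + i;
            if (!childIsNull[idx]) sum += childData[idx];
        }
        return sum;
    }
}
```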
Will ColumnarBatch become a columnar version of UnsafeRow, or will it become a more general columnar store?
@kiszk We're looking to make the execution engine represent data in a columnar way. I'm not sure what you mean by a more general columnar store; I usually think of that more as a storage concern.
Test build #50029 has finished for PR 10820 at commit
Should this also work for on-heap?
I have this in the same group of functions as freeMemory and allocateMemory. What would the on-heap version look like?
How does it compare to Unsafe.copyMemory()?
Test build #50044 has finished for PR 10820 at commit
Test build #50046 has finished for PR 10820 at commit
Test build #50051 has finished for PR 10820 at commit
Test build #50080 has finished for PR 10820 at commit
Test build #50119 has finished for PR 10820 at commit
UTF8String supports reading off-heap too, so you can probably improve this substantially.
Test build #2460 has finished for PR 10820 at commit
Test build #2464 has finished for PR 10820 at commit
Going to merge this in master. Thanks.
    childCapacity *= DEFAULT_ARRAY_LENGTH;
}
this.childColumns = new ColumnVector[1];
this.childColumns[0] = ColumnVector.allocate(childCapacity, childType, memMode);
Why only grow the capacity for non-array types?
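For context on the question above, here is a minimal sketch of the sizing rule the snippet appears to implement. The constant's value and all names are assumptions, not Spark's actual code: an array parent reserves extra child slots up front because each row may hold several elements, while a struct's child stays one-to-one with its rows.

```java
// Hypothetical sketch of the child-capacity rule in the snippet above.
// DEFAULT_ARRAY_LENGTH's value here is an assumption for illustration.
public class ChildCapacitySketch {
    static final int DEFAULT_ARRAY_LENGTH = 4;

    // An array parent expects roughly DEFAULT_ARRAY_LENGTH elements per row,
    // so its child column is over-allocated; a struct child is one-to-one.
    static int childCapacity(int rowCapacity, boolean parentIsArray) {
        return parentIsArray ? rowCapacity * DEFAULT_ARRAY_LENGTH : rowCapacity;
    }
}
```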
This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs
and arrays. There is a simple mapping from the richer Catalyst types to these two. Strings
are treated as an array of bytes.

ColumnarBatch contains a column for each node of the schema. Non-complex schemas consist
of just leaf nodes. Structs are internal nodes with one child for each field. Arrays
are internal nodes with one child. Structs contain just nullability. Arrays contain offsets
and lengths into the child array. This structure can handle arbitrary nesting. It has
the key property that we maintain columnar data throughout and that primitive types are stored
only in the leaf nodes, contiguous across rows. For example, if the schema is
array<array<int>>, there are three columns in the schema: the internal nodes each have one
child, and the leaf node contains all of the int data, stored consecutively.
As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v)
vs appendLong(v)). These APIs are necessary when the batch contains variable length elements.
The vectors are not fixed length and will grow as necessary. This should make the usage a lot
simpler for the writer.
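The contrast between the two write styles can be sketched as follows. Method names mirror the description above, but the implementation is illustrative, not necessarily Spark's exact API:

```java
// Hypothetical sketch contrasting the put and append styles described
// above; an illustration, not Spark's actual ColumnVector implementation.
public class AppendVsPutSketch {
    long[] data = new long[2];
    int elementsAppended = 0;

    // Put style: the caller addresses a fixed row id directly, so the
    // vector's length must be known up front.
    void putLong(int rowId, long v) {
        data[rowId] = v;
    }

    // Append style: the vector tracks its own write position and grows as
    // needed, which suits variable-length children whose total element
    // count is not known in advance.
    void appendLong(long v) {
        if (elementsAppended == data.length) {
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        data[elementsAppended++] = v;
    }
}
```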