-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add support for Union types in RowConverter
#8839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add support for Union types in RowConverter
#8839
Conversation
9a62f3c to
2cd0253
Compare
2cd0253 to
5559011
Compare
| offsets, | ||
| mode, | ||
| } => { | ||
| let union_array = array.as_any().downcast_ref::<UnionArray>().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let union_array = array.as_any().downcast_ref::<UnionArray>().unwrap(); | |
| let union_array = array.as_any().downcast_ref::<UnionArray>().expect("expected Union array"); |
as at line 631
|
|
||
| let mut child_rows = Vec::with_capacity(converters.len()); | ||
| for (type_id, converter) in converters.iter().enumerate() { | ||
| let child_array = union_array.child(type_id as i8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here type_id is the index of the converter. It looks strange but it might be OK.
Could you use the items in type_ids instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense because the type_id is the index of the child field types
Maybe we can document better that converters is indexed by type_id 🤔
arrow-row/src/lib.rs
Outdated
| offsets: offsets_buf, | ||
| mode, | ||
| } => { | ||
| let _union_array = column.as_any().downcast_ref::<UnionArray>().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let _union_array = column.as_any().downcast_ref::<UnionArray>().unwrap(); |
since it is not used
| let len = rows.len(); | ||
|
|
||
| let DataType::Union(union_fields, mode) = &field.data_type else { | ||
| unreachable!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| unreachable!() | |
| unreachable!("Expected a Union but got: {}", &field.data_type) |
| } | ||
| DataType::Union(fields, mode) => { | ||
| // similar to dictionaries and lists, we set descending to false and negate nulls_first | ||
| // since the encodedc ontents will be inverted if descending is set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // since the encodedc ontents will be inverted if descending is set | |
| // since the encoded contents will be inverted if descending is set |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @friendlymatthew -- this is looking good. I left some comments and @martin-g 's comments are good to review too
|
|
||
| let mut child_rows = Vec::with_capacity(converters.len()); | ||
| for (type_id, converter) in converters.iter().enumerate() { | ||
| let child_array = union_array.child(type_id as i8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense because the type_id is the index of the child field types
Maybe we can document better that converters is indexed by type_id 🤔
| Union { | ||
| child_rows: Vec<Rows>, | ||
| type_ids: ScalarBuffer<i8>, | ||
| offsets: Option<ScalarBuffer<i32>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strictly speaking the mode is redundant here -- if there are no offsets, then the mode is sparse, otherwise the mode is dense. You could probably simplify the code if you removed the redundancy
| (UnionMode::Dense, Some(o)) => o[i] as usize, | ||
| (UnionMode::Sparse, None) => i, | ||
| foreign => { | ||
| unreachable!("invalid union mode/offsets combination: {foreign:?}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above for a way to simplify this (don't hold mode too)
| } | ||
|
|
||
| #[test] | ||
| fn test_sparse_union() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please also add tests here for union arrays that have nulls? Specifically for a union array that has a null buffer
|
|
||
| for (idx, row) in rows.iter_mut().enumerate() { | ||
| // skip the null sentinel | ||
| let mut cursor = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to look at the null byte to recover nulls 🤔
Which issue does this PR close?
Uniondata types for row format #8828Rationale for this change
This PR implements row format conversion for Union types (both sparse and dense modes) in the row kernel. Union types can now be encoded into the row format for sorting and comparison ops
It handles both sparse and dense union modes by encoding each row as a null sentinel byte, followed by the type id byte, and then the encoded child row data. During decoding, rows are grouped by their type id and routed to the appropriate child converter