@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

This PR implements row format conversion for Union types (both sparse and dense modes) in the row kernel. Union types can now be encoded into the row format for sorting and comparison operations.

Both modes use the same layout: each row is encoded as a null sentinel byte, followed by the type id byte, and then the encoded child row data. During decoding, rows are grouped by their type id and routed to the appropriate child converter.
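
To make the layout described above concrete, here is a minimal std-only sketch. It is not the PR's code: the function names `encode_union_row` and `group_by_type_id`, the sentinel values, and the plain cast of the type id to a byte are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

// Assumed sentinel values for this sketch only: 1 marks a valid row, 0 a null.
const VALID: u8 = 1;
const NULL: u8 = 0;

/// Encode one union slot as `[null sentinel][type id][child row bytes]`.
/// `child_row` stands in for whatever the child converter produced.
fn encode_union_row(is_valid: bool, type_id: i8, child_row: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(2 + child_row.len());
    out.push(if is_valid { VALID } else { NULL });
    out.push(type_id as u8);
    out.extend_from_slice(child_row);
    out
}

/// Group encoded rows by type id, keeping each row's original position so the
/// decoded children can be stitched back together in order.
fn group_by_type_id(rows: &[Vec<u8>]) -> BTreeMap<i8, Vec<(usize, &[u8])>> {
    let mut groups: BTreeMap<i8, Vec<(usize, &[u8])>> = BTreeMap::new();
    for (idx, row) in rows.iter().enumerate() {
        // Byte 0 is the sentinel, byte 1 the type id, the rest is child data.
        let type_id = row[1] as i8;
        groups.entry(type_id).or_default().push((idx, &row[2..]));
    }
    groups
}
```

Each group would then be handed to the child converter for that type id; the real converter also has to account for sort options, which this sketch ignores.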

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 13, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/union-row-converter branch 2 times, most recently from 9a62f3c to 2cd0253 Compare November 13, 2025 22:03
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/union-row-converter branch from 2cd0253 to 5559011 Compare November 13, 2025 22:09
offsets,
mode,
} => {
let union_array = array.as_any().downcast_ref::<UnionArray>().unwrap();
Member

Suggested change
- let union_array = array.as_any().downcast_ref::<UnionArray>().unwrap();
+ let union_array = array.as_any().downcast_ref::<UnionArray>().expect("expected Union array");

as at line 631


let mut child_rows = Vec::with_capacity(converters.len());
for (type_id, converter) in converters.iter().enumerate() {
let child_array = union_array.child(type_id as i8);
Member

Here type_id is the index of the converter. It looks strange but it might be OK.
Could you use the items in type_ids instead?

Contributor

I think it makes sense because the type_id is the index of the child field types

Maybe we can document better that converters is indexed by type_id 🤔
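
One way to make the indexing explicit is to look converters up by type id rather than relying on position. The sketch below is hypothetical and std-only: `ChildConverter` and `index_by_type_id` are illustrative names, and it assumes type ids are non-negative `i8` values that may not be contiguous.

```rust
/// Hypothetical stand-in for a per-child converter.
struct ChildConverter {
    field_name: &'static str,
}

/// Build a lookup table from type id to the converter's position.
/// Slots for unused type ids stay `None`.
fn index_by_type_id(type_ids: &[i8]) -> Vec<Option<usize>> {
    let mut lookup = vec![None; 128];
    for (pos, &id) in type_ids.iter().enumerate() {
        // Assumption of this sketch: negative type ids are not supported.
        assert!(id >= 0, "negative type ids are not handled in this sketch");
        lookup[id as usize] = Some(pos);
    }
    lookup
}
```

If the type ids are always `0..n`, the table is the identity mapping and `enumerate()` gives the same result, which is the case the reply above describes.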

offsets: offsets_buf,
mode,
} => {
let _union_array = column.as_any().downcast_ref::<UnionArray>().unwrap();
Member

Suggested change
- let _union_array = column.as_any().downcast_ref::<UnionArray>().unwrap();

since it is not used

let len = rows.len();

let DataType::Union(union_fields, mode) = &field.data_type else {
unreachable!()
Member

Suggested change
- unreachable!()
+ unreachable!("Expected a Union but got: {}", &field.data_type)

}
DataType::Union(fields, mode) => {
// similar to dictionaries and lists, we set descending to false and negate nulls_first
// since the encodedc ontents will be inverted if descending is set
Member

Suggested change
- // since the encodedc ontents will be inverted if descending is set
+ // since the encoded contents will be inverted if descending is set

Contributor

@alamb alamb left a comment

Thanks @friendlymatthew -- this is looking good. I left some comments, and @martin-g's comments are worth reviewing too.

Union {
child_rows: Vec<Rows>,
type_ids: ScalarBuffer<i8>,
offsets: Option<ScalarBuffer<i32>>,
Contributor

Strictly speaking, the mode is redundant here: if there are no offsets, the mode is sparse; otherwise it is dense. You could probably simplify the code by removing the redundancy.
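
A rough sketch of that simplification, using std only. The `UnionMode` enum here is a local stand-in rather than the arrow type, and `mode_from_offsets` and `child_row_index` are illustrative names, not functions from the PR.

```rust
/// Local stand-in for the union mode, defined here so the sketch is self-contained.
#[derive(Debug, PartialEq, Clone, Copy)]
enum UnionMode {
    Sparse,
    Dense,
}

/// The mode follows from whether offsets are present, so it need not be stored.
fn mode_from_offsets(offsets: Option<&[i32]>) -> UnionMode {
    match offsets {
        Some(_) => UnionMode::Dense,
        None => UnionMode::Sparse,
    }
}

/// Index into the child rows for logical row `i`.
/// Dense unions go through the offsets; sparse unions use `i` directly.
fn child_row_index(offsets: Option<&[i32]>, i: usize) -> usize {
    match offsets {
        Some(o) => o[i] as usize,
        None => i,
    }
}
```

With only `offsets` stored, the match has two arms and the `unreachable!` branch for mismatched mode/offsets pairs is no longer needed.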

(UnionMode::Dense, Some(o)) => o[i] as usize,
(UnionMode::Sparse, None) => i,
foreign => {
unreachable!("invalid union mode/offsets combination: {foreign:?}")
Contributor

see above for a way to simplify this (don't hold mode too)

}

#[test]
fn test_sparse_union() {
Contributor

Can you please also add tests here for union arrays that have nulls? Specifically, for a union array that has a null buffer.


for (idx, row) in rows.iter_mut().enumerate() {
// skip the null sentinel
let mut cursor = 1;
Contributor

I think you need to look at the null byte to recover nulls 🤔
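
As a sketch of what reading the sentinel could look like, assuming the `[null sentinel][type id][child row bytes]` layout from the PR description. `DecodedRow` and `decode_union_row` are hypothetical names, the caller supplies the sentinel value that marks a null, and whether a null row still carries a type id is an assumption of this sketch.

```rust
/// Hypothetical view over one decoded union row.
struct DecodedRow<'a> {
    is_null: bool,
    type_id: i8,
    child: &'a [u8],
}

/// Split a row into its parts, inspecting the sentinel instead of skipping it.
/// Returns `None` if the row is too short to hold a sentinel and a type id.
fn decode_union_row(row: &[u8], null_sentinel: u8) -> Option<DecodedRow<'_>> {
    let (&sentinel, rest) = row.split_first()?;
    let (&type_id, child) = rest.split_first()?;
    Some(DecodedRow {
        is_null: sentinel == null_sentinel,
        type_id: type_id as i8,
        child,
    })
}
```

The `is_null` flag could then be used to rebuild validity when the child rows are converted back.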

Development

Successfully merging this pull request may close these issues.

Support Union data types for row format

3 participants