Add support for `Union` types in `RowConverter` #8839

friendlymatthew · 2025-11-13T21:52:16Z

Which issue does this PR close?

Closes Support Union data types for row format #8828

Rationale for this change

This PR implements row format conversion for Union types (both sparse and dense modes) in the row kernel. Union types can now be encoded into the row format for sorting and comparison operations.

Each row is encoded as a null sentinel byte, followed by the type id byte, and then the encoded child row data. During decoding, rows are grouped by their type id and routed to the appropriate child converter.
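
To make the layout concrete, here is a minimal std-only sketch of the encoding and the group-by-type-id step described above. It is not the code in this PR: `encode_union_row`, `group_by_type_id`, and the sentinel values `VALID` and `NULL` are hypothetical names and values chosen for illustration, and the real child row bytes would come from the child converters.

```rust
// Illustrative sketch only; names and sentinel values are assumptions,
// not the arrow-row implementation.
const VALID: u8 = 1; // assumed sentinel for a non-null slot
const NULL: u8 = 0; // assumed sentinel for a null slot

/// Encode one union slot as `[null sentinel][type id][child row bytes]`.
/// `child_row` is `None` when the slot is null.
fn encode_union_row(type_id: i8, child_row: Option<&[u8]>) -> Vec<u8> {
    let mut out = Vec::with_capacity(2 + child_row.map_or(0, |r| r.len()));
    out.push(if child_row.is_some() { VALID } else { NULL });
    out.push(type_id as u8);
    if let Some(bytes) = child_row {
        out.extend_from_slice(bytes);
    }
    out
}

/// Group encoded rows by type id so each group can be handed to the
/// matching child converter. Each entry keeps the original row index and
/// the bytes that follow the two-byte header.
/// Assumes every type id is in `0..num_children`.
fn group_by_type_id<'a>(
    rows: &[&'a [u8]],
    num_children: usize,
) -> Vec<Vec<(usize, &'a [u8])>> {
    let mut groups = vec![Vec::new(); num_children];
    for (idx, row) in rows.iter().copied().enumerate() {
        let type_id = row[1] as i8;
        groups[type_id as usize].push((idx, &row[2..]));
    }
    groups
}
```

Keeping the original row index alongside each child slice is what would let a decoder put the decoded child values back in row order.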

martin-g · 2025-11-14T12:21:27Z

arrow-row/src/lib.rs

+                offsets,
+                mode,
+            } => {
+                let union_array = array.as_any().downcast_ref::<UnionArray>().unwrap();


Suggested change

let union_array = array.as_any().downcast_ref::<UnionArray>().unwrap();

let union_array = array.as_any().downcast_ref::<UnionArray>().expect("expected Union array");

as at line 631

martin-g · 2025-11-14T12:22:54Z

arrow-row/src/lib.rs

+
+                let mut child_rows = Vec::with_capacity(converters.len());
+                for (type_id, converter) in converters.iter().enumerate() {
+                    let child_array = union_array.child(type_id as i8);


Here type_id is the index of the converter. It looks strange but it might be OK.
Could you use the items in type_ids instead?

martin-g · 2025-11-14T12:26:11Z

arrow-row/src/lib.rs

+            offsets: offsets_buf,
+            mode,
+        } => {
+            let _union_array = column.as_any().downcast_ref::<UnionArray>().unwrap();


Suggested change

let _union_array = column.as_any().downcast_ref::<UnionArray>().unwrap();

Remove this line, since `_union_array` is not used.

martin-g · 2025-11-14T12:29:58Z

arrow-row/src/lib.rs

+            let len = rows.len();
+
+            let DataType::Union(union_fields, mode) = &field.data_type else {
+                unreachable!()


Suggested change

unreachable!()

unreachable!("Expected a Union but got: {}", &field.data_type)

martin-g · 2025-11-14T12:32:25Z

arrow-row/src/lib.rs

            }
+            DataType::Union(fields, mode) => {
+                // similar to dictionaries and lists, we set descending to false and negate nulls_first
+                // since the encodedc ontents will be inverted if descending is set


Suggested change

// since the encodedc ontents will be inverted if descending is set

// since the encoded contents will be inverted if descending is set

alamb

Thanks @friendlymatthew -- this is looking good. I left some comments, and @martin-g's comments are good to review too.

alamb · 2025-11-19T18:48:38Z

arrow-row/src/lib.rs

+
+                let mut child_rows = Vec::with_capacity(converters.len());
+                for (type_id, converter) in converters.iter().enumerate() {
+                    let child_array = union_array.child(type_id as i8);


I think it makes sense because the type_id is the index of the child field types

Maybe we can document better that converters is indexed by type_id 🤔
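
One possible way to document that invariant, sketched with std-only stand-ins. `ChildConverter`, `UnionConverters`, and `for_type_id` are hypothetical names, not types from this PR; the point is only that the doc comment states the indexing rule and the lookup refuses ids that fall outside it.

```rust
/// Hypothetical stand-in for a per-child converter.
struct ChildConverter {
    name: &'static str,
}

/// Invariant (assumed for this sketch): `converters[t]` handles the child
/// whose union type id is `t`, so the vector is indexed by type id.
struct UnionConverters {
    converters: Vec<ChildConverter>,
}

impl UnionConverters {
    /// Look up the converter for `type_id`, returning `None` for negative
    /// or out-of-range ids instead of panicking.
    fn for_type_id(&self, type_id: i8) -> Option<&ChildConverter> {
        if type_id < 0 {
            None
        } else {
            self.converters.get(type_id as usize)
        }
    }
}
```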

alamb · 2025-11-19T18:51:01Z

arrow-row/src/lib.rs

+    Union {
+        child_rows: Vec<Rows>,
+        type_ids: ScalarBuffer<i8>,
+        offsets: Option<ScalarBuffer<i32>>,


Strictly speaking, the mode is redundant here -- if there are no offsets, then the mode is sparse; otherwise the mode is dense. You could probably simplify the code if you removed the redundancy.
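
A rough sketch of what dropping the redundant field could look like, using std-only stand-ins rather than the real types. `Mode`, `UnionRows`, `mode`, and `child_slot` are hypothetical names, and `Vec` stands in for `ScalarBuffer`.

```rust
/// Hypothetical stand-in for arrow's `UnionMode`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Sparse,
    Dense,
}

/// Simplified analogue of the `Union` variant without a stored `mode`.
struct UnionRows {
    type_ids: Vec<i8>,
    offsets: Option<Vec<i32>>,
}

impl UnionRows {
    /// The mode is derived: offsets present means dense, absent means sparse.
    fn mode(&self) -> Mode {
        if self.offsets.is_some() {
            Mode::Dense
        } else {
            Mode::Sparse
        }
    }

    /// For row `i`, return the type id and the row index inside that child.
    /// With only `offsets` to match on, no "invalid combination" arm is needed.
    fn child_slot(&self, i: usize) -> (i8, usize) {
        let child_row = match &self.offsets {
            Some(o) => o[i] as usize, // dense: offset into the child
            None => i,                // sparse: children share the row index
        };
        (self.type_ids[i], child_row)
    }
}
```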

alamb · 2025-11-19T18:52:19Z

arrow-row/src/lib.rs

+                        (UnionMode::Dense, Some(o)) => o[i] as usize,
+                        (UnionMode::Sparse, None) => i,
+                        foreign => {
+                            unreachable!("invalid union mode/offsets combination: {foreign:?}")


see above for a way to simplify this (don't hold mode too)

alamb · 2025-11-19T18:55:07Z

arrow-row/src/lib.rs

    }
+
+    #[test]
+    fn test_sparse_union() {


Can you please also add tests here for union arrays that have nulls? Specifically for a union array that has a null buffer.

alamb · 2025-11-19T18:55:28Z

arrow-row/src/lib.rs

+
+            for (idx, row) in rows.iter_mut().enumerate() {
+                // skip the null sentinel
+                let mut cursor = 1;


I think you need to look at the null byte to recover nulls 🤔
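
Something along these lines, as a std-only sketch: read the sentinel before advancing the cursor and keep the result so a validity mask can be rebuilt. `VALID_SENTINEL`, `decode_header`, and `recover_validity` are hypothetical, and the sketch assumes a valid row is marked with `1` and any other value means null.

```rust
// Assumption for this sketch: a valid row starts with 1, anything else is null.
const VALID_SENTINEL: u8 = 1;

/// Read the two-byte header of an encoded union row.
/// Returns `(is_valid, type_id, cursor)`, where `cursor` is the offset of
/// the first child byte.
fn decode_header(row: &[u8]) -> (bool, i8, usize) {
    // Inspect the null sentinel instead of skipping it.
    let is_valid = row[0] == VALID_SENTINEL;
    let mut cursor = 1;
    let type_id = row[cursor] as i8;
    cursor += 1;
    (is_valid, type_id, cursor)
}

/// Rebuild a per-row validity mask from the sentinel bytes.
fn recover_validity(rows: &[&[u8]]) -> Vec<bool> {
    rows.iter().map(|row| decode_header(row).0).collect()
}
```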

github-actions bot added the arrow Changes to the arrow crate label Nov 13, 2025

friendlymatthew force-pushed the friendlymatthew/union-row-converter branch 2 times, most recently from 9a62f3c to 2cd0253 Compare November 13, 2025 22:03

Initial implementation of union row converter

5559011

friendlymatthew force-pushed the friendlymatthew/union-row-converter branch from 2cd0253 to 5559011 Compare November 13, 2025 22:09

martin-g reviewed Nov 14, 2025

Properly encode null sentinel

e365f92

alamb mentioned this pull request Nov 14, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-17 apache/datafusion#18711

Open

39 tasks

This was referenced Nov 14, 2025

Make UnionArrays hashable apache/datafusion#18717

Closed

Hash UnionArrays apache/datafusion#18718

Merged

Add UnionArray tests exercising hashing, group-by, distinct, and aggregates apache/datafusion#18791

Open

alamb reviewed Nov 19, 2025
