Conversation

@mapleFU
Member

@mapleFU mapleFU commented Aug 29, 2025

Which issue does this PR close?

Rationale for this change

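When concatenating arrays, the final offset can exceed the range of the offset type. One of two things then happens:

  1. the addition hits an unwrap/expect and panics, or
  2. in the unlucky case nothing panics, and a negative offset is left at the end of the offsets buffer, which causes a core dump.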
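
The second case can be illustrated with plain i32 arithmetic. This is only a sketch: the function names are hypothetical and it does not use the arrow builder itself. It assumes the addition wraps on overflow, as `+` does in a build without overflow checks.

```rust
/// Unchecked-style addition: wraps around on overflow, so the
/// resulting "offset" can become negative.
fn shift_wrapping(next_offset: i32, appended_len: i32) -> i32 {
    next_offset.wrapping_add(appended_len)
}

/// Checked addition: returns `None` when the sum does not fit in `i32`.
fn shift_checked(next_offset: i32, appended_len: i32) -> Option<i32> {
    next_offset.checked_add(appended_len)
}
```

With a builder that already holds `i32::MAX - 10` bytes, appending 100 more bytes gives a negative value from `shift_wrapping`, while `shift_checked` returns `None` and lets the caller report an error.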

What changes are included in this PR?

Check for offset overflow here and return an error instead of producing invalid offsets.

Are these changes tested?

  • To be added (I'm not sure whether the memory needed for such a test would be too large)

Are there any user-facing changes?

Breaking changes:

  1. append_array now returns Result<(), ArrowError>
  2. OffsetSizeTrait now requires num::CheckedAdd

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 29, 2025
/// [`LargeStringArray`]: crate::array::LargeStringArray
- pub trait OffsetSizeTrait: ArrowNativeType + std::ops::AddAssign + Integer {
+ pub trait OffsetSizeTrait:
+     ArrowNativeType + std::ops::AddAssign + Integer + num::CheckedAdd
Member Author

I'm not very familiar with this, so I don't know whether requiring num::CheckedAdd is okay.

Contributor

I think it is ok in general, but this might be a "breaking API change" potentially if any downstream crates have implemented the OffsetSizeTrait

However, that seems unlikely so maybe it is ok
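
For readers wondering why an extra supertrait bound can break downstream code, here is a minimal sketch. It uses local stand-ins (`CheckedAdd`, `OffsetLike`, `MyOffset` are all hypothetical names), not the real `num::CheckedAdd` or `OffsetSizeTrait`, so it compiles without any crates.

```rust
/// Local stand-in for `num::CheckedAdd` (assumption: same method shape).
trait CheckedAdd: Sized {
    fn checked_add(&self, rhs: &Self) -> Option<Self>;
}

/// Stand-in for an offset trait that has gained the `CheckedAdd` bound.
trait OffsetLike: Copy + CheckedAdd {}

/// A hypothetical offset type defined in a downstream crate.
#[derive(Clone, Copy, Debug, PartialEq)]
struct MyOffset(i16);

// Before the bound existed, `impl OffsetLike for MyOffset {}` was enough.
// With the bound, the downstream crate must also supply this impl,
// otherwise it no longer compiles.
impl CheckedAdd for MyOffset {
    fn checked_add(&self, rhs: &Self) -> Option<Self> {
        // Delegates to the inherent `i16::checked_add`.
        self.0.checked_add(rhs.0).map(MyOffset)
    }
}

impl OffsetLike for MyOffset {}

/// Generic code can now use checked addition for any offset type.
fn shifted<O: OffsetLike>(shift: O, last: O) -> Option<O> {
    shift.checked_add(&last)
}
```

Types that already implement the new bound, as the built-in integer types do for `num::CheckedAdd`, are unaffected; only a downstream type missing that impl would stop compiling.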

// and reserve the necessary capacity, it's still slower)
let mut intermediate = Vec::with_capacity(offsets.len() - 1);

if shift.checked_add(&offsets[offsets.len() - 1]).is_none() {
Member Author

This checks only once, using the last offset: offsets are non-decreasing, so if the shifted last offset fits, all the earlier ones fit too.
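
A rough sketch of that single-check idea, written against plain `i32` slices rather than the builder's real types. The function name `shift_offsets` and the `String` error are assumptions made for illustration; the PR itself returns `ArrowError`.

```rust
/// Shifts every offset after the first by `shift`, checking for overflow once.
///
/// Assumes `offsets` is non-decreasing, so the last entry is the largest.
fn shift_offsets(offsets: &[i32], shift: i32) -> Result<Vec<i32>, String> {
    let last = *offsets.last().ok_or("offsets must not be empty")?;

    // One check on the largest offset covers all the others.
    if shift.checked_add(last).is_none() {
        return Err(format!("offset overflow: {} + {} does not fit in i32", shift, last));
    }

    // Every offset is <= last, and shift + last was just shown to fit,
    // so these plain additions cannot overflow.
    Ok(offsets[1..].iter().map(|&o| o + shift).collect())
}
```

For example, `shift_offsets(&[0, 3, 7], 10)` yields `[13, 17]`, while a shift of `i32::MAX - 5` on the same input returns an error before any offset is written.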

@mapleFU mapleFU requested a review from alamb August 29, 2025 19:16
/// (this means that underlying null values are copied as is).
#[inline]
- pub fn append_array(&mut self, array: &GenericByteArray<T>) {
+ pub fn append_array(&mut self, array: &GenericByteArray<T>) -> Result<(), ArrowError> {
Member Author

Note:

// Shifting all the offsets
let shift: T::Offset = self.next_offset() - offsets[0];

self.next_offset() reads the current length of the binary builder's value buffer, so:

  1. With k batches, if the batches before the k-th batch already exceed the offset range, then merging the next array makes self.next_offset() fail, triggering the expect inside self.next_offset().
  2. If only the final batch exceeds the range, overflowed offsets are left in intermediate.
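
Both cases can be reproduced with a toy model that tracks only lengths, so no large allocation is needed. Everything here is hypothetical (`ToyBuilder`, its fields, `append_unchecked`); it models the behavior before the fix, and returns an `Err` where the real `next_offset()` is described above as calling `expect`.

```rust
use std::convert::TryFrom;

/// Toy model of a byte builder before the fix: lengths only, no data.
struct ToyBuilder {
    value_len: usize,  // total bytes in the value buffer
    offsets: Vec<i32>, // offsets written so far
}

impl ToyBuilder {
    fn new() -> Self {
        Self { value_len: 0, offsets: vec![0] }
    }

    /// Models `next_offset()`: the conversion fails once the buffer
    /// length no longer fits in `i32`.
    fn next_offset(&self) -> Result<i32, String> {
        i32::try_from(self.value_len).map_err(|_| "offset overflow".to_string())
    }

    /// Unchecked append of a batch holding a single value of `len` bytes.
    fn append_unchecked(&mut self, len: i32) -> Result<(), String> {
        // Case 1 surfaces here, on the call after the range was exceeded.
        let shift = self.next_offset()?;
        // Case 2: the new offset may wrap around to a negative value.
        self.offsets.push(shift.wrapping_add(len));
        self.value_len += len as usize;
        Ok(())
    }
}
```

Appending `i32::MAX - 10` bytes and then 100 bytes succeeds but leaves a negative last offset (case 2); a third append then fails in `next_offset()` (case 1).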

Contributor

I believe this is also a public API change, and thus would have to wait for the next major release (scheduled for October).

https://docs.rs/arrow/latest/arrow/array/struct.GenericByteBuilder.html#method.append_array

Member Author

That's fine with me; I'm using this patch in my own branch for now.

@mapleFU
Member Author

mapleFU commented Sep 3, 2025

@alamb would you mind taking a look?

@alamb
Contributor

alamb commented Sep 5, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing enhance-append-handling-in-byte-array (12a9ba8) to 3dcd23f diff
BENCH_NAME=concatenate_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=enhance-append-handling-in-byte-array
Results will be posted here when complete

Contributor

@alamb alamb left a comment

Thanks @mapleFU -- the code here looks great to me. I kicked off the benchmarks to make sure, but I don't expect any change.

I think the only potential concern is that these are API changes so we will have to be a bit diligent about when we merge them in

@alamb alamb added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Sep 5, 2025
@alamb alamb changed the title from "builder: Trying to fix binary array offset exceeds when building" to "builder: Error when concatenating binary arrays would exceed offset size" Sep 5, 2025
@alamb
Contributor

alamb commented Sep 5, 2025

cc @rluvaton who I think contributed to this code recently. If you have a moment to review I would appreciate it

@alamb
Contributor

alamb commented Sep 5, 2025

🤖: Benchmark completed

Details

group                                                          enhance-append-handling-in-byte-array    main
-----                                                          -------------------------------------    ----
concat 1024 arrays boolean 4                                   1.00     27.9±0.09µs        ? ?/sec      1.03     28.6±0.16µs        ? ?/sec
concat 1024 arrays i32 4                                       1.00     13.5±0.16µs        ? ?/sec      1.12     15.1±0.16µs        ? ?/sec
concat 1024 arrays str 4                                       1.00     54.3±0.75µs        ? ?/sec      1.01     54.9±0.33µs        ? ?/sec
concat boolean 1024                                            1.04    428.2±0.73ns        ? ?/sec      1.00    412.8±0.84ns        ? ?/sec
concat boolean 8192 over 100 arrays                            1.15     51.0±0.08µs        ? ?/sec      1.00     44.4±0.12µs        ? ?/sec
concat boolean nulls 1024                                      1.05    756.8±1.96ns        ? ?/sec      1.00    723.6±1.39ns        ? ?/sec
concat boolean nulls 8192 over 100 arrays                      1.14    109.5±0.22µs        ? ?/sec      1.00     96.1±0.23µs        ? ?/sec
concat fixed size lists                                        1.00   735.8±48.98µs        ? ?/sec      1.11   815.7±47.85µs        ? ?/sec
concat i32 1024                                                1.00    390.3±4.03ns        ? ?/sec      1.01    394.5±1.05ns        ? ?/sec
concat i32 8192 over 100 arrays                                1.04    210.8±8.55µs        ? ?/sec      1.00    202.7±4.38µs        ? ?/sec
concat i32 nulls 1024                                          1.01    708.7±1.80ns        ? ?/sec      1.00    700.5±1.68ns        ? ?/sec
concat i32 nulls 8192 over 100 arrays                          1.00    266.7±5.53µs        ? ?/sec      1.05    279.8±9.08µs        ? ?/sec
concat str 1024                                                1.00     14.3±1.12µs        ? ?/sec      1.12     16.0±0.88µs        ? ?/sec
concat str 8192 over 100 arrays                                1.01    106.8±0.91ms        ? ?/sec      1.00    106.2±0.83ms        ? ?/sec
concat str nulls 1024                                          1.00      7.1±0.74µs        ? ?/sec      1.03      7.3±0.68µs        ? ?/sec
concat str nulls 8192 over 100 arrays                          1.00     54.3±0.51ms        ? ?/sec      1.00     54.2±0.49ms        ? ?/sec
concat str_dict 1024                                           1.00      2.8±0.02µs        ? ?/sec      1.06      3.0±0.02µs        ? ?/sec
concat str_dict_sparse 1024                                    1.00      6.8±0.04µs        ? ?/sec      1.02      7.0±0.02µs        ? ?/sec
concat struct with int32 and dicts size=1024 count=2           1.00      6.9±0.12µs        ? ?/sec      1.06      7.3±0.03µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0               1.01     78.0±1.07µs        ? ?/sec      1.00     77.6±0.71µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2             1.01     84.4±0.70µs        ? ?/sec      1.00     83.4±0.94µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                1.00     77.6±0.49µs        ? ?/sec      1.14     88.6±0.41µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2              1.00     84.5±1.11µs        ? ?/sec      1.12     94.4±0.41µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00     39.8±3.86µs        ? ?/sec      1.17     46.5±2.88µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00     49.5±4.31µs        ? ?/sec      1.05     51.9±2.30µs        ? ?/sec

@alamb
Contributor

alamb commented Sep 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing enhance-append-handling-in-byte-array (12a9ba8) to 3dcd23f diff
BENCH_NAME=concatenate_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=enhance-append-handling-in-byte-array
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 6, 2025

🤖: Benchmark completed

Details

group                                                          enhance-append-handling-in-byte-array    main
-----                                                          -------------------------------------    ----
concat 1024 arrays boolean 4                                   1.00     27.9±0.18µs        ? ?/sec      1.02     28.6±0.06µs        ? ?/sec
concat 1024 arrays i32 4                                       1.00     13.5±0.10µs        ? ?/sec      1.14     15.4±0.03µs        ? ?/sec
concat 1024 arrays str 4                                       1.00     54.7±0.60µs        ? ?/sec      1.01     55.1±0.62µs        ? ?/sec
concat boolean 1024                                            1.10    436.4±3.03ns        ? ?/sec      1.00    395.4±0.63ns        ? ?/sec
concat boolean 8192 over 100 arrays                            1.15     51.0±0.43µs        ? ?/sec      1.00     44.4±0.09µs        ? ?/sec
concat boolean nulls 1024                                      1.08    770.4±4.87ns        ? ?/sec      1.00    711.7±0.75ns        ? ?/sec
concat boolean nulls 8192 over 100 arrays                      1.14    109.9±0.94µs        ? ?/sec      1.00     96.1±0.14µs        ? ?/sec
concat fixed size lists                                        1.00   728.4±18.06µs        ? ?/sec      1.04   759.9±48.69µs        ? ?/sec
concat i32 1024                                                1.00    391.1±0.74ns        ? ?/sec      1.05    411.4±1.36ns        ? ?/sec
concat i32 8192 over 100 arrays                                1.11    228.1±3.29µs        ? ?/sec      1.00    204.8±6.84µs        ? ?/sec
concat i32 nulls 1024                                          1.00    708.9±9.63ns        ? ?/sec      1.03   733.5±13.59ns        ? ?/sec
concat i32 nulls 8192 over 100 arrays                          1.08    297.1±2.57µs        ? ?/sec      1.00    275.2±8.93µs        ? ?/sec
concat str 1024                                                1.00     14.5±1.14µs        ? ?/sec      1.02     14.8±1.02µs        ? ?/sec
concat str 8192 over 100 arrays                                1.00    104.2±0.99ms        ? ?/sec      1.00    104.5±0.58ms        ? ?/sec
concat str nulls 1024                                          1.00      7.0±1.08µs        ? ?/sec      1.04      7.3±0.61µs        ? ?/sec
concat str nulls 8192 over 100 arrays                          1.00     52.6±0.60ms        ? ?/sec      1.00     52.8±0.74ms        ? ?/sec
concat str_dict 1024                                           1.00      2.8±0.01µs        ? ?/sec      1.04      2.9±0.01µs        ? ?/sec
concat str_dict_sparse 1024                                    1.00      6.8±0.03µs        ? ?/sec      1.02      7.0±0.01µs        ? ?/sec
concat struct with int32 and dicts size=1024 count=2           1.00      6.7±0.17µs        ? ?/sec      1.04      6.9±0.18µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0               1.00     77.2±0.25µs        ? ?/sec      1.00     77.4±0.29µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2             1.01     84.0±0.42µs        ? ?/sec      1.00     83.2±0.37µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                1.00     77.2±0.57µs        ? ?/sec      1.15     88.9±0.28µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2              1.00     83.4±0.53µs        ? ?/sec      1.14     94.7±0.48µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00     46.1±2.76µs        ? ?/sec      1.01     46.6±2.65µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00     51.0±2.31µs        ? ?/sec      1.04     53.2±2.55µs        ? ?/sec

@rluvaton
Member

rluvaton commented Sep 6, 2025

> cc @rluvaton who I think contributed to this code recently. If you have a moment to review I would appreciate it

looking

Member

@rluvaton rluvaton left a comment

LGTM, thank you

@rluvaton
Member

rluvaton commented Sep 6, 2025

This contains breaking changes. Could you please update the PR description to reflect that? There are two: append_array now returns a different type, and the offset size trait has changed.

Also, the PR name should say byte array rather than binary array, but this is just a nitpick.

@mapleFU
Member Author

mapleFU commented Sep 7, 2025

done

@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 15, 2025
@alamb alamb merged commit 28c7c52 into apache:main Sep 25, 2025
26 checks passed
@alamb
Contributor

alamb commented Sep 25, 2025

Thank you for your patience @mapleFU

Development

Successfully merging this pull request may close these issues.

compute::concat_batches might produces String exceeds i32 offsets

4 participants