-
Notifications
You must be signed in to change notification settings - Fork 1k
builder: Error when concatenating binary arrays would exceed offset size #8252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
builder: Error when concatenating binary arrays would exceed offset size #8252
Conversation
| /// [`LargeStringArray`]: crate::array::LargeStringArray | ||
| pub trait OffsetSizeTrait: ArrowNativeType + std::ops::AddAssign + Integer { | ||
| pub trait OffsetSizeTrait: | ||
| ArrowNativeType + std::ops::AddAssign + Integer + num::CheckedAdd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so familiar with this so I don't know whether a num::CheckedAdd okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ok in general, but this might be a "breaking API change" potentially if any downstream crates have implemented the OffsetSizeTrait
However, that seems unlikely so maybe it is ok
| // and reserve the necessary capacity, it's still slower) | ||
| let mut intermediate = Vec::with_capacity(offsets.len() - 1); | ||
|
|
||
| if shift.checked_add(&offsets[offsets.len() - 1]).is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just check once using the last-offset
| /// (this means that underlying null values are copied as is). | ||
| #[inline] | ||
| pub fn append_array(&mut self, array: &GenericByteArray<T>) { | ||
| pub fn append_array(&mut self, array: &GenericByteArray<T>) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note:
// Shifting all the offsets
let shift: T::Offset = self.next_offset() - offsets[0];
The self.next_offset() would read the length of binary builder, if:
- There're k batches, and the batch before
kth batch exceeds the offsets, when merge next array,self.next_offset()would raise error, causing aexpectcalled inself.next_offset() - If final batch exceeds, it would leaving overflow offsets in
intermediate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is also a public API change (and thus would have to wait for the next major release (scheduled for october)
https://docs.rs/arrow/latest/arrow/array/struct.GenericByteBuilder.html#method.append_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok to me, I'm using this patch in my own branch now
…pleFU/arrow-rs into enhance-append-handling-in-byte-array
|
@alamb would you mind take a look? |
|
🤖 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mapleFU -- the code here looks great to me. I kicked off the benchmarks to make sure, but I don't expect any change.
I think the only potential concern is that these are API changes so we will have to be a bit diligent about when we merge them in
| /// [`LargeStringArray`]: crate::array::LargeStringArray | ||
| pub trait OffsetSizeTrait: ArrowNativeType + std::ops::AddAssign + Integer { | ||
| pub trait OffsetSizeTrait: | ||
| ArrowNativeType + std::ops::AddAssign + Integer + num::CheckedAdd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ok in general, but this might be a "breaking API change" potentially if any downstream crates have implemented the OffsetSizeTrait
However, that seems unlikely so maybe it is ok
| /// (this means that underlying null values are copied as is). | ||
| #[inline] | ||
| pub fn append_array(&mut self, array: &GenericByteArray<T>) { | ||
| pub fn append_array(&mut self, array: &GenericByteArray<T>) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is also a public API change (and thus would have to wait for the next major release (scheduled for october)
https://docs.rs/arrow/latest/arrow/array/struct.GenericByteBuilder.html#method.append_array
|
cc @rluvaton who I think contributed to this code recently. If you have a moment to review I would appreciate it |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
looking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you
|
this contain breaking changes, could you please update the pr description to reflect that (there are 2 breaking changes, append_array now return different types and offset size trait change. also the pr name should be byte array rather than binary array, but this is just nitpick |
|
done |
|
Thank you for your patience @mapleFU |
Which issue does this PR close?
Rationale for this change
When concat array, the final value might:
What changes are included in this PR?
Prevent from offset here
Are these changes tested?
Are there any user-facing changes?
Breaking changes:
Result<()>