Merged
13 changes: 6 additions & 7 deletions arrow-buffer/src/buffer/mutable.rs
@@ -557,20 +557,19 @@ impl MutableBuffer {
     /// as it eliminates the conditional `Iterator::next`
     #[inline]
     pub fn collect_bool<F: FnMut(usize) -> bool>(len: usize, mut f: F) -> Self {
-        let mut buffer = Self::new(bit_util::ceil(len, 64) * 8);
+        let mut buffer: Vec<u64> = Vec::with_capacity(bit_util::ceil(len, 64));

         let chunks = len / 64;
         let remainder = len % 64;
-        for chunk in 0..chunks {
+        buffer.extend((0..chunks).map(|chunk| {
             let mut packed = 0;
             for bit_idx in 0..64 {
                 let i = bit_idx + chunk * 64;
                 packed |= (f(i) as u64) << bit_idx;
             }

-            // SAFETY: Already allocated sufficient capacity
-            unsafe { buffer.push_unchecked(packed) }
-        }
+            packed
+        }));

         if remainder != 0 {
             let mut packed = 0;
@@ -579,10 +578,10 @@ impl MutableBuffer {
                 packed |= (f(i) as u64) << bit_idx;
             }

-            // SAFETY: Already allocated sufficient capacity
-            unsafe { buffer.push_unchecked(packed) }
+            buffer.push(packed)
         }

+        let mut buffer: MutableBuffer = buffer.into();
         buffer.truncate(bit_util::ceil(len, 8));
         buffer
     }
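
For readers without the arrow-rs sources at hand, the packing scheme in this hunk can be sketched with the standard library alone. This is an illustrative stand-in, not the crate's code: `pack_bools` is a hypothetical name, and it returns a plain `Vec<u64>` instead of converting into a `MutableBuffer` and truncating to whole bytes.

```rust
/// Packs `f(0), f(1), ..., f(len - 1)` into 64-bit words, least significant bit first.
/// Hypothetical stand-alone helper (not part of arrow-buffer).
fn pack_bools<F: FnMut(usize) -> bool>(len: usize, mut f: F) -> Vec<u64> {
    let chunks = len / 64;
    let remainder = len % 64;
    // One word per full chunk, plus one for the partial tail (if any).
    let mut words: Vec<u64> = Vec::with_capacity(chunks + usize::from(remainder != 0));

    // The mapped range has an exactly known length, which gives `extend`
    // the chance to write the words without a capacity check per element.
    words.extend((0..chunks).map(|chunk| {
        let mut packed = 0u64;
        for bit_idx in 0..64 {
            packed |= (f(bit_idx + chunk * 64) as u64) << bit_idx;
        }
        packed
    }));

    // Partial tail: only the low `remainder` bits are populated.
    if remainder != 0 {
        let mut packed = 0u64;
        for bit_idx in 0..remainder {
            packed |= (f(bit_idx + chunks * 64) as u64) << bit_idx;
        }
        words.push(packed);
    }
    words
}
```

Calling `pack_bools(3, |i| i != 1)` sets bits 0 and 2 of a single word, so the result is one `u64` with value `0b101`.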
16 changes: 7 additions & 9 deletions arrow-ord/src/cmp.rs
@@ -30,7 +30,7 @@ use arrow_array::{
     GenericByteArray, GenericByteViewArray, downcast_primitive_array,
 };
 use arrow_buffer::bit_util::ceil;
-use arrow_buffer::{BooleanBuffer, MutableBuffer, NullBuffer};
+use arrow_buffer::{BooleanBuffer, NullBuffer};
 use arrow_schema::ArrowError;
 use arrow_select::take::take;
 use std::cmp::Ordering;
@@ -390,14 +390,14 @@ fn take_bits(v: &dyn AnyDictionaryArray, buffer: BooleanBuffer) -> BooleanBuffer

 /// Invokes `f` with values `0..len` collecting the boolean results into a new `BooleanBuffer`
 ///
-/// This is similar to [`MutableBuffer::collect_bool`] but with
+/// This is similar to [`arrow_buffer::MutableBuffer::collect_bool`] but with
 /// the option to efficiently negate the result
 fn collect_bool(len: usize, neg: bool, f: impl Fn(usize) -> bool) -> BooleanBuffer {
-    let mut buffer = MutableBuffer::new(ceil(len, 64) * 8);
+    let mut buffer = Vec::with_capacity(ceil(len, 64));
Contributor

Maybe we could make neg a generic argument on MutableBuffer::collect_bool and so avoid the duplication (as a follow-on PR).

Or maybe add a collect_bool function in bit_util that returns a Vec, and have both MutableBuffer and this function call it 🤔

Contributor Author

That makes sense, I was thinking about it as well.
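
As a rough illustration of the first suggestion, the negation flag could become a const generic on a shared helper that returns a `Vec<u64>`. Everything below is an assumption made for illustration: `pack_bools_neg` is a hypothetical name, and neither arrow-buffer nor arrow-ord is known to expose such a function.

```rust
/// Hypothetical shared helper: packs `f(i)` for `i in 0..len` into 64-bit words
/// (least significant bit first), inverting every word when `NEG` is true.
fn pack_bools_neg<const NEG: bool>(len: usize, f: impl Fn(usize) -> bool) -> Vec<u64> {
    let chunks = len / 64;
    let remainder = len % 64;
    let mut words = Vec::with_capacity(chunks + usize::from(remainder != 0));

    // Packs `bits` results starting at position `start` into one word.
    // `NEG` is a compile-time constant, so each instantiation keeps only one branch.
    let pack = |start: usize, bits: usize| -> u64 {
        let mut packed = 0u64;
        for bit_idx in 0..bits {
            packed |= (f(start + bit_idx) as u64) << bit_idx;
        }
        if NEG { !packed } else { packed }
    };

    words.extend((0..chunks).map(|chunk| pack(chunk * 64, 64)));
    if remainder != 0 {
        // Note: with `NEG`, bits past `len` in the tail word end up set;
        // callers are assumed to ignore them, as a length-bounded bitmap would.
        words.push(pack(chunks * 64, remainder));
    }
    words
}
```

A caller holding a runtime `neg: bool` would branch once, for example `if neg { pack_bools_neg::<true>(len, f) } else { pack_bools_neg::<false>(len, f) }`, which moves the check out of the per-word loop at the cost of compiling the body twice.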


     let chunks = len / 64;
     let remainder = len % 64;
-    for chunk in 0..chunks {
+    buffer.extend((0..chunks).map(|chunk| {
         let mut packed = 0;
         for bit_idx in 0..64 {
             let i = bit_idx + chunk * 64;
@@ -407,9 +407,8 @@ fn collect_bool(len: usize, neg: bool, f: impl Fn(usize) -> bool) -> BooleanBuff
             packed = !packed
         }

-        // SAFETY: Already allocated sufficient capacity
-        unsafe { buffer.push_unchecked(packed) }
-    }
+        packed
+    }));

     if remainder != 0 {
         let mut packed = 0;
@@ -421,8 +420,7 @@ fn collect_bool(len: usize, neg: bool, f: impl Fn(usize) -> bool) -> BooleanBuff
             packed = !packed
         }

-        // SAFETY: Already allocated sufficient capacity
-        unsafe { buffer.push_unchecked(packed) }
+        buffer.push(packed);
     }
     BooleanBuffer::new(buffer.into(), 0, len)
 }
13 changes: 8 additions & 5 deletions arrow-select/src/take.rs
@@ -422,9 +422,10 @@ fn take_native<T: ArrowNativeType, I: ArrowPrimitiveType>(
             .enumerate()
             .map(|(idx, index)| match values.get(index.as_usize()) {
                 Some(v) => *v,
-                None => match n.is_null(idx) {
-                    true => T::default(),
-                    false => panic!("Out-of-bounds index {index:?}"),
+                // SAFETY: idx<indices.len()
Contributor

I had to read this several times to convince myself it is correct: idx is not a value taken from indices (which the user provides) but the position produced by enumerating indices, so it is always less than indices.len().

I found this whole method pretty confusing, as there are multiple things called values and indices (and indices.values()).

I also double-checked that there is a test for out-of-bounds indexes here:
https://github.com/apache/arrow-rs/blob/f93da94e61e731344ce84146dee946a94fe36602/arrow-select/src/take.rs#L2084-L2083

+                None => match unsafe { n.inner().value_unchecked(idx) } {
+                    false => T::default(),
Contributor

@Dandandan do you know why this even bothers to look at the null buffer again? I realize you didn't change this code, but it seems to me that checking n.inner() (the nulls) is unnecessary: it was already implicitly checked by calling values.get() (which returns Some/None).

It seems like all this does is re-check that value() and the nulls match up.

So, TL;DR: I suggest we remove this clause entirely (could be a follow-on PR).

Contributor Author

I think we can't remove it: it also checks whether the index value itself is null, so that an out-of-bounds index in a non-null slot still leads to a panic.

So currently:

  • out of bounds with a null index value => default value (+ null in the output)
  • out of bounds with a non-null index value => panic

We could consider making out of bounds either always panic or always give a default (0) value, but the current API (and tests) require it to be this way.
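
The two cases above can be modelled without any Arrow types. This is a simplified sketch under stated assumptions: `take_model` is a hypothetical name, plain `usize` indices stand in for the index array, and a `&[bool]` slice stands in for the validity bitmap of the indices.

```rust
/// Hypothetical simplified model of the behaviour described above.
/// `index_valid[i] == false` means the i-th index slot is null.
fn take_model<T: Copy + Default>(values: &[T], indices: &[usize], index_valid: &[bool]) -> Vec<T> {
    assert_eq!(indices.len(), index_valid.len());
    indices
        .iter()
        .enumerate()
        .map(|(idx, &index)| match values.get(index) {
            // In bounds: copy the value without consulting the validity of the slot.
            Some(v) => *v,
            // Out of bounds: tolerated only when the index slot itself is null.
            None => match index_valid[idx] {
                false => T::default(),
                true => panic!("Out-of-bounds index {index:?}"),
            },
        })
        .collect()
}
```

With `values = [10, 20, 30]`, `indices = [2, 99, 0]` and the middle slot marked null, the model yields `[30, 0, 10]`; marking that same slot as valid makes it panic instead.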

Contributor

The logic looks correct to me, but it is not really intuitive. In our microbenchmarks the current code might be faster because it avoids a branch on the validity bit in the happy case, but I'm not sure it will still be faster on larger inputs, or when a larger share of the indices is null.

I would find the following more intuitive, and hopefully not much slower (slice, range and zip iteration should all be TrustedLen):

    indices
        .values()
        .iter()
        .zip((0..n.len()).map(move |i| unsafe { n.inner().value_unchecked(i) }))
        .map(|(index, valid)| if valid {
            values[index.as_usize()]
        } else {
            T::default()
        })
        .collect()

Contributor

I made a PR with this change:

+                    true => panic!("Out-of-bounds index {index:?}"),
                 },
             })
             .collect(),
@@ -448,8 +449,10 @@ fn take_bits<I: ArrowPrimitiveType>(
             let mut output_buffer = MutableBuffer::new_null(len);
             let output_slice = output_buffer.as_slice_mut();
             nulls.valid_indices().for_each(|idx| {
-                if values.value(indices.value(idx).as_usize()) {
-                    bit_util::set_bit(output_slice, idx);
+                // SAFETY: idx is a valid index in indices.nulls() --> idx<indices.len()
+                if values.value(unsafe { indices.value_unchecked(idx).as_usize() }) {
+                    // SAFETY: MutableBuffer was created with space for indices.len() bit, and idx < indices.len()
+                    unsafe { bit_util::set_bit_raw(output_slice.as_mut_ptr(), idx) };
                 }
             });
             BooleanBuffer::new(output_buffer.into(), 0, len)
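
The loop in this hunk can be approximated in safe, std-only Rust to show what the two SAFETY comments rely on. This is a sketch, not the crate's code: `take_bits_model` is a hypothetical name, a `&[bool]` slice stands in for the boolean values, and `valid_positions` stands in for what `nulls.valid_indices()` yields.

```rust
/// Hypothetical simplified model of the loop above, for non-null positions only.
/// Returns a packed bitmap (least significant bit first) with one bit per index.
fn take_bits_model(values: &[bool], indices: &[usize], valid_positions: &[usize]) -> Vec<u8> {
    let len = indices.len();
    // Zero-initialised output with room for `len` bits.
    let mut out = vec![0u8; (len + 7) / 8];
    for &idx in valid_positions {
        // Checked indexing throughout. The diff replaces the index lookup and
        // the bit write with unchecked variants on the grounds that
        // `idx < len` already holds for every valid position.
        if values[indices[idx]] {
            out[idx / 8] |= 1 << (idx % 8);
        }
    }
    out
}
```

Both bounds checks that the diff removes are on `idx`, which never exceeds the length of `indices`; the lookup into `values` by a user-supplied index stays checked in the diff as well.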