-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Incremental compile times feel like they have gotten slower some time ago. While investigating I notices that a large amount of llvm code is generated for the comparison kernels, especially dyn and dict versions. These end up calling the generic BooleanArray::from_iter method which therefore gets duplicated for each possible iterator type.
$ cargo llvm-lines | head -n30
Compiling arrow v16.0.0 (/home/jhorstmann/Source/github/apache/arrow-rs/arrow)
Finished dev [unoptimized + debuginfo] target(s) in 35.58s
Lines Copies Function name
----- ------ -------------
3156218 (100%) 73225 (100%) (TOTAL)
292166 (9.3%) 1164 (1.6%) <arrow::array::array_boolean::BooleanArray as core::iter::traits::collect::FromIterator<Ptr>>::from_iter
112200 (3.6%) 1545 (2.1%) core::iter::traits::iterator::Iterator::fold
92112 (2.9%) 912 (1.2%) arrow::compute::kernels::comparison::cmp_dict
85538 (2.7%) 1236 (1.7%) <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::fold::enumerate::{{closure}}
84972 (2.7%) 1164 (1.6%) <arrow::array::array_boolean::BooleanArray as core::iter::traits::collect::FromIterator<Ptr>>::from_iter::{{closure}}
83903 (2.7%) 1478 (2.0%) core::iter::adapters::map::map_fold::{{closure}}
80639 (2.6%) 1533 (2.1%) core::option::Option<T>::map
80304 (2.5%) 912 (1.2%) arrow::compute::kernels::comparison::cmp_dict::{{closure}}
78427 (2.5%) 1478 (2.0%) <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
65942 (2.1%) 1236 (1.7%) <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::fold
59070 (1.9%) 1517 (2.1%) core::iter::traits::iterator::Iterator::for_each
39108 (1.2%) 2360 (3.2%) core::iter::adapters::map::Map<I,F>::new
38034 (1.2%) 137 (0.2%) arrow::util::trusted_len::trusted_len_unzip
37336 (1.2%) 1511 (2.1%) core::iter::traits::iterator::Iterator::for_each::call::{{closure}}
33218 (1.1%) 62 (0.1%) core::slice::sort::partition_in_blocks
29970 (0.9%) 81 (0.1%) arrow::compute::kernels::take::take_primitive
27904 (0.9%) 2360 (3.2%) core::iter::traits::iterator::Iterator::map
27631 (0.9%) 249 (0.3%) <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next
27392 (0.9%) 202 (0.3%) <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::size_hint
26906 (0.9%) 140 (0.2%) arrow::array::array_primitive::PrimitiveArray<T>::from_trusted_len_iter
26764 (0.8%) 136 (0.2%) arrow::buffer::mutable::MutableBuffer::try_from_trusted_len_iter
25676 (0.8%) 277 (0.4%) core::iter::traits::iterator::Iterator::try_fold
22117 (0.7%) 257 (0.4%) <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
19091 (0.6%) 215 (0.3%) <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::next
18150 (0.6%) 415 (0.6%) core::option::Option<T>::ok_or_else
17976 (0.6%) 168 (0.2%) arrow::buffer::mutable::MutableBuffer::from_trusted_len_iter_bool
17430 (0.6%) 1478 (2.0%) core::iter::adapters::map::map_fold
Removing the dyn and dict comparison kernels gives the following number of lines (but this is of course not a realistic option):
$ cargo llvm-lines | head -n30
Compiling arrow v16.0.0 (/home/jhorstmann/Source/github/apache/arrow-rs/arrow)
Lines Copies Function name
----- ------ -------------
1737110 (100%) 41764 (100%) (TOTAL)
49753 (2.9%) 939 (2.2%) core::option::Option<T>::map
38034 (2.2%) 137 (0.3%) arrow::util::trusted_len::trusted_len_unzip
33218 (1.9%) 62 (0.1%) core::slice::sort::partition_in_blocks
30112 (1.7%) 385 (0.9%) core::iter::traits::iterator::Iterator::fold
29970 (1.7%) 81 (0.2%) arrow::compute::kernels::take::take_primitive
26906 (1.5%) 140 (0.3%) arrow::array::array_primitive::PrimitiveArray<T>::from_trusted_len_iter
26764 (1.5%) 136 (0.3%) arrow::buffer::mutable::MutableBuffer::try_from_trusted_len_iter
25676 (1.5%) 277 (0.7%) core::iter::traits::iterator::Iterator::try_fold
22117 (1.3%) 257 (0.6%) <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
19091 (1.1%) 215 (0.5%) <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::next
18337 (1.1%) 318 (0.8%) core::iter::adapters::map::map_fold::{{closure}}
16947 (1.0%) 318 (0.8%) <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
16576 (1.0%) 64 (0.2%) arrow::compute::kernels::cast::pack_numeric_to_dictionary
16177 (0.9%) 261 (0.6%) <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
15683 (0.9%) 41 (0.1%) <arrow::array::array_string::GenericStringArray<OffsetSize> as core::iter::traits::collect::FromIterator<core::option::Option<Ptr>>>::from_iter
14998 (0.9%) 838 (2.0%) core::iter::adapters::map::Map<I,F>::new
14586 (0.8%) 86 (0.2%) arrow::buffer::mutable::MutableBuffer::from_trusted_len_iter
14551 (0.8%) 96 (0.2%) arrow::buffer::mutable::MutableBuffer::extend_from_iter
14145 (0.8%) 77 (0.2%) <arrow::array::array_primitive::PrimitiveArray<T> as core::iter::traits::collect::FromIterator<Ptr>>::from_iter
13912 (0.8%) 39 (0.1%) arrow::array::array::print_long_array
13830 (0.8%) 357 (0.9%) core::iter::traits::iterator::Iterator::for_each
13743 (0.8%) 27 (0.1%) arrow::compute::kernels::filter::filter_primitive
13726 (0.8%) 62 (0.1%) core::slice::sort::partition
13125 (0.8%) 98 (0.2%) <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold::{{closure}}
12256 (0.7%) 64 (0.2%) arrow::array::builder::PrimitiveDictionaryBuilder<K,V>::append
11804 (0.7%) 62 (0.1%) core::slice::sort::partition_equal
11215 (0.6%) 88 (0.2%) <arrow::buffer::immutable::Buffer as core::iter::traits::collect::FromIterator<T>>::from_iter
This seems to indicate that nearly half of the llvm code is caused by the comparison kernels.
Describe the solution you'd like
I'm not sure whether there is a good solution for this, but wanted to document these findings.
The kernels use macros very extensively, but replacing these with generic functions will probably lead to the same amount of llvm lines because of monomorphization. Maybe some of the non-generic common code could be extracted into helper methods.
Describe alternatives you've considered
The dyn kernels could also be put behind a feature flag, but that would again complicate the test matrix.