-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
StringView / BinaryView were added to the Arrow format that make it more suitable for certain types of operations on strings. Specifically when filtering with string data, creating the output StringArray requires copying the strings to a new, packed binary buffer.
GenericByteViewArrary was designed to solve this limitation and the arrow-rs implementation, tracked by apache/arrow-rs#5374, is now complete enough to start adding support into DataFusion
I think we can improve performance in certain cases by using StringView (this is also described in more details in the Pola.rs blog post)
- Reading strings / binary from Parquet files as
StringViewArray/BinaryViewArrayrather thanStringArray/BinaryArraysaves a copy (and @ariesdevil is quite close to having it integrated into the parquet reader ImplementStringViewArrayandBinaryViewArrayreading/writing in parquet arrow-rs#5530) - Evaluating predicates on string expressions (for example
substr(url, 4) = 'http') as the intermediate result ofsubstrcan be called without copying string values
Describe the solution you'd like
I would like to support StringView / BinaryView support in DataFusion.
While my primary usecase is for reading data from parquet, I think teaching DataFusion to use StringView (at least as intermediate values when evaluating expressions may help significantly
Development branch: string-view
Since this feature requires upstream arrow-rs support apache/arrow-rs#5374 that is not yet released we plan to do development on a string-view feature branch:
https://github.com/apache/datafusion/tree/string-view
Task List
Here are some high level tasks (I can help flesh these out for anyone who is interested in helping)
- Implement
arrow_castsupport forStringViewandBinaryView#10920 - Implement equality
=and inequality<>support forStringView#10919 - Improve filter predicates with
Utf8Viewliterals #10998 - Implement
LIKEfor StringView arrays #11024 - Implement equality = and inequality <> support for
BinaryView#10996 - use StringViewArray when reading String columns from Parquet #10921
- Create a branch for StringView / BinaryView development #10961
- Implement
LIKEfor StringView arrays #11024 - Implement
REGEXP_REPLACEfor StringView #11025 - Support for
LargeStringandLargeBinaryforStringViewandBinaryView#11023 -
likebenchmark for StringView arrow-rs#5936 - Add generic coercion rule from
Utf8View->Utf8/BinaryView->Binaryfor compatibility - File tickets for adding native Utf8View / BinaryView support to other functions which would benefit
- Establish pattern for implementing optimized string functions for StringViewArray
- Enable reading StringView by default from Parquet (
schema_force_string_view) by default #11682 - Implement native
StringViewsupport for CharacterLength #11677
Describe alternatives you've considered
No response
Additional context
Polars implemented it recently in rust so that can serve as a motivation
Blog Post https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790
Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
Related PRs:
pola-rs/polars#13748
pola-rs/polars#13839
pola-rs/polars#13489