Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently ArrayBuilderContext has multiple responsibilities:
- Parquet -> Arrow schema conversion
- Constructing the necessary ArrayBuilders
- Projection pushdown
The result is not only immensely confusing but also:
- Overlaps with code in parquet_to_arrow_schema_by_columns
- Hard to test - Improve Unit Test Coverage of ArrayReaderBuilder #1484
- Potentially inconsistent - Inconsistent Arrow Schema When Projecting Nested Parquet File #1652
- Buggy - parquet_to_arrow_schema_by_columns Incorrectly Handles Nested Types #1654
Describe the solution you'd like
Create an ArrowSchemaConverter which takes a FileMetaData and an optional column projection and returns a ParquetField, where
struct ParquetField {
    rep_level: i16,
    def_level: i16,
    arrow_type: DataType,
    parquet_type: TypePtr,
    leaf_idx: Option<usize>,
    children: Vec<ParquetField>,
}
This can then easily be used to generate the Schema or ArrayReader for the projected columns, replacing the existing logic.
As FileMetaData can easily be created, this should be significantly easier to test than the current logic.
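To illustrate how such a tree might be consumed, here is a rough sketch and not a proposed implementation. It assumes simplified stand-ins so that it compiles without the arrow or parquet crates: `DataType` is a tiny local enum, `parquet_type` is replaced by a plain `name` field, and the helpers `leaf_indices`, `schema_fields`, `leaf`, `group` and `example_root` are all hypothetical names that do not exist in the crate.

```rust
// Simplified stand-in for arrow's DataType (assumption: the real enum is much larger).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Struct(Vec<(String, DataType)>),
}

// Same shape as the proposed ParquetField, except that `name` replaces
// `parquet_type: TypePtr` so the sketch has no external dependencies.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct ParquetField {
    name: String, // hypothetical stand-in for parquet_type
    rep_level: i16,
    def_level: i16,
    arrow_type: DataType,
    leaf_idx: Option<usize>, // Some(i) for a leaf column, None for a group
    children: Vec<ParquetField>,
}

/// Collect the leaf column indices beneath `field` in depth-first order.
/// An ArrayReader builder could use this to decide which columns to read.
fn leaf_indices(field: &ParquetField) -> Vec<usize> {
    let mut out = Vec::new();
    collect(field, &mut out);
    out
}

fn collect(field: &ParquetField, out: &mut Vec<usize>) {
    if let Some(idx) = field.leaf_idx {
        out.push(idx);
    }
    for child in &field.children {
        collect(child, out);
    }
}

/// Derive the top-level (name, type) pairs of the Arrow schema from the root.
fn schema_fields(root: &ParquetField) -> Vec<(String, DataType)> {
    root.children
        .iter()
        .map(|c| (c.name.clone(), c.arrow_type.clone()))
        .collect()
}

/// Build a non-repeated leaf field (hypothetical helper for tests).
fn leaf(name: &str, arrow_type: DataType, def_level: i16, idx: usize) -> ParquetField {
    ParquetField {
        name: name.to_string(),
        rep_level: 0,
        def_level,
        arrow_type,
        leaf_idx: Some(idx),
        children: vec![],
    }
}

/// Build a non-repeated group whose Arrow type is a struct of its children.
fn group(name: &str, def_level: i16, children: Vec<ParquetField>) -> ParquetField {
    let arrow_type = DataType::Struct(
        children
            .iter()
            .map(|c| (c.name.clone(), c.arrow_type.clone()))
            .collect(),
    );
    ParquetField {
        name: name.to_string(),
        rep_level: 0,
        def_level,
        arrow_type,
        leaf_idx: None,
        children,
    }
}

/// A required `id` column plus an optional `point` group with two optional leaves.
fn example_root() -> ParquetField {
    group(
        "schema",
        0,
        vec![
            leaf("id", DataType::Int32, 0, 0),
            group(
                "point",
                1,
                vec![
                    leaf("x", DataType::Int32, 2, 1),
                    leaf("label", DataType::Utf8, 2, 2),
                ],
            ),
        ],
    )
}
```

Because both the schema and the set of leaf columns are derived from the same tree, a projection would only need to prune the tree once, and the two views could not drift apart.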
Describe alternatives you've considered
Some of the bugs can be worked around manually, but the code is getting increasingly difficult to reason about, and I think it has reached the point where we need to spend some time refactoring it.
Additional context