Separate Parquet -> Arrow Schema Conversion From ArrayBuilder

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently, `ArrayBuilderContext` has multiple responsibilities:

* Parquet -> Arrow schema conversion
* Constructing the necessary ArrayBuilders
* Projection pushdown

The result is not only immensely confusing, but the code also:

* Overlaps with code in `parquet_to_arrow_schema_by_columns`
* Is hard to test - #1484
* Is potentially inconsistent - #1652
* Is buggy - #1654

**Describe the solution you'd like**

Create an `ArrowSchemaConverter` which takes a `FileMetaData` and an optional column projection and returns a `ParquetField`, where:

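```rust
struct ParquetField {
    /// Repetition level of this field
    rep_level: i16,
    /// Definition level of this field
    def_level: i16,
    /// The arrow type this field converts to
    arrow_type: DataType,
    /// The parquet schema node for this field
    parquet_type: TypePtr,
    /// Index of the leaf column, if this field is a primitive leaf
    leaf_idx: Option<usize>,
    /// Child fields, empty for leaves
    children: Vec<ParquetField>,
}
```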

This can then easily be used to generate the `Schema` or `ArrayReader` for the projected columns, replacing the existing logic.
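To illustrate how such a tree might be consumed, here is a minimal sketch that walks a `ParquetField` and collects the leaf column indices a reader would need to fetch. It is not the proposed implementation: `DataType` is a simplified stand-in for the arrow type, `parquet_type` is omitted so the snippet compiles without the `parquet` crate, and `leaf_indices`, `collect` and `example_root` are hypothetical names. The levels on the leaves follow the usual Parquet rules, while the levels assigned to the group nodes are an assumption of this sketch.

```rust
// Simplified stand-in for arrow's `DataType` (assumption, not the real enum).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>),
    Struct(Vec<DataType>),
}

// Mirrors the proposed `ParquetField`, minus `parquet_type`.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct ParquetField {
    rep_level: i16,
    def_level: i16,
    arrow_type: DataType,
    leaf_idx: Option<usize>,
    children: Vec<ParquetField>,
}

/// Collects the leaf column indices under `field` in depth-first order.
fn leaf_indices(field: &ParquetField) -> Vec<usize> {
    let mut out = Vec::new();
    collect(field, &mut out);
    out
}

fn collect(field: &ParquetField, out: &mut Vec<usize>) {
    if let Some(idx) = field.leaf_idx {
        out.push(idx);
    }
    for child in &field.children {
        collect(child, out);
    }
}

/// Hand-built tree for a schema with a required int32 column (leaf 0)
/// and an optional list of required strings (leaf 1).
fn example_root() -> ParquetField {
    let id = ParquetField {
        rep_level: 0,
        def_level: 0,
        arrow_type: DataType::Int32,
        leaf_idx: Some(0),
        children: vec![],
    };
    let element = ParquetField {
        rep_level: 1, // inside one repeated group
        def_level: 2, // optional list + repeated group
        arrow_type: DataType::Utf8,
        leaf_idx: Some(1),
        children: vec![],
    };
    let tags = ParquetField {
        rep_level: 0, // group-node levels are an assumption of this sketch
        def_level: 1,
        arrow_type: DataType::List(Box::new(DataType::Utf8)),
        leaf_idx: None,
        children: vec![element],
    };
    ParquetField {
        rep_level: 0,
        def_level: 0,
        arrow_type: DataType::Struct(vec![
            DataType::Int32,
            DataType::List(Box::new(DataType::Utf8)),
        ]),
        leaf_idx: None,
        children: vec![id, tags],
    }
}
```

Calling `leaf_indices(&example_root())` visits both leaves, and calling it on only the list child yields just that column's index, which is roughly what a projected read would need.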

As `FileMetaData` can easily be created, this should be significantly easier to test than the current logic.

**Describe alternatives you've considered**

Some of the bugs can be worked around manually, but the code is getting increasingly difficult to reason about, and I think it has reached a point where we need to spend some time refactoring it.

**Additional context**

https://github.com/apache/arrow-rs/issues/1654
https://github.com/apache/arrow-rs/issues/1652
https://github.com/apache/arrow-rs/issues/1459


Separate Parquet -> Arrow Schema Conversion From ArrayBuilder #1655
