Separate Parquet -> Arrow Schema Conversion From ArrayBuilder #1655

@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, ArrayBuilderContext has multiple responsibilities:

  • Parquet -> Arrow schema conversion
  • Constructing the necessary ArrayBuilders
  • Projection pushdown

The result is not only immensely confusing but also:

Describe the solution you'd like

Create an ArrowSchemaConverter which takes a FileMetaData and an optional column projection, and returns a ParquetField, where:

struct ParquetField {
    rep_level: i16,
    def_level: i16,
    arrow_type: DataType,
    parquet_type: TypePtr,
    leaf_idx: Option<usize>,
    children: Vec<ParquetField>
}

This can then easily be used to generate the Schema or ArrayReader for the projected columns, replacing the existing logic.
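As a rough illustration of the idea, here is a minimal sketch of such a conversion with the projection applied during the walk. It is not the parquet crate's API: `SchemaNode`, `Repetition`, `convert`, `leaf_indices` and `example_schema` are hypothetical stand-ins, and `ParquetField` is simplified to carry a `name` instead of `arrow_type` and `parquet_type` so that the example has no dependencies. It assumes the usual level rules: an optional field adds one definition level, and a repeated field adds one definition level and one repetition level.

```rust
/// Hypothetical stand-in for the parquet repetition of a field.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Repetition { Required, Optional, Repeated }

/// Hypothetical stand-in for a parquet schema node (the real crate uses `TypePtr`).
#[derive(Debug)]
enum SchemaNode {
    Leaf { name: String, repetition: Repetition },
    Group { name: String, repetition: Repetition, fields: Vec<SchemaNode> },
}

/// Simplified `ParquetField`: `name` replaces `arrow_type` and `parquet_type`.
#[derive(Debug, PartialEq)]
struct ParquetField {
    name: String,
    rep_level: i16,
    def_level: i16,
    leaf_idx: Option<usize>,
    children: Vec<ParquetField>,
}

/// Converts `node` depth-first, numbering leaves in schema order.
/// Returns `None` for leaves outside `projection` and for groups left with no children.
fn convert(
    node: &SchemaNode,
    rep_level: i16,
    def_level: i16,
    next_leaf: &mut usize,
    projection: Option<&[usize]>,
) -> Option<ParquetField> {
    let (name, repetition) = match node {
        SchemaNode::Leaf { name, repetition } => (name, *repetition),
        SchemaNode::Group { name, repetition, .. } => (name, *repetition),
    };
    // Levels are computed once here, independently of any array builder.
    let (rep_level, def_level) = match repetition {
        Repetition::Required => (rep_level, def_level),
        Repetition::Optional => (rep_level, def_level + 1),
        Repetition::Repeated => (rep_level + 1, def_level + 1),
    };
    match node {
        SchemaNode::Leaf { .. } => {
            // Every leaf consumes an index, even if it is projected away.
            let idx = *next_leaf;
            *next_leaf += 1;
            let selected = projection.map_or(true, |p| p.contains(&idx));
            selected.then(|| ParquetField {
                name: name.clone(),
                rep_level,
                def_level,
                leaf_idx: Some(idx),
                children: vec![],
            })
        }
        SchemaNode::Group { fields, .. } => {
            let children: Vec<ParquetField> = fields
                .iter()
                .filter_map(|f| convert(f, rep_level, def_level, next_leaf, projection))
                .collect();
            if children.is_empty() {
                return None; // nothing selected below this group
            }
            Some(ParquetField { name: name.clone(), rep_level, def_level, leaf_idx: None, children })
        }
    }
}

/// Collects the leaf column indices a reader would need, in schema order.
fn leaf_indices(field: &ParquetField, out: &mut Vec<usize>) {
    if let Some(idx) = field.leaf_idx {
        out.push(idx);
    }
    for child in &field.children {
        leaf_indices(child, out);
    }
}

/// message { required a; optional group b { repeated c; required d } }
fn example_schema() -> SchemaNode {
    let leaf = |n: &str, r| SchemaNode::Leaf { name: n.to_string(), repetition: r };
    let b = SchemaNode::Group {
        name: "b".to_string(),
        repetition: Repetition::Optional,
        fields: vec![leaf("c", Repetition::Repeated), leaf("d", Repetition::Required)],
    };
    SchemaNode::Group {
        name: "schema".to_string(),
        repetition: Repetition::Required,
        fields: vec![leaf("a", Repetition::Required), b],
    }
}
```

Calling `convert(&example_schema(), 0, 0, &mut 0, Some(&[1]))` in this sketch keeps only the path `schema -> b -> c`. Both the Arrow schema and the array readers could then be derived from the same pruned tree, which is the separation this issue proposes.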

As FileMetaData can easily be created, this should be significantly easier to test than the current logic.

Describe alternatives you've considered

Some of the bugs can be worked around manually, but the code is getting increasingly difficult to reason about, and I think it has reached a point where we need to spend some time refactoring it.

Additional context

#1654
#1652
#1459

Labels: enhancement, parquet
