Incorrect Repeated Field Schema Inference #1681

@tustvold

Describe the bug

The schema inference logic in the parquet crate does not infer the correct nullability for nested repeated types.

For example:

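use std::sync::Arc;

use parquet::arrow::parquet_to_arrow_schema;
use parquet::schema::parser::parse_message_type;
use parquet::schema::types::SchemaDescriptor;

// Two nested REPEATED groups without a LIST annotation
let message_type = "
message test_schema {
  OPTIONAL INT32 leaf1;
  REPEATED GROUP outerGroup {
    OPTIONAL INT32 leaf2;
    REPEATED GROUP innerGroup {
      OPTIONAL INT32 leaf3;
    }
  }
}
";
let parquet_group_type = parse_message_type(message_type).unwrap();
let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type));
let converted_arrow_schema =
    parquet_to_arrow_schema(&parquet_schema, None).unwrap();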

This infers innerGroup and outerGroup as nullable lists with nullable elements, when in fact neither the lists nor their elements are nullable.

To Reproduce

See test

Expected behavior

The nullability should be inferred correctly: outerGroup and innerGroup should both be non-nullable lists with non-nullable elements.
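To make the expected result concrete, here is a minimal, self-contained sketch of the rule in question. It is not the parquet crate's implementation: `Repetition`, `Node`, `Inferred`, `infer`, and `infer_schema` are hypothetical names introduced only for illustration, and the model assumes that a REPEATED field maps to a required list of required elements while an OPTIONAL field is simply nullable.

```rust
// Hypothetical, simplified model of the rule described in this issue.
// None of these types come from the parquet or arrow crates.

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

#[derive(Debug)]
struct Node {
    name: &'static str,
    repetition: Repetition,
    children: Vec<Node>, // empty for primitive leaves
}

/// Nullability inferred for a single field, identified by its dotted path.
#[derive(Debug, PartialEq)]
struct Inferred {
    path: String,
    is_list: bool,
    nullable: bool,                 // nullability of the field (the list itself, if is_list)
    element_nullable: Option<bool>, // Some(..) only for lists
}

fn infer(node: &Node, prefix: &str, out: &mut Vec<Inferred>) {
    let path = if prefix.is_empty() {
        node.name.to_string()
    } else {
        format!("{}.{}", prefix, node.name)
    };
    // Assumed rule: REPEATED means a non-nullable list of non-nullable elements.
    let (is_list, nullable, element_nullable) = match node.repetition {
        Repetition::Repeated => (true, false, Some(false)),
        Repetition::Optional => (false, true, None),
        Repetition::Required => (false, false, None),
    };
    out.push(Inferred { path: path.clone(), is_list, nullable, element_nullable });
    for child in &node.children {
        infer(child, &path, out);
    }
}

fn infer_schema(fields: &[Node]) -> Vec<Inferred> {
    let mut out = Vec::new();
    for field in fields {
        infer(field, "", &mut out);
    }
    out
}

fn leaf(name: &'static str, repetition: Repetition) -> Node {
    Node { name, repetition, children: Vec::new() }
}

/// The schema from the example in this issue, expressed in the toy model.
fn example_schema() -> Vec<Node> {
    vec![
        leaf("leaf1", Repetition::Optional),
        Node {
            name: "outerGroup",
            repetition: Repetition::Repeated,
            children: vec![
                leaf("leaf2", Repetition::Optional),
                Node {
                    name: "innerGroup",
                    repetition: Repetition::Repeated,
                    children: vec![leaf("leaf3", Repetition::Optional)],
                },
            ],
        },
    ]
}

fn main() {
    for field in infer_schema(&example_schema()) {
        println!("{:?}", field);
    }
}
```

In this model only the OPTIONAL leaves come out nullable; both repeated groups are reported as non-nullable lists with non-nullable elements, which is the outcome the real conversion is expected to match.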

Additional context

This has likely been hidden by the lack of support for repeated fields (#1680).

Labels: bug, parquet (Changes to the parquet crate)
