Commit c52db65
Added arrow-avro schema resolution foundations and type promotion (#8047)
# Which issue does this PR close?
- Part of #4886
# Rationale for this change
This change introduces the foundation in `codec.rs` for supporting
Avro schema evolution, a key feature of the Avro specification. It
enables reading Avro data when the writer's schema and the reader's
schema do not match exactly but are compatible according to Avro's
resolution rules. This makes data consumption more robust and flexible.
This approach focuses on "annotating" each `AvroDataType` with optional
`ResolutionInfo` and then building the `Codec` using the
`reader_schema`. This `ResolutionInfo` will be used downstream in my
next PR by the `RecordDecoder` to efficiently read and decode the raw
record bytes into the `reader_schema`.
Once this is merged in, promotion schema resolution support will need to
be added to the `RecordDecoder` in a follow-up PR. These `RecordDecoder`
updates will resemble this:
```rust
// Excerpt of match arms: each promotion selects a decoder variant
// whose builders hold values in the reader's (promoted) type.
Promotion::IntToLong => Int32ToInt64(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::IntToFloat => Int32ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::IntToDouble => Int32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::LongToFloat => Int64ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::LongToDouble => Int64ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::FloatToDouble => {
    Float32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY))
}
Promotion::BytesToString => BytesToString(
    OffsetBufferBuilder::new(DEFAULT_CAPACITY),
    BufferBuilder::new(DEFAULT_CAPACITY),
),
Promotion::StringToBytes => StringToBytes(
    OffsetBufferBuilder::new(DEFAULT_CAPACITY),
    BufferBuilder::new(DEFAULT_CAPACITY),
),
```
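For context, the promotions above correspond to the primitive type pairs that the Avro specification allows a reader to accept from a writer. The following is a minimal sketch of how such a pair could be classified. The `Primitive` and `Resolved` types and the `resolve_primitive` function are hypothetical names introduced here for illustration and are not the types used in `codec.rs`; only the `Promotion` variant names are taken from the snippet above.

```rust
/// Hypothetical stand-in for Avro primitive types (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Primitive {
    Int,
    Long,
    Float,
    Double,
    Bytes,
    String,
}

/// Variant names mirror the `Promotion` arms in the snippet above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Promotion {
    IntToLong,
    IntToFloat,
    IntToDouble,
    LongToFloat,
    LongToDouble,
    FloatToDouble,
    BytesToString,
    StringToBytes,
}

/// Hypothetical outcome of resolving a writer primitive against a reader primitive.
#[derive(Debug, PartialEq, Eq)]
enum Resolved {
    /// Writer and reader types are identical; no conversion is needed.
    Same,
    /// The writer type can be promoted to the reader type.
    Promote(Promotion),
}

/// Classifies a (writer, reader) pair using the Avro promotion rules.
/// Any pair that is neither identical nor promotable is an error.
fn resolve_primitive(writer: Primitive, reader: Primitive) -> Result<Resolved, String> {
    use Primitive as P;
    if writer == reader {
        return Ok(Resolved::Same);
    }
    let promotion = match (writer, reader) {
        (P::Int, P::Long) => Promotion::IntToLong,
        (P::Int, P::Float) => Promotion::IntToFloat,
        (P::Int, P::Double) => Promotion::IntToDouble,
        (P::Long, P::Float) => Promotion::LongToFloat,
        (P::Long, P::Double) => Promotion::LongToDouble,
        (P::Float, P::Double) => Promotion::FloatToDouble,
        (P::Bytes, P::String) => Promotion::BytesToString,
        (P::String, P::Bytes) => Promotion::StringToBytes,
        _ => {
            return Err(format!(
                "cannot resolve writer type {writer:?} to reader type {reader:?}"
            ))
        }
    };
    Ok(Resolved::Promote(promotion))
}
```

Promotions only widen: `resolve_primitive(Primitive::Long, Primitive::Int)` returns an error in this sketch, because narrowing a `long` to an `int` is not a permitted promotion.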
# What changes are included in this PR?
- **Schema Resolution Logic**: The core of this PR is the new schema
resolution logic, which is encapsulated in the `Maker` struct. This
handles:
  - **Type Promotions**: E.g., promoting `int` to `long` or `string` to
    `bytes`.
  - **Default Values**: Using default values from the reader's schema when
    a field is missing in the writer's schema.
  - **Record Evolution**: Resolving differences in record fields between
    the writer and reader schemas. This includes adding or removing fields.
  - **Enum Evolution**: Mapping enum symbols between the writer's and
    reader's schemas.
- **New Data Structures**: Several new data structures have been added
to support schema resolution:
  - `ResolutionInfo`: An enum that captures the necessary information for
    resolving schema differences.
  - `ResolvedRecord`: A struct that holds the mapping between writer and
    reader record fields.
  - `AvroLiteral`: Represents Avro default values.
  - `Promotion`: An enum for different kinds of type promotions.
  - `EnumMapping`: A struct for enum symbol mapping.
- **Updated `AvroFieldBuilder`**: The `AvroFieldBuilder` has been
updated to accept both a writer's and an optional reader's schema to
facilitate schema resolution.
- **`PartialEq` Derivations**: `PartialEq` has been derived for several
structs to simplify testing.
- **Refactoring**: The schema parsing logic has been refactored from a
standalone function into the new `Maker` struct for better organization.
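
To illustrate the record and enum evolution rules listed above, here is a rough, self-contained sketch. It is not the code in this PR: `Field`, `RecordPlan`, `resolve_record`, and `resolve_enum` are hypothetical names, defaults are modeled as plain strings, and the sketch ignores aliases and per-field type resolution. It assumes the usual Avro behavior: writer fields unknown to the reader are skipped, reader fields missing from the writer need a default, and unknown enum symbols fall back to the reader's default symbol if one exists.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified field: a name plus an optional default value
/// (kept as a plain string purely for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: &'static str,
    default: Option<&'static str>,
}

/// Hypothetical resolution result; not the crate's `ResolvedRecord`.
#[derive(Debug, PartialEq)]
struct RecordPlan {
    /// For each writer field, the reader field index it feeds, or `None` to skip it.
    writer_to_reader: Vec<Option<usize>>,
    /// Reader field indices absent from the writer, to be filled from defaults.
    defaulted: Vec<usize>,
}

/// Matches fields by name and checks that every reader-only field has a default.
fn resolve_record(writer: &[Field], reader: &[Field]) -> Result<RecordPlan, String> {
    let reader_index: HashMap<&str, usize> =
        reader.iter().enumerate().map(|(i, f)| (f.name, i)).collect();
    let writer_to_reader: Vec<Option<usize>> = writer
        .iter()
        .map(|f| reader_index.get(f.name).copied())
        .collect();
    let mut defaulted = Vec::new();
    for (i, f) in reader.iter().enumerate() {
        if !writer_to_reader.contains(&Some(i)) {
            if f.default.is_none() {
                return Err(format!("reader field `{}` has no default", f.name));
            }
            defaulted.push(i);
        }
    }
    Ok(RecordPlan { writer_to_reader, defaulted })
}

/// Maps each writer symbol position to a reader symbol position, falling back
/// to the reader's default symbol when the writer symbol is unknown.
fn resolve_enum(
    writer: &[&str],
    reader: &[&str],
    default: Option<&str>,
) -> Result<Vec<usize>, String> {
    let pos = |s: &str| reader.iter().position(|r| *r == s);
    let fallback = default.and_then(|d| pos(d));
    writer
        .iter()
        .map(|&w| {
            pos(w)
                .or(fallback)
                .ok_or_else(|| format!("symbol `{w}` is not in the reader enum and no default is set"))
        })
        .collect()
}
```

Computing these mappings once, when the schemas are resolved, means a decoder would only need to follow precomputed indices per record rather than compare names for every row.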
# Are these changes tested?
Yes, new unit tests have been added to verify the schema resolution
logic, including tests for type promotions and handling of default
values.
# Are there any user-facing changes?
N/A
# Follow-up PRs
- Promotion Schema Resolution support in `RecordDecoder`
- Default Value Schema resolution support (codec + decoder)
- Enum mapping Schema resolution support (codec + decoder)
- Skip Value Schema resolution support (codec + decoder)
- Record resolution support (codec + decoder)

Parent commit: 521aa73