-
Notifications
You must be signed in to change notification settings - Fork 1k
Add ArrowError::AvroError, remaining types and roundtrip tests to arrow-avro,
#8595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ArrowError::AvroError, remaining types and roundtrip tests to arrow-avro,
#8595
Conversation
…nosecond precision timestamps, and enhance Avro writer array type encoders - Introduced `TimestampNanos` codec for nanosecond precision timestamps in `arrow-avro`. - Extended writer support for new data types, including `ListView`, `LargeListView`, and `FixedSizeList`. - Implemented safe conversions for second-to-millisecond scaling in `Time32` and `Timestamp` encoders. - Improved extensibility for array encoders to include `BinaryView` and `Utf8View`. - Added corresponding unit tests for each enhancement.
c00fe62 to
3a9ec80
Compare
|
@mbrobbel @alamb @nathaniel-d-ef I added the remaining round trip tests along with the remaining encoder and decoder types. For the types Avro doesn't natively support I left detailed user friendly errors. I plan to come back and add Sparse Union support + improve the custom typing for 58.0.0 release. This should be the last PR before Also would we be able to get this one into 57.0.0? |
d79b989 to
698c614
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a couple notes.
| } | ||
|
|
||
| /// Time32(Second) to Avro time-millis (int), via safe scaling by 1000 | ||
| struct Time32SecondsToMillisEncoder<'a>(&'a PrimitiveArray<Time32SecondType>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a bit DRYer to have these implement a SecondsToMillis trait or something. I'm okay with this for our needs now though, unless you feel like making that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great callout and it crossed my mind as well. Initially I tried to abstract TimestampSecondsToMillisEncoder and Time32SecondsToMillisEncoder several different ways, however each attempt resulted in more complexity and more code. I tried approaches with involving traits, generics, and macros.
So I went with the rule of three on this one. My thinking is if the need for a third scaling encoder comes up, I could tackle it then.
| } | ||
|
|
||
| /// Build Avro JSON from an Arrow [`ArrowSchema`], applying the given null‑union order. | ||
| /// Build Avro JSON from an Arrow [`ArrowSchema`], applying the given null‑union order and optionally stripping internal Arrow metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a chance we run into issues here with an all-or-nothing approach to removing? Isn't the metadata simply ignored downstream unless it's specifically being referenced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't. By adding metadata that is irrelevant to an Avro Schema, we are just polluting the schema and complicating round tripping imo.
I think there's actually a risk of us running into issues by keeping the metadata present. I can foresee future contributors adding Reader behavior based on internal metadata that unnecessarily complicate decoder behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good; I had to probe because I've definitely debugged one or two things in the past where a valid property vanished because of a setting 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah completely understand. In fact that was a motivation for this change actually. This setting to strip out the metadata is only used prior to writing the Avro schema to an OCF file and it's not publicly accessible outside of the crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jecsand838 and @nathaniel-d-ef
| pub(crate) fn from_arrow_with_options( | ||
| schema: &ArrowSchema, | ||
| null_order: Option<Nullability>, | ||
| options: Option<AvroSchemaOptions>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
| /// Error during JSON-related operations. | ||
| JsonError(String), | ||
| /// Error during Avro-related operations. | ||
| AvroError(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an API change so we should get it in before arrow-57 is released
arrow-avroArrowError::AvroError, remaining types and roundtrip tests to arrow-avro,
Which issue does this PR close?
Rationale for this change
This PR brings Arrow-Avro round‑trip coverage up to date with modern Arrow types and the latest Avro logical types. In particular, Avro 1.12 adds
timestamp-nanosandlocal-timestamp-nanos. Enabling these logical types and filling in missing Avro writer encoders for Arrow’s newer view and list families allows lossless read/write and simpler pipelines.It also hardens timestamp/time scaling in the writer to avoid silent overflow when converting seconds to milliseconds, surfacing a clear error instead.
What changes are included in this PR?
TimestampNanos(bool)codec inarrow-avrothat maps Avrotimestamp-nanos/local-timestamp-nanosto ArrowTimestamp(Nanosecond, tz). The reader/decoder, union field kinds, and ArrowDataTypemapping are all extended accordingly. Logical type detection is wired through bothlogicalTypeand thearrowTimeUnit="nanosecond"attribute.logicalType="uuid"fields, preserve that logical type in Arrow field metadata so writers can round‑trip it back to Avro.ListView,LargeListView, andFixedSizeList, and extend array encoder support toBinaryViewandUtf8View. (See large additions inwriter/encoder.rs.)Time32/Timestampencoders to prevent overflow; encoding now returns a clearInvalidArgumenterror in those cases.AvroSchemaOptionswithnull_orderandstrip_metadataflags so Avro JSON can be built while optionally omitting internal Arrow keys during round‑trip schema generation.Are these changes tested?
Yes.
Timestamp(Nanosecond, tz)behavior for UTC and local timestamps and for nullable unions.Are there any user-facing changes?
N/A since
arrow-avrois not public yet.