Contributor

@jecsand838 jecsand838 commented Oct 13, 2025

Which issue does this PR close?

Rationale for this change

This PR brings Arrow-Avro round‑trip coverage up to date with modern Arrow types and the latest Avro logical types. In particular, Avro 1.12 adds timestamp-nanos and local-timestamp-nanos. Enabling these logical types and filling in missing Avro writer encoders for Arrow’s newer view and list families allows lossless read/write and simpler pipelines.

It also hardens timestamp/time scaling in the writer to avoid silent overflow when converting seconds to milliseconds, surfacing a clear error instead.
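
As a rough illustration of the overflow hardening described above, a checked multiplication can replace a plain `* 1000`. This is only a sketch: `scale_seconds_to_millis` and its `String` error are hypothetical stand-ins, not the encoder code or error type from this PR.

```rust
/// Scale a seconds value to milliseconds, returning an error instead of
/// silently wrapping when the result does not fit in an i64.
/// Hypothetical helper for illustration; not taken from the PR.
fn scale_seconds_to_millis(seconds: i64) -> Result<i64, String> {
    seconds.checked_mul(1_000).ok_or_else(|| {
        format!(
            "seconds value {} overflows when scaled to milliseconds",
            seconds
        )
    })
}
```

With a plain multiplication, an out-of-range input would wrap in release builds; `checked_mul` turns that case into an error the caller can surface.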

What changes are included in this PR?

  • Nanosecond timestamps: Introduces a TimestampNanos(bool) codec in arrow-avro that maps Avro timestamp-nanos / local-timestamp-nanos to Arrow Timestamp(Nanosecond, tz). The reader/decoder, union field kinds, and Arrow DataType mapping are all extended accordingly. Logical type detection is wired through both logicalType and the arrowTimeUnit="nanosecond" attribute.
  • UUID logical type round‑trip fix: When reading Avro logicalType="uuid" fields, preserve that logical type in Arrow field metadata so writers can round‑trip it back to Avro.
  • Avro writer encoders: Add the missing array encoders and coverage for Arrow’s ListView, LargeListView, and FixedSizeList, and extend array encoder support to BinaryView and Utf8View. (See large additions in writer/encoder.rs.)
  • Safer time/timestamp scaling: Guard second to millisecond conversions in Time32/Timestamp encoders to prevent overflow; encoding now returns a clear InvalidArgument error in those cases.
  • Schema utilities: Add AvroSchemaOptions with null_order and strip_metadata flags so Avro JSON can be built while optionally omitting internal Arrow keys during round‑trip schema generation.
  • Tests & round‑trip coverage: Add unit tests for nanosecond timestamp decoding (UTC, local, and with nulls) and additional end‑to‑end/round‑trip tests for the updated writer paths.
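
The nanosecond mapping in the first bullet can be pictured roughly as follows. This is a simplified sketch with stand-in types, not the arrow-avro implementation: the meaning of the `bool` (assumed here to be "UTC-adjusted"), the `"+00:00"` timezone string, and both function names are assumptions made for illustration.

```rust
/// Simplified stand-in for the `TimestampNanos(bool)` codec variant.
/// Assumption: `true` means the value is UTC-adjusted.
#[derive(Debug, PartialEq)]
struct TimestampNanos(bool);

/// Simplified stand-in for Arrow's `Timestamp(Nanosecond, tz)` data type.
#[derive(Debug, PartialEq)]
struct ArrowTimestampNanos {
    timezone: Option<&'static str>,
}

/// Map an Avro `logicalType` string to the nanosecond codec, if it is one.
fn codec_for_logical_type(logical_type: &str) -> Option<TimestampNanos> {
    match logical_type {
        "timestamp-nanos" => Some(TimestampNanos(true)),
        "local-timestamp-nanos" => Some(TimestampNanos(false)),
        _ => None,
    }
}

/// Map the codec to the Arrow type: UTC values carry a timezone, local ones do not.
/// The "+00:00" string is an assumed representation of UTC.
fn arrow_type_for(codec: &TimestampNanos) -> ArrowTimestampNanos {
    ArrowTimestampNanos {
        timezone: if codec.0 { Some("+00:00") } else { None },
    }
}
```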

Are these changes tested?

Yes.

  • New decoder tests validate Timestamp(Nanosecond, tz) behavior for UTC and local timestamps and for nullable unions.
  • Writer tests validate the nanosecond encoder and exercise an overflow path for second→millisecond conversion that now returns an error.
  • Additional round‑trip tests were added alongside the new encoders.

Are there any user-facing changes?

N/A since arrow-avro is not public yet.

@github-actions github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Oct 13, 2025
…nosecond precision timestamps, and enhance Avro writer array type encoders

- Introduced `TimestampNanos` codec for nanosecond precision timestamps in `arrow-avro`.
- Extended writer support for new data types, including `ListView`, `LargeListView`, and `FixedSizeList`.
- Implemented safe conversions for second-to-millisecond scaling in `Time32` and `Timestamp` encoders.
- Improved extensibility for array encoders to include `BinaryView` and `Utf8View`.
- Added corresponding unit tests for each enhancement.
@jecsand838 jecsand838 force-pushed the avro-remaining-types-roundtrip-tests branch from c00fe62 to 3a9ec80 Compare October 13, 2025 08:15
@jecsand838
Contributor Author

@mbrobbel @alamb @nathaniel-d-ef

I added the remaining round-trip tests along with the remaining encoder and decoder types. For the types Avro doesn't natively support, I left detailed, user-friendly errors. I plan to come back and add Sparse Union support and improve the custom typing for the 58.0.0 release.

This should be the last PR before arrow-avro is ready to go and I think we can close #4886 once this gets merged in (unless I'm forgetting something of course).

Also would we be able to get this one into 57.0.0?

@mbrobbel mbrobbel added this to the 57.0.0 milestone Oct 13, 2025
@jecsand838 jecsand838 force-pushed the avro-remaining-types-roundtrip-tests branch from d79b989 to 698c614 Compare October 13, 2025 10:58
Contributor

@nathaniel-d-ef nathaniel-d-ef left a comment

LGTM, just a couple notes.

}

/// Time32(Second) to Avro time-millis (int), via safe scaling by 1000
struct Time32SecondsToMillisEncoder<'a>(&'a PrimitiveArray<Time32SecondType>);
Contributor

It might be a bit DRYer to have these implement a SecondsToMillis trait or something. I'm okay with this for our needs now though, unless you feel like making that change.

Contributor Author

This is a great callout and it crossed my mind as well. Initially I tried to abstract TimestampSecondsToMillisEncoder and Time32SecondsToMillisEncoder several different ways; however, each attempt resulted in more complexity and more code. I tried approaches involving traits, generics, and macros.

So I went with the rule of three on this one. My thinking is if the need for a third scaling encoder comes up, I could tackle it then.
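
For readers following this thread, one hypothetical shape the suggested abstraction might take is sketched below. It is not code from the PR, which keeps two separate encoders; the trait name `SecondsToMillis` comes from the review comment, while `checked_to_millis` and `scale` are invented for illustration.

```rust
use std::fmt::Display;

/// Hypothetical trait unifying the i32 (Time32) and i64 (Timestamp) cases.
trait SecondsToMillis: Copy + Display {
    /// Multiply by 1000, returning `None` on overflow.
    fn checked_to_millis(self) -> Option<Self>;
}

impl SecondsToMillis for i32 {
    fn checked_to_millis(self) -> Option<Self> {
        self.checked_mul(1_000)
    }
}

impl SecondsToMillis for i64 {
    fn checked_to_millis(self) -> Option<Self> {
        self.checked_mul(1_000)
    }
}

/// Shared scaling step that both encoders could call.
fn scale<T: SecondsToMillis>(seconds: T) -> Result<T, String> {
    seconds
        .checked_to_millis()
        .ok_or_else(|| format!("value {} overflows when scaled to milliseconds", seconds))
}
```

Even in this reduced form the trait, two impls, and a generic function take more lines than two direct `checked_mul` calls, which is in line with the trade-off described above.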

}

/// Build Avro JSON from an Arrow [`ArrowSchema`], applying the given null‑union order.
/// Build Avro JSON from an Arrow [`ArrowSchema`], applying the given null‑union order and optionally stripping internal Arrow metadata.
Contributor

Is there a chance we run into issues here with an all-or-nothing approach to removing? Isn't the metadata simply ignored downstream unless it's specifically being referenced?

Contributor Author

@jecsand838 jecsand838 Oct 14, 2025

We shouldn't. By adding metadata that is irrelevant to an Avro Schema, we are just polluting the schema and complicating round tripping imo.

I think there's actually a risk of us running into issues by keeping the metadata present. I can foresee future contributors adding Reader behavior based on internal metadata that unnecessarily complicates decoder behavior.

Contributor

@nathaniel-d-ef nathaniel-d-ef Oct 14, 2025

Sounds good; I had to probe because I've definitely debugged one or two things in the past where a valid property vanished because of a setting 😅

Contributor Author

Oh yeah completely understand. In fact that was a motivation for this change actually. This setting to strip out the metadata is only used prior to writing the Avro schema to an OCF file and it's not publicly accessible outside of the crate.
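
To make the stripping behavior discussed here concrete, a minimal sketch might filter a field's metadata map just before the schema is serialized. Everything named below is an assumption for illustration: the `ARROW:` key prefix, the `strip_internal_metadata` function, and the use of a plain `HashMap` are not taken from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical prefix marking internal keys; the real keys may differ.
const INTERNAL_PREFIX: &str = "ARROW:";

/// Return a copy of `metadata`, dropping internal keys when `strip` is true.
/// User-supplied keys are always preserved.
fn strip_internal_metadata(
    metadata: &HashMap<String, String>,
    strip: bool,
) -> HashMap<String, String> {
    metadata
        .iter()
        .filter(|(key, _)| !(strip && key.starts_with(INTERNAL_PREFIX)))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}
```

Because the filter only matches a known internal prefix, a valid user property would not vanish when the flag is set, which is the concern raised earlier in the thread.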

Contributor

@alamb alamb left a comment

pub(crate) fn from_arrow_with_options(
    schema: &ArrowSchema,
    null_order: Option<Nullability>,
    options: Option<AvroSchemaOptions>,
Contributor

💯

/// Error during JSON-related operations.
JsonError(String),
/// Error during Avro-related operations.
AvroError(String),
Contributor

This is an API change so we should get it in before arrow-57 is released

@alamb alamb added the api-change (Changes to the arrow API) label Oct 14, 2025
@alamb alamb changed the title from "Add remaining types and roundtrip tests to arrow-avro" to "Add ArrowError::AvroError, remaining types and roundtrip tests to arrow-avro" Oct 14, 2025
@mbrobbel mbrobbel merged commit 973e6fc into apache:main Oct 15, 2025
26 checks passed
@jecsand838 jecsand838 deleted the avro-remaining-types-roundtrip-tests branch October 24, 2025 02:57
Labels

api-change (Changes to the arrow API), arrow (Changes to the arrow crate), arrow-avro (arrow-avro crate)

Development

Successfully merging this pull request may close these issues.

Add Avro Support

4 participants