Skip to content

Conversation

@yongkyunlee
Copy link
Contributor

@yongkyunlee yongkyunlee commented Aug 30, 2025

Which issue does this PR close?

Rationale for this change

This PR allows arrow-avro to decode field of type map whose value is non-nullable.
The stack trace and details are mentioned in the Issues ticket.

What changes are included in this PR?

Fix how map type is handled.

Are these changes tested?

  • Unit tests

Are there any user-facing changes?

N/A

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Aug 30, 2025
@jecsand838
Copy link
Contributor

jecsand838 commented Aug 30, 2025

@yongkyunlee Thank you so much for getting this up

I did add support for this in #8220 and was just able to locally run the tests you included successfully. Would you be open to getting those changes in first, then getting this PR in for the added test coverage?

mod tests {
use super::*;
use crate::codec::AvroField;
use crate::schema::Schema as AvroSchema;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need this import due to the use crate::schema::*; on line 22.

Suggested change
use crate::schema::Schema as AvroSchema;

]
}"#;

let schema: AvroSchema = serde_json::from_str(schema_json).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let schema: AvroSchema = serde_json::from_str(schema_json).unwrap();
let schema: Schema = serde_json::from_str(schema_json).unwrap();

]
}"#;

let schema: AvroSchema = serde_json::from_str(schema_json).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not trying to be too nitty with this, there's just a separate struct named AvroSchema defined in arrow-avro/src/schema.rs

Suggested change
let schema: AvroSchema = serde_json::from_str(schema_json).unwrap();
let schema: Schema = serde_json::from_str(schema_json).unwrap();

Comment on lines +593 to +608
// Extract the value field nullability from the schema
let is_value_nullable = match map_field.data_type() {
DataType::Struct(fields) => fields
.iter()
.find(|f| f.name() == "value")
.map(|f| f.is_nullable())
.unwrap_or(false),
_ => true, // default to nullable
};
let entries_struct = StructArray::new(
Fields::from(vec![
Arc::new(ArrowField::new("key", DataType::Utf8, false)),
Arc::new(ArrowField::new("value", val_arr.data_type().clone(), true)),
Arc::new(ArrowField::new(
"value",
val_arr.data_type().clone(),
is_value_nullable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a slightly different approach in #8220 that avoids the field scan and new Field allocation while preserving the existing schema metadata.

It's slightly more performant, especially on smaller batches:

Map/100                 time:   [6.2237 µs 6.2594 µs 6.2899 µs]
                        thrpt:  [507.32 MiB/s 509.79 MiB/s 512.71 MiB/s]
                 change:
                        time:   [−0.3038% +0.5799% +1.4930%] (p = 0.19 > 0.05)
                        thrpt:  [−1.4710% −0.5766% +0.3047%]
                        No change in performance detected.
Map/10000               time:   [250.40 µs 253.75 µs 258.62 µs]
                        thrpt:  [1.2573 GiB/s 1.2814 GiB/s 1.2986 GiB/s]
                 change:
                        time:   [−2.5344% −1.1670% +0.1631%] (p = 0.08 > 0.05)
                        thrpt:  [−0.1628% +1.1808% +2.6003%]
                        No change in performance detected.
Found 6 outliers among 25 measurements (24.00%)
  6 (24.00%) low mild
Map/1000000             time:   [252.99 µs 255.93 µs 260.24 µs]
                        thrpt:  [130.21 GiB/s 132.40 GiB/s 133.94 GiB/s]
                 change:
                        time:   [−1.9418% −0.4373% +1.0680%] (p = 0.60 > 0.05)
                        thrpt:  [−1.0568% +0.4393% +1.9803%]
                        No change in performance detected.

vs

Map/100                 time:   [6.4487 µs 6.4584 µs 6.4687 µs]
                        thrpt:  [493.30 MiB/s 494.09 MiB/s 494.83 MiB/s]
                 change:
                        time:   [+4.2911% +5.0113% +5.7191%] (p = 0.00 < 0.05)
                        thrpt:  [−5.4097% −4.7721% −4.1145%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
Map/10000               time:   [258.90 µs 260.78 µs 263.17 µs]
                        thrpt:  [1.2356 GiB/s 1.2469 GiB/s 1.2559 GiB/s]
                 change:
                        time:   [−0.6514% +0.8318% +2.5189%] (p = 0.34 > 0.05)
                        thrpt:  [−2.4570% −0.8249% +0.6557%]
                        No change in performance detected.
Found 1 outliers among 25 measurements (4.00%)
  1 (4.00%) high severe
Map/1000000             time:   [265.48 µs 268.25 µs 270.56 µs]
                        thrpt:  [125.24 GiB/s 126.33 GiB/s 127.64 GiB/s]
                 change:
                        time:   [+3.0081% +4.1202% +5.2124%] (p = 0.00 < 0.05)
                        thrpt:  [−4.9542% −3.9572% −2.9203%]
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)

If you wanted to check it out:

let entries_fields = match map_field.data_type() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jecsand838 Thanks for the context. Should I close this PR if #8220 covers this issue?

Copy link
Contributor

@jecsand838 jecsand838 Aug 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely think there's value in the tests you wrote if you wanted to merge those in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I'll close this PR and raise a new PR later with only tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments!

@yongkyunlee yongkyunlee closed this Sep 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Avro] Decoder panics on flush when schema contains map whose value is non-nullable

2 participants