-
Notifications
You must be signed in to change notification settings - Fork 1k
[Avro] Support map with non-nullable value type #8254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Avro] Support map with non-nullable value type #8254
Conversation
|
@yongkyunlee Thank you so much for getting this up I did add support for this in #8220 and was just able to locally run the tests you included successfully. Would you be open to getting those changes in first, then getting this PR in for the added test coverage? |
| mod tests { | ||
| use super::*; | ||
| use crate::codec::AvroField; | ||
| use crate::schema::Schema as AvroSchema; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need this import due to the use crate::schema::*; on line 22.
| use crate::schema::Schema as AvroSchema; |
| ] | ||
| }"#; | ||
|
|
||
| let schema: AvroSchema = serde_json::from_str(schema_json).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let schema: AvroSchema = serde_json::from_str(schema_json).unwrap(); | |
| let schema: Schema = serde_json::from_str(schema_json).unwrap(); |
| ] | ||
| }"#; | ||
|
|
||
| let schema: AvroSchema = serde_json::from_str(schema_json).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not trying to be too nitty with this, there's just a separate struct named AvroSchema defined in arrow-avro/src/schema.rs
| let schema: AvroSchema = serde_json::from_str(schema_json).unwrap(); | |
| let schema: Schema = serde_json::from_str(schema_json).unwrap(); |
| // Extract the value field nullability from the schema | ||
| let is_value_nullable = match map_field.data_type() { | ||
| DataType::Struct(fields) => fields | ||
| .iter() | ||
| .find(|f| f.name() == "value") | ||
| .map(|f| f.is_nullable()) | ||
| .unwrap_or(false), | ||
| _ => true, // default to nullable | ||
| }; | ||
| let entries_struct = StructArray::new( | ||
| Fields::from(vec![ | ||
| Arc::new(ArrowField::new("key", DataType::Utf8, false)), | ||
| Arc::new(ArrowField::new("value", val_arr.data_type().clone(), true)), | ||
| Arc::new(ArrowField::new( | ||
| "value", | ||
| val_arr.data_type().clone(), | ||
| is_value_nullable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a slightly different approach in #8220 that avoids the field scan and new Field allocation while preserving the existing schema metadata.
It's slightly more performant, especially on smaller batches:
Map/100 time: [6.2237 µs 6.2594 µs 6.2899 µs]
thrpt: [507.32 MiB/s 509.79 MiB/s 512.71 MiB/s]
change:
time: [−0.3038% +0.5799% +1.4930%] (p = 0.19 > 0.05)
thrpt: [−1.4710% −0.5766% +0.3047%]
No change in performance detected.
Map/10000 time: [250.40 µs 253.75 µs 258.62 µs]
thrpt: [1.2573 GiB/s 1.2814 GiB/s 1.2986 GiB/s]
change:
time: [−2.5344% −1.1670% +0.1631%] (p = 0.08 > 0.05)
thrpt: [−0.1628% +1.1808% +2.6003%]
No change in performance detected.
Found 6 outliers among 25 measurements (24.00%)
6 (24.00%) low mild
Map/1000000 time: [252.99 µs 255.93 µs 260.24 µs]
thrpt: [130.21 GiB/s 132.40 GiB/s 133.94 GiB/s]
change:
time: [−1.9418% −0.4373% +1.0680%] (p = 0.60 > 0.05)
thrpt: [−1.0568% +0.4393% +1.9803%]
No change in performance detected.
vs
Map/100 time: [6.4487 µs 6.4584 µs 6.4687 µs]
thrpt: [493.30 MiB/s 494.09 MiB/s 494.83 MiB/s]
change:
time: [+4.2911% +5.0113% +5.7191%] (p = 0.00 < 0.05)
thrpt: [−5.4097% −4.7721% −4.1145%]
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
Map/10000 time: [258.90 µs 260.78 µs 263.17 µs]
thrpt: [1.2356 GiB/s 1.2469 GiB/s 1.2559 GiB/s]
change:
time: [−0.6514% +0.8318% +2.5189%] (p = 0.34 > 0.05)
thrpt: [−2.4570% −0.8249% +0.6557%]
No change in performance detected.
Found 1 outliers among 25 measurements (4.00%)
1 (4.00%) high severe
Map/1000000 time: [265.48 µs 268.25 µs 270.56 µs]
thrpt: [125.24 GiB/s 126.33 GiB/s 127.64 GiB/s]
change:
time: [+3.0081% +4.1202% +5.2124%] (p = 0.00 < 0.05)
thrpt: [−4.9542% −3.9572% −2.9203%]
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
If you wanted to check it out:
arrow-rs/arrow-avro/src/reader/record.rs
Line 685 in ebf4029
| let entries_fields = match map_field.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jecsand838 Thanks for the context. Should I close this PR if #8220 covers this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely think there's value in the tests you wrote if you wanted to merge those in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I'll close this PR and raise a new PR later with only tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments!
Which issue does this PR close?
Rationale for this change
This PR allows arrow-avro to decode field of type map whose value is non-nullable.
The stack trace and details are mentioned in the Issues ticket.
What changes are included in this PR?
Fix how map type is handled.
Are these changes tested?
Are there any user-facing changes?
N/A