-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Support round trip reading / writing Arrow Duration type to parquet
#7482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
CI Failure should be fixed by @crepererum 's PR: |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Liyixin95 -- this looks really nice. I poked around and actually found that there were some pre-existing tests ❤️
I also tested this PR manually using the duration.parquet file from apache/parquet-testing#80 and the following code:
Details
fn main() {
let file = File::open("/Users/andrewlamb/Downloads/duration.parquet").unwrap();
let reader = ParquetRecordBatchReaderBuilder::try_new(file)
.unwrap()
.build()
.unwrap();
for batch in reader {
let batch = batch.unwrap();
println!("Read batch schema: {:?}", batch.schema());
println!("Read batch num rows: {:?}", batch.num_rows());
println!("Read batch pretty:\n{}", pretty_format_batches(&[batch]).unwrap());
}
}Before this PR:
Field { name: "a", data_type: Int64:
warning: `parquet_test` (bin "parquet_test") generated 1 warning
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
Running `target/debug/parquet_test`
Read batch schema: Schema { fields: [Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
Read batch num rows: 100
Read batch pretty:
+-------------+
| a |
+-------------+
| 86400000000 |
...
After this PR (the schema shows duration):
Field { name: "a", data_type: Duration(Microsecond),
Read batch schema: Schema { fields: [Field { name: "a", data_type: Duration(Microsecond), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
Read batch num rows: 100
Read batch pretty:
+----------+
| a |
+----------+
| PT86400S |
|
|
||
| #[test] | ||
| #[should_panic(expected = "Converting Duration to parquet not supported")] | ||
| fn duration_second_single_column() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found there were already existing tests for the feature ❤️ (thanks @carols10cents )
Duration type to parquet
Duration type to parquetDuration type to parquet
Oh, Thanks. I didn't notice there were some pre-existing tests. I just create the pr and go to sleep. |
I didn't realize there were any tests either until I checked out the branch locally and they failed for me 😆 |
continue from previous pr. @alamb