Description
Describe the bug
When a RecordBatch is stored in a parquet file and then retrieved, the time portion of DataType::Date64 values is changed to 0.
To Reproduce
With this schema:
Field::new("item", DataType::Utf8, false),
Field::new("timestamp", DataType::Date64, false)
- Read the csv1 data below into batch1
- Write batch1 to csv1a
- Compare csv1 to csv1a — they match
- Write batch1 to a parquet file
- Read batch2 from the same parquet file
- Write batch2 to csv2
- Compare csv1 to csv2 — they don’t match because in csv2 the times are all 00:00:00.000000000
csv1:
item,timestamp
1,1998-10-28T19:10:30.056000000
2,1998-10-30T11:10:10.623000000
3,1999-01-23T17:10:31.006000000
csv2:
item,timestamp
1,1998-10-28T00:00:00.000000000
2,1998-10-30T00:00:00.000000000
3,1999-01-23T00:00:00.000000000
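The zeroed times in csv2 are what you would see if the millisecond values were reduced to whole days somewhere in the parquet round trip. Parquet's DATE logical type stores a count of days since the Unix epoch; whether the writer maps Date64 onto it is an assumption here, not something confirmed in this report. The helper names below are hypothetical and use only the standard library, not the arrow or parquet crates. A minimal sketch of that loss, using the first row of csv1:

```rust
const MS_PER_DAY: i64 = 86_400_000;

/// Hypothetical helper: Date64 milliseconds -> whole days since the Unix epoch.
/// div_euclid floors toward negative infinity, so pre-1970 values also land
/// on the start of their day.
fn ms_to_days(ms: i64) -> i32 {
    ms.div_euclid(MS_PER_DAY) as i32
}

/// Hypothetical helper: whole days since the Unix epoch -> Date64 milliseconds.
fn days_to_ms(days: i32) -> i64 {
    days as i64 * MS_PER_DAY
}

/// Assumed round trip: milliseconds are stored as days, then read back.
fn round_trip_via_days(ms: i64) -> i64 {
    days_to_ms(ms_to_days(ms))
}

fn main() {
    // 1998-10-28T19:10:30.056 UTC as milliseconds since the Unix epoch.
    let original: i64 = 909_601_830_056;
    let restored = round_trip_via_days(original);
    println!(
        "original = {}, restored = {}, lost = {} ms",
        original,
        restored,
        original - restored
    );
}
```

Under this assumption the restored value is midnight of the same day, and the lost remainder is exactly the time of day (19:10:30.056).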
Expected behavior
The time portion of the DataType::Date64 value should be preserved in parquet just as it is in csv.
Additional context
Version 8.0.0
It looks like this unit test needs to include some non-zero times:
#[test]
fn date64_single_column() {
    // Date64 must be a multiple of 86400000, see ARROW-10925
    required_and_optional::<Date64Array, _>(
        (0..(SMALL_SIZE as i64 * 86400000)).step_by(86400000),
    );
}
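As a rough illustration of inputs with non-zero times, the sketch below generates one millisecond value per day, each offset by a fixed time of day. The function name and the offset are hypothetical, and wiring the values into required_and_optional is left out because it depends on the crate's test helpers:

```rust
const MS_PER_DAY: i64 = 86_400_000;
// Hypothetical fixed time of day: 19:10:30.056 expressed in milliseconds.
const TIME_OF_DAY_MS: i64 = 69_030_056;

/// Hypothetical generator: `count` consecutive days starting at the epoch,
/// each carrying the same non-zero time of day.
fn date64_values_with_times(count: i64) -> Vec<i64> {
    (0..count)
        .map(|day| day * MS_PER_DAY + TIME_OF_DAY_MS)
        .collect()
}

fn main() {
    for ms in date64_values_with_times(3) {
        // The remainder is the part a days-only round trip would drop.
        println!("{} ms (time of day: {} ms)", ms, ms % MS_PER_DAY);
    }
}
```

None of these values is a multiple of 86400000, so a round trip that keeps only the date would fail an equality check against them.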
According to ARROW-10925 a valid time is in the range 0..86400000 milliseconds.
Here DataType::Date64 is defined to be in milliseconds: https://arrow.apache.org/docs/cpp/api/datatype.html