-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Upgrade DataFusion to arrow/parquet 57.0.0 #17888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Many of the current failures are due because this used to work: select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))'or SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM tAfter the arrow 57 upgrade it fails with errors like # arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns)I believe the problem is that the format of the timezone has changed into I think what we need to do is support both formats for backwards compatibility. I will work on an upstream issue |
ed43cc0 to
ee2de0c
Compare
|
|
||
| // Create Flight client | ||
| let mut client = FlightServiceClient::connect("http://localhost:50051").await?; | ||
| let endpoint = Endpoint::new("http://localhost:50051")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to new version of tonic
|
|
||
| // add an initial FlightData message that sends schema | ||
| let options = arrow::ipc::writer::IpcWriteOptions::default(); | ||
| let mut compression_context = CompressionContext::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| let validate = | ||
| T::validate_decimal_precision(new_value, self.target_precision); | ||
| let validate = T::validate_decimal_precision( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to this (get better messages)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add "Closes #3666" to the PR body 👍
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) | ||
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) | ||
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) | ||
| List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many of the diffs in this file are related to improvements in DataType display, tracked in this ticket
I will try and call out individual changes when I see them. Lists are way nicer now:
| 05)--------ProjectionExec: expr=[] | ||
| 06)----------CoalesceBatchesExec: target_batch_size=8192 | ||
| 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }]) | ||
| 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| SELECT arrow_typeof(now()::timestamp) | ||
| ---- | ||
| Timestamp(Nanosecond, None) | ||
| Timestamp(ns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we'll need to update https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/data_types.md and https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md (the latter via updating the source docs in code) too, but can do in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| ## Timestamps: Create a table | ||
|
|
||
| statement ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timestamp format has changed (improved!) so let's also add tests for the new format
| pbjson-types = { workspace = true } | ||
| prost = { workspace = true } | ||
| substrait = { version = "0.58", features = ["serde"] } | ||
| substrait = { version = "0.59", features = ["serde"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since prost is updated, we also must update substrait
9d06200 to
1b7b559
Compare
8ecbbed to
d3b328b
Compare
f61623e to
9f6a390
Compare
d5bd26e to
7709acc
Compare
| let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \ | ||
| Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }"; | ||
| assert_eq!(expected, arrow_schema.to_string()); | ||
| insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
many many diffs are due to the changes in formatting of Fields and DataTypes (see below)
| +----------------------+ | ||
| | arrow_typeof(test.l) | | ||
| +----------------------+ | ||
| | List(nullable Int32) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new display is much easier to read in my opinion
|
Ok, the tests are now looking good enough to test with the new thrift decoder |
7709acc to
5e1ea80
Compare
|
🤖 |
|
This PR is now ready for review |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
| // Can disable the cache even with filter pushdown by setting the size to 0. In this case we | ||
| // expect the inner records are reported but no records are read from the cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wording a bit off here, since it reads as
Can disable the cache even with filter pushdown by setting the size to 0. In this case we no records are read from the cache and no metrics are reported
Should be this maybe?
Can disable the cache even with filter pushdown by setting the size to 0. This results in no records being read from the cache and no metrics being reported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call -- done in 3a90c6e
|
|
||
| let validate = | ||
| T::validate_decimal_precision(new_value, self.target_precision); | ||
| let validate = T::validate_decimal_precision( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add "Closes #3666" to the PR body 👍
| SELECT arrow_typeof(now()::timestamp) | ||
| ---- | ||
| Timestamp(Nanosecond, None) | ||
| Timestamp(ns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we'll need to update https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/data_types.md and https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md (the latter via updating the source docs in code) too, but can do in a follow up
| statement error DataFusion error: type_coercion\ncaused by\nError during planning: Cannot coerce arithmetic expression Timestamp\(Nanosecond, Some\("\+00:00"\)\) \+ Utf8 to valid types | ||
| statement error | ||
| select i_item_desc from test | ||
| where d3_date > now() + '5 days'; | ||
| ---- | ||
| DataFusion error: type_coercion | ||
| caused by | ||
| Error during planning: Cannot coerce arithmetic expression Timestamp(ns, "+00:00") + Utf8 to valid types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the expected error comes before the query not after, for SLTs 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be either (this change was created by running with --complete)
I believe when the message comes after the query it can contain multiple lines (aka if the error itself actually contains `\n')
|
cargo audit failure seems unrelated |
|
Merging up to get the fix for |
|
Alright, here we go! |
Followup some doc fixes missed in #17888
….0 (#17866) ## Which issue does this PR close? - Closes #17865. ## What changes are included in this PR? Bump the `substrait` version to `v0.75.0` by bumping `substrait-rs` to `v0.60.0`. This PR was originally dependent on [this PR](#17888) to update the versions of some common dependencies, but that PR is now merged in. ## Are these changes tested? There are no tests here, but there is no change to any logic within datafusion. It is simply a bump in a dependency. Technically the public API does change, but as noted in the issue description, there is no change to internal logic because uri / urn from substrait plans are not used. ## Are there any user-facing changes? Yes. Previously substrait plans of spec version `v0.74.0` were accepted, and now `v0.75.0` is accepted. However, this is a backwards compatible change. The only difference is the inclusion of additional urn-based fields in substrait plans. In a later PR, the old uri-based fields will be dropped, which *will* be a breaking change. --------- Co-authored-by: Andrew Lamb <[email protected]>
## Which issue does this PR close? - Related to apache/arrow-rs#7835 - Closes apache#3666 Note while this PR looks massive, a large portion is display updates due to better display of Fields and DataTypes ## Rationale for this change Upgrade to the latest arrow Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new parquet metadata reader. ## What changes are included in this PR? 1. Update arrow/parquet 2. Update prost 3. Update substrait 4. Update pbjson 5. Make API changes to avoid deprecated APIs ## Are these changes tested? By CI ## Are there any user-facing changes? New arrow
Followup some doc fixes missed in apache#17888
….0 (apache#17866) ## Which issue does this PR close? - Closes apache#17865. ## What changes are included in this PR? Bump the `substrait` version to `v0.75.0` by bumping `substrait-rs` to `v0.60.0`. This PR was originally dependent on [this PR](apache#17888) to update the versions of some common dependencies, but that PR is now merged in. ## Are these changes tested? There are no tests here, but there is no change to any logic within datafusion. It is simply a bump in a dependency. Technically the public API does change, but as noted in the issue description, there is no change to internal logic because uri / urn from substrait plans are not used. ## Are there any user-facing changes? Yes. Previously substrait plans of spec version `v0.74.0` were accepted, and now `v0.75.0` is accepted. However, this is a backwards compatible change. The only difference is the inclusion of additional urn-based fields in substrait plans. In a later PR, the old uri-based fields will be dropped, which *will* be a breaking change. --------- Co-authored-by: Andrew Lamb <[email protected]>
## Which issue does this PR close? - Related to apache/arrow-rs#7835 - Closes apache#3666 Note while this PR looks massive, a large portion is display updates due to better display of Fields and DataTypes ## Rationale for this change Upgrade to the latest arrow Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new parquet metadata reader. ## What changes are included in this PR? 1. Update arrow/parquet 2. Update prost 3. Update substrait 4. Update pbjson 5. Make API changes to avoid deprecated APIs ## Are these changes tested? By CI ## Are there any user-facing changes? New arrow
Followup some doc fixes missed in apache#17888
….0 (apache#17866) ## Which issue does this PR close? - Closes apache#17865. ## What changes are included in this PR? Bump the `substrait` version to `v0.75.0` by bumping `substrait-rs` to `v0.60.0`. This PR was originally dependent on [this PR](apache#17888) to update the versions of some common dependencies, but that PR is now merged in. ## Are these changes tested? There are no tests here, but there is no change to any logic within datafusion. It is simply a bump in a dependency. Technically the public API does change, but as noted in the issue description, there is no change to internal logic because uri / urn from substrait plans are not used. ## Are there any user-facing changes? Yes. Previously substrait plans of spec version `v0.74.0` were accepted, and now `v0.75.0` is accepted. However, this is a backwards compatible change. The only difference is the inclusion of additional urn-based fields in substrait plans. In a later PR, the old uri-based fields will be dropped, which *will* be a breaking change. --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
57.0.0(October 2025) arrow-rs#7835Note while this PR looks massive, a large portion is display updates due to better display of Fields and DataTypes
Rationale for this change
Upgrade to the latest arrow
Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new parquet metadata reader.
What changes are included in this PR?
Are these changes tested?
By CI
Are there any user-facing changes?
New arrow