alamb commented Oct 2, 2025

Which issue does this PR close?

  - Related to apache/arrow-rs#7835
  - Closes #3666

Note: while this PR looks massive, a large portion is display updates due to better display of Fields and DataTypes.

Rationale for this change

Upgrade to the latest arrow

Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new parquet metadata reader.

What changes are included in this PR?

  1. Update arrow/parquet
  2. Update prost
  3. Update substrait
  4. Update pbjson
  5. Make API changes to avoid deprecated APIs

Are these changes tested?

By CI

Are there any user-facing changes?

New arrow

github-actions bot added the common (Related to common crate) label Oct 2, 2025
github-actions bot added the substrait (Changes to the substrait crate) and proto (Related to proto crate) labels Oct 2, 2025
alamb commented Oct 2, 2025

Many of the current failures occur because this used to work:

select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))')

or

SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM t

After the arrow 57 upgrade, it fails with errors like:

statement error DataFusion error: Execution error: Unsupported type 'Timestamp\(Nanosecond, None\)'\. Must be a supported arrow type name such as 'Int32' or 'Timestamp\(ns\)'\. Error expected double quoted string for Timezone, got 'None'
# arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns)

I believe the problem is that the display format of timestamp types has changed to Timestamp(ns), and the FromStr implementation no longer handles the old format.

I think what we need to do is support both formats for backwards compatibility. I will work on filing an upstream issue.
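
A minimal sketch of what accepting both spellings could look like, using only the standard library. This is not the arrow-rs parser: `Unit`, `TimestampType`, `parse_unit`, `parse_tz`, and `parse_timestamp_type` are hypothetical names, and the short unit spellings other than `ns` are assumptions, since only `ns` appears in the output above.

```rust
/// Time units, named after the long spellings used in the old format.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical parsed form of a timestamp type string (not an arrow type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct TimestampType {
    unit: Unit,
    tz: Option<String>,
}

/// Accept both the long unit names and the short ones.
/// Assumption: every short spelling except "ns" is a guess.
fn parse_unit(s: &str) -> Result<Unit, String> {
    match s {
        "Second" | "s" => Ok(Unit::Second),
        "Millisecond" | "ms" => Ok(Unit::Millisecond),
        "Microsecond" | "us" | "µs" => Ok(Unit::Microsecond),
        "Nanosecond" | "ns" => Ok(Unit::Nanosecond),
        other => Err(format!("unknown time unit '{other}'")),
    }
}

/// Accept `None` and `Some("tz")` (old style) as well as `"tz"` (new style).
fn parse_tz(s: &str) -> Result<Option<String>, String> {
    if s == "None" {
        return Ok(None);
    }
    // Strip an optional `Some( ... )` wrapper, then require double quotes.
    let inner = s
        .strip_prefix("Some(")
        .and_then(|rest| rest.strip_suffix(')'))
        .unwrap_or(s)
        .trim();
    inner
        .strip_prefix('"')
        .and_then(|rest| rest.strip_suffix('"'))
        .map(|tz| Some(tz.to_string()))
        .ok_or_else(|| format!("expected double quoted string for timezone, got '{s}'"))
}

/// Parse `Timestamp(<unit>)` or `Timestamp(<unit>, <tz>)` in either style.
fn parse_timestamp_type(s: &str) -> Result<TimestampType, String> {
    let args = s
        .trim()
        .strip_prefix("Timestamp(")
        .and_then(|rest| rest.strip_suffix(')'))
        .ok_or_else(|| format!("not a timestamp type: '{s}'"))?;
    // Split on the first comma only: the unit always comes first.
    let (unit, tz) = match args.split_once(',') {
        Some((unit, tz)) => (unit.trim(), Some(tz.trim())),
        None => (args.trim(), None),
    };
    Ok(TimestampType {
        unit: parse_unit(unit)?,
        tz: match tz {
            Some(tz) => parse_tz(tz)?,
            None => None,
        },
    })
}
```

With this sketch, `parse_timestamp_type("Timestamp(Nanosecond, None)")` and `parse_timestamp_type("Timestamp(ns)")` return the same value, which is the backwards-compatible behaviour described above.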


 // Create Flight client
-let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
+let endpoint = Endpoint::new("http://localhost:50051")?;
Contributor Author:

This is due to new version of tonic


 // add an initial FlightData message that sends schema
 let options = arrow::ipc::writer::IpcWriteOptions::default();
+let mut compression_context = CompressionContext::default();

-let validate =
-T::validate_decimal_precision(new_value, self.target_precision);
+let validate = T::validate_decimal_precision(
Contributor:

I'll add "Closes #3666" to the PR body 👍

-List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
-List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
-List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
+List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8)
Contributor Author:

Many of the diffs in this file are related to improvements in DataType display, tracked in this ticket

I will try to call out individual changes when I see them. Lists are way nicer now.
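
To illustrate the idea behind the shorter output, here is a small standard-library sketch of a `Display` implementation that prints only the element type and its nullability. `Ty` and `list_of` are hypothetical stand-ins, not arrow's `DataType`, and the rendering of a non-nullable list in this sketch is an assumption, since the output above only shows nullable lists.

```rust
use std::fmt;

/// Hypothetical, heavily reduced stand-in for a data type (not arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    Float64,
    Utf8,
    /// A list whose element type is `item`; `nullable` says whether elements may be null.
    List { item: Box<Ty>, nullable: bool },
}

impl fmt::Display for Ty {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Ty::Int64 => write!(f, "Int64"),
            Ty::Float64 => write!(f, "Float64"),
            Ty::Utf8 => write!(f, "Utf8"),
            Ty::List { item, nullable } => {
                // Only the element type and nullability are printed; the field
                // name, metadata, and dictionary settings are left out.
                // Assumption: a non-nullable element gets no prefix at all.
                let prefix = if *nullable { "nullable " } else { "" };
                write!(f, "List({prefix}{item})")
            }
        }
    }
}

/// Helper for building a list with nullable elements.
fn list_of(item: Ty) -> Ty {
    Ty::List { item: Box::new(item), nullable: true }
}
```

For example, `list_of(list_of(Ty::Int64)).to_string()` produces `List(nullable List(nullable Int64))`, matching the new line in the hunk above.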

 05)--------ProjectionExec: expr=[]
 06)----------CoalesceBatchesExec: target_batch_size=8192
-07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
+07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }])

 SELECT arrow_typeof(now()::timestamp)
 ----
-Timestamp(Nanosecond, None)
+Timestamp(ns)
Contributor Author:

Great call. Done in 1e8ddfa and f9606d8


## Timestamps: Create a table

statement ok
Contributor Author:

The timestamp format has changed (improved!) so let's also add tests for the new format

 pbjson-types = { workspace = true }
 prost = { workspace = true }
-substrait = { version = "0.58", features = ["serde"] }
+substrait = { version = "0.59", features = ["serde"] }
Contributor Author:

since prost is updated, we also must update substrait

github-actions bot added the core (Core DataFusion crate) label Oct 2, 2025
alamb force-pushed the alamb/upgrade_arrow_57 branch from 9d06200 to 1b7b559 October 2, 2025 18:56
github-actions bot added the logical-expr (Logical plan and expressions) label Oct 2, 2025
alamb force-pushed the alamb/upgrade_arrow_57 branch from 8ecbbed to d3b328b October 3, 2025 15:48
alamb force-pushed the alamb/upgrade_arrow_57 branch from f61623e to 9f6a390 October 3, 2025 16:04
github-actions bot added the sql (SQL Planner), physical-expr (Changes to the physical-expr crates), optimizer (Optimizer rules), functions (Changes to functions implementation), and physical-plan (Changes to the physical-plan crate) labels Oct 3, 2025
alamb force-pushed the alamb/upgrade_arrow_57 branch from d5bd26e to 7709acc October 3, 2025 20:26
-let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \
-Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }";
-assert_eq!(expected, arrow_schema.to_string());
+insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);
Contributor Author:

many many diffs are due to the changes in formatting of Fields and DataTypes (see below)

+----------------------+
| arrow_typeof(test.l) |
+----------------------+
| List(nullable Int32) |
Contributor Author:

the new display is much easier to read in my opinion

alamb commented Oct 3, 2025

Ok, the tests are now looking good enough to test with the new thrift decoder

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: tpch_mem
Results will be posted here when complete

alamb changed the title from "[WIP] Upgrade to arrow/parquet 57.0.0" to "Upgrade to arrow/parquet 57.0.0" Oct 23, 2025
alamb marked this pull request as ready for review October 23, 2025 16:39
alamb commented Oct 23, 2025

This PR is now ready for review

Jefffrey left a comment

LGTM 👍

Comment on lines 634 to 635
// Can disable the cache even with filter pushdown by setting the size to 0. In this case we
// expect the inner records are reported but no records are read from the cache
Contributor:

nit: wording a bit off here, since it reads as

Can disable the cache even with filter pushdown by setting the size to 0. In this case we no records are read from the cache and no metrics are reported

Should be this maybe?

Can disable the cache even with filter pushdown by setting the size to 0. This results in no records being read from the cache and no metrics being reported

Contributor Author:

Great call -- done in 3a90c6e


Comment on lines -88 to +94
statement error DataFusion error: type_coercion\ncaused by\nError during planning: Cannot coerce arithmetic expression Timestamp\(Nanosecond, Some\("\+00:00"\)\) \+ Utf8 to valid types
statement error
select i_item_desc from test
where d3_date > now() + '5 days';
----
DataFusion error: type_coercion
caused by
Error during planning: Cannot coerce arithmetic expression Timestamp(ns, "+00:00") + Utf8 to valid types
Contributor:

I thought the expected error comes before the query not after, for SLTs 🤔

Contributor Author:

I think it can be either (this change was created by running with --complete)

I believe when the message comes after the query it can contain multiple lines (i.e. if the error itself actually contains `\n`)

alamb changed the title from "Upgrade to arrow/parquet 57.0.0" to "Upgrade DataFusion to arrow/parquet 57.0.0" Oct 24, 2025
alamb self-assigned this Oct 24, 2025
alamb commented Oct 26, 2025

alamb commented Oct 27, 2025

alamb added this pull request to the merge queue Oct 27, 2025
alamb commented Oct 27, 2025

Alright, here we go!

Merged via the queue into apache:main with commit c6ad17c Oct 27, 2025
36 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Oct 28, 2025
github-merge-queue bot pushed a commit that referenced this pull request Oct 31, 2025
….0 (#17866)

## Which issue does this PR close?

- Closes #17865.

## What changes are included in this PR?

Bump the `substrait` version to `v0.75.0` by bumping `substrait-rs` to
`v0.60.0`.

This PR was originally dependent on [this
PR](#17888) to update the
versions of some common dependencies, but that PR is now merged in.

## Are these changes tested?

There are no tests here, but there is no change to any logic within
datafusion. It is simply a bump in a dependency. Technically the public
API does change, but as noted in the issue description, there is no
change to internal logic because uri / urn from substrait plans are not
used.

## Are there any user-facing changes?

Yes. Previously substrait plans of spec version `v0.74.0` were accepted,
and now `v0.75.0` is accepted. However, this is a backwards compatible
change. The only difference is the inclusion of additional urn-based
fields in substrait plans. In a later PR, the old uri-based fields will
be dropped, which *will* be a breaking change.

---------

Co-authored-by: Andrew Lamb <[email protected]>
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
## Which issue does this PR close?

- Related to apache/arrow-rs#7835
- Closes apache#3666

Note while this PR looks massive, a large portion is display updates due
to better display of Fields and DataTypes

## Rationale for this change

Upgrade to the latest arrow

Also, there are several new features in arrow-57 that I want to be able
to test including Variant, arrow-avro, and a new parquet metadata
reader.

## What changes are included in this PR?

1. Update arrow/parquet
2. Update prost
3. Update substrait
4. Update pbjson
5. Make API changes to avoid deprecated APIs

## Are these changes tested?

By CI

## Are there any user-facing changes?
New arrow
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
codetyri0n pushed a commit to codetyri0n/datafusion that referenced this pull request Nov 11, 2025
codetyri0n pushed a commit to codetyri0n/datafusion that referenced this pull request Nov 11, 2025
codetyri0n pushed a commit to codetyri0n/datafusion that referenced this pull request Nov 11, 2025
Labels

common: Related to common crate
core: Core DataFusion crate
datasource: Changes to the datasource crate
documentation: Improvements or additions to documentation
execution: Related to the execution crate
functions: Changes to functions implementation
optimizer: Optimizer rules
physical-expr: Changes to the physical-expr crates
physical-plan: Changes to the physical-plan crate
proto: Related to proto crate
sql: SQL Planner
sqllogictest: SQL Logic Tests (.slt)
substrait: Changes to the substrait crate

Development

Successfully merging this pull request may close these issues.

Incorrect error message for decimal with scale while input value is out of bound

5 participants