Skip to content

Conversation

@vkorukanti
Copy link
Member

@vkorukanti vkorukanti commented Jul 25, 2024

What changes were proposed in this pull request?

(Cherry-pick of 57948c8 to branch-3.5)

Compute the schema of the data without partition columns only once in FileSourceStrategy.

Why are the changes needed?

In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (partitionSet) and using the attributes directly (partitionColumns) These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

No

Authored-by: Johan Lasperas [email protected]

…columns in FileSourceStrategy

Compute the schema of the data without partition columns only once in FileSourceStrategy.

In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (`partitionSet`) and using the attributes directly (`partitionColumns`)
These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here.

No

Existing tests

No

Closes apache#46619 from johanl-db/reuse-schema-without-partition-columns.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@github-actions github-actions bot added the SQL label Jul 25, 2024
@cloud-fan
Copy link
Contributor

This fixes a regression caused by https://github.com/apache/spark/pull/46565/files#diff-fbc6da30b8372e4f9aeb35ccf0d39eb796715d192c7eaeab109376584de0790eR121 , and make Delta Lake pass all tests. I think we should include it in 3.5.2. cc @yaooqinn

@yaooqinn
Copy link
Member

Do we need this for branch 3.4?

yaooqinn pushed a commit that referenced this pull request Jul 25, 2024
…columns in FileSourceStrategy

### What changes were proposed in this pull request?

(Cherry-pick of 57948c8 to branch-3.5)

Compute the schema of the data without partition columns only once in FileSourceStrategy.

### Why are the changes needed?

In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (`partitionSet`) and using the attributes directly (`partitionColumns`) These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Authored-by: Johan Lasperas <johan.lasperasdatabricks.com>

Closes #47483 from vkorukanti/partitionCols.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
@yaooqinn
Copy link
Member

Merged to branch-3.5, thank you all

@yaooqinn yaooqinn closed this Jul 25, 2024
@cloud-fan
Copy link
Contributor

It's fine to skip 3.4 as #46565 was not merged to 3.4 either.

@yaooqinn
Copy link
Member

Thank you @cloud-fan

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to find and fix this during RC2 vote period. Thank you, @vkorukanti , @cloud-fan , @yaooqinn .

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…columns in FileSourceStrategy (apache#538)

### What changes were proposed in this pull request?

(Cherry-pick of 57948c8 to branch-3.5)

Compute the schema of the data without partition columns only once in FileSourceStrategy.

### Why are the changes needed?

In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (`partitionSet`) and using the attributes directly (`partitionColumns`) These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Authored-by: Johan Lasperas <johan.lasperasdatabricks.com>

Closes apache#47483 from vkorukanti/partitionCols.

Authored-by: Johan Lasperas <[email protected]>

Signed-off-by: Kent Yao <[email protected]>
Co-authored-by: Johan Lasperas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants