-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy #47483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…columns in FileSourceStrategy Compute the schema of the data without partition columns only once in FileSourceStrategy. In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (`partitionSet`) and using the attributes directly (`partitionColumns`) These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here. No Existing tests No Closes apache#46619 from johanl-db/reuse-schema-without-partition-columns. Authored-by: Johan Lasperas <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
|
This fixes a regression caused by https://github.com/apache/spark/pull/46565/files#diff-fbc6da30b8372e4f9aeb35ccf0d39eb796715d192c7eaeab109376584de0790eR121 , and make Delta Lake pass all tests. I think we should include it in 3.5.2. cc @yaooqinn |
|
Do we need this for branch 3.4? |
…columns in FileSourceStrategy ### What changes were proposed in this pull request? (Cherry-pick of 57948c8 to branch-3.5) Compute the schema of the data without partition columns only once in FileSourceStrategy. ### Why are the changes needed? In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (`partitionSet`) and using the attributes directly (`partitionColumns`) These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Authored-by: Johan Lasperas <johan.lasperasdatabricks.com> Closes #47483 from vkorukanti/partitionCols. Authored-by: Johan Lasperas <[email protected]> Signed-off-by: Kent Yao <[email protected]>
|
Merged to branch-3.5, thank you all |
|
It's fine to skip 3.4 as #46565 was not merged to 3.4 either. |
|
Thank you @cloud-fan |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great to find and fix this during RC2 vote period. Thank you, @vkorukanti , @cloud-fan , @yaooqinn .
…columns in FileSourceStrategy (apache#538) ### What changes were proposed in this pull request? (Cherry-pick of 57948c8 to branch-3.5) Compute the schema of the data without partition columns only once in FileSourceStrategy. ### Why are the changes needed? In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (`partitionSet`) and using the attributes directly (`partitionColumns`) These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Authored-by: Johan Lasperas <johan.lasperasdatabricks.com> Closes apache#47483 from vkorukanti/partitionCols. Authored-by: Johan Lasperas <[email protected]> Signed-off-by: Kent Yao <[email protected]> Co-authored-by: Johan Lasperas <[email protected]>
What changes were proposed in this pull request?
(Cherry-pick of 57948c8 to branch-3.5)
Compute the schema of the data without partition columns only once in FileSourceStrategy.
Why are the changes needed?
In FileSourceStrategy, the schema of the data excluding partition columns is computed 2 times in a slightly different way, using an AttributeSet (
partitionSet) and using the attributes directly (partitionColumns) These don't have the exact same semantics, AttributeSet will only use expression ids for comparison while comparing with the actual attributes will use the name, type, nullability and metadata. We want to use the former here.Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests
Was this patch authored or co-authored using generative AI tooling?
No
Authored-by: Johan Lasperas [email protected]