[SPARK-40775][SQL] Fix duplicate description entries for V2 file scans #38229
Not sure why JSON was different from the others, but I made it the same by just providing metadata. Updated the related UT below to accommodate.
The overridden description part wasn't limited in characters, but the metadata version is, so I bumped up the max size to make sure the full string can be found.
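To illustrate why a length limit can hide a string that a test searches for, here is a minimal Scala sketch. The `abbreviate` helper is hypothetical and only mimics the general idea of cutting long metadata values; it is not Spark's implementation. The limit in question is presumably the one controlled by `spark.sql.maxMetadataStringLength`, but that is an assumption, since the comment does not name the setting.

```scala
object MetadataLengthSketch {
  // Hypothetical helper: shortens `value` to at most `maxLen` characters,
  // ending in "..." when it has to cut. Assumes maxLen >= 4.
  def abbreviate(value: String, maxLen: Int): String =
    if (value.length <= maxLen) value
    else value.take(maxLen - 3) + "..."

  def main(args: Array[String]): Unit = {
    val pushed = "PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*)]"
    // With a small limit, the tail that a test might look for is cut off.
    println(abbreviate(pushed, 30).contains("COUNT(*)"))
    // With a generous limit, the full string survives and can be matched.
    println(abbreviate(pushed, 1000).contains("COUNT(*)"))
  }
}
```

Raising the limit in a test, rather than loosening the assertion, keeps the check on the full pushed-down list.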
Can one of the admins verify this patch?
cc @dongjoon-hyun @cloud-fan @maropu, you guys were on the original PR review for the metadata addition
can we also update
Ah good call, updated that as well.
Could you rebase this PR to the master in order to make it up-to-date and pass CI, @Kimahriman?
accbbbe to ae9b783
Done (pending CI)
thanks, merging to master!
### What changes were proposed in this pull request?

Remove overriding the description method in the V2 file sources. `FileScan` already uses all the metadata to create the description, so adding the same fields to the overridden description creates duplicates.

### Why are the changes needed?

Example parquet scan from the agg pushdown suite:

Before:
```
+- BatchScan parquet file:/...[min(_3)#814, max(_3)#815, min(_1)#816, max(_1)#817, count(*)#818L, count(_1)#819L, count(_2)#820L, count(_3)#821L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/..., PartitionFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<min(_3):int,max(_3):int,min(_1):int,max(_1):int,count(*):bigint,count(_1):bigint,count(_2)..., PushedFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedGroupBy: [] RuntimeFilters: []
```

After:
```
+- BatchScan parquet file:/...[min(_3)#814, max(_3)#815, min(_1)#816, max(_1)#817, count(*)#818L, count(_1)#819L, count(_2)#820L, count(_3)#821L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/..., PartitionFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<min(_3):int,max(_3):int,min(_1):int,max(_1):int,count(*):bigint,count(_1):bigint,count(_2)... RuntimeFilters: []
```

### Does this PR introduce _any_ user-facing change?

Just description change in explain output.

### How was this patch tested?

Updated a few UTs to accommodate checking explain string.

Closes apache#38229 from Kimahriman/remove-file-source-description.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
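The duplication this PR removes can be sketched in a few lines of Scala. The types below (`SketchScan`, `DuplicatingScan`, `FixedScan`) are simplified stand-ins invented for illustration; they are not Spark's `FileScan` or its subclasses, and the real description format differs. The sketch only shows the pattern: a base class that already renders all metadata, and a subclass that appends the same fields again.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's real classes.
trait SketchScan {
  // Key/value pairs describing the scan (pushed filters, aggregates, ...).
  def metadata: Map[String, String]

  // The base description already renders every metadata entry, sorted by key.
  def description: String =
    metadata.toSeq.sorted.map { case (k, v) => s"$k: $v" }.mkString(", ")
}

// Before: the subclass appends fields that are already part of `metadata`.
class DuplicatingScan extends SketchScan {
  def metadata: Map[String, String] =
    Map("PushedFilters" -> "[]", "PushedGroupBy" -> "[]")
  override def description: String =
    super.description + ", PushedFilters: [], PushedGroupBy: []"
}

// After: the subclass only supplies metadata and inherits `description`.
class FixedScan extends SketchScan {
  def metadata: Map[String, String] =
    Map("PushedFilters" -> "[]", "PushedGroupBy" -> "[]")
}

object DescriptionSketch {
  // Counts occurrences of `needle` in `haystack` (hypothetical test helper).
  def occurrences(haystack: String, needle: String): Int =
    haystack.sliding(needle.length).count(_ == needle)

  def main(args: Array[String]): Unit = {
    println(new DuplicatingScan().description)
    println(new FixedScan().description)
  }
}
```

Keeping the fields in `metadata` only means any new entry shows up in the description once, without each subclass having to remember to print it.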