Skip to content

Conversation

@Kimahriman
Copy link
Contributor

What changes were proposed in this pull request?

Remove overriding the description method in the V2 file sources. FileScan already uses all the metadata to create the description, so adding the same fields to the overridden description creates duplicates.

Why are the changes needed?

Example parquet scan from the agg pushdown suite:

Before:

+- BatchScan parquet file:/...[min(_3)#814, max(_3)#815, min(_1)#816, max(_1)#817, count(*)#818L, count(_1)#819L, count(_2)#820L, count(_3)#821L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/..., PartitionFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<min(_3):int,max(_3):int,min(_1):int,max(_1):int,count(*):bigint,count(_1):bigint,count(_2)..., PushedFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedGroupBy: [] RuntimeFilters: []

After:

 +- BatchScan parquet file:/...[min(_3)#814, max(_3)#815, min(_1)#816, max(_1)#817, count(*)#818L, count(_1)#819L, count(_2)#820L, count(_3)#821L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/..., PartitionFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<min(_3):int,max(_3):int,min(_1):int,max(_1):int,count(*):bigint,count(_1):bigint,count(_2)... RuntimeFilters: []

Does this PR introduce any user-facing change?

Just description change in explain output.

How was this patch tested?

Updated a few UTs to accommodate checking explain string.

@github-actions github-actions bot added the SQL label Oct 13, 2022
Comment on lines -94 to +95
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why JSON was different than the others but made it the same by just providing metadata. Updated related UT below to accommodate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overridden description part wasn't limited in chars, but the metadata version is, so just bump up the max size to make sure the full string can be found

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@Kimahriman
Copy link
Contributor Author

cc @dongjoon-hyun @cloud-fan @maropu, you guys were on the original PR review for the metadata addition

@cloud-fan
Copy link
Contributor

can we also update AvroScan ?

@Kimahriman
Copy link
Contributor Author

can we also update AvroScan ?

Ah good call, updated that as well.

@github-actions github-actions bot added the AVRO label Oct 31, 2022
@dongjoon-hyun
Copy link
Member

Could you rebase this PR to the master in order to make it up-to-date and pass CI, @Kimahriman ?

@Kimahriman Kimahriman force-pushed the remove-file-source-description branch from accbbbe to ae9b783 Compare December 11, 2022 00:37
@Kimahriman
Copy link
Contributor Author

Done (pending CI)

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1f4c8e4 Dec 12, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
### What changes were proposed in this pull request?

Remove overriding the description method in the V2 file sources. `FileScan` already uses all the metadata to create the description, so adding the same fields to the overridden description creates duplicates.

### Why are the changes needed?

Example parquet scan from the agg pushdown suite:

Before:
```
+- BatchScan parquet file:/...[min(_3)apache#814, max(_3)apache#815, min(_1)apache#816, max(_1)apache#817, count(*)#818L, count(_1)#819L, count(_2)#820L, count(_3)#821L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/..., PartitionFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<min(_3):int,max(_3):int,min(_1):int,max(_1):int,count(*):bigint,count(_1):bigint,count(_2)..., PushedFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedGroupBy: [] RuntimeFilters: []
```

After:
```
 +- BatchScan parquet file:/...[min(_3)apache#814, max(_3)apache#815, min(_1)apache#816, max(_1)apache#817, count(*)#818L, count(_1)#819L, count(_2)#820L, count(_3)#821L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/..., PartitionFilters: [], PushedAggregation: [MIN(_3), MAX(_3), MIN(_1), MAX(_1), COUNT(*), COUNT(_1), COUNT(_2), COUNT(_3)], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<min(_3):int,max(_3):int,min(_1):int,max(_1):int,count(*):bigint,count(_1):bigint,count(_2)... RuntimeFilters: []
```

### Does this PR introduce _any_ user-facing change?

Just description change in explain output.

### How was this patch tested?

Updated a few UTs to accommodate checking explain string.

Closes apache#38229 from Kimahriman/remove-file-source-description.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants