Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion offers sophisticated "filter pushdown" optimizations into `LogicalPlan::TableScan` by passing predicates into `TableProvider::scan`.
This ticket tracks the work to make use of these predicates in the table provider for parquet files, `ParquetFileReader`. Much of this work had already been completed by the time this ticket was written, but I wanted to capture it here to show both how far DataFusion has come and how close we are to done.
There are three types of predicate pushdown:
- Prune row groups based on statistics (do not fetch or decode any pages)
- Prune column pages based on page level statistics, and skip decode of corresponding positions in other columns
- Prune row indexes based on `Expr` predicates, and skip decode of corresponding positions in other columns
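
To make the three levels concrete, here is a minimal sketch in plain Rust. It does not use the DataFusion or arrow-rs APIs: `Stats`, `Between`, and the three functions are hypothetical names invented for illustration, and the predicate is assumed to be a simple `low <= column <= high` range on a single integer column.

```rust
use std::ops::Range;

/// Min/max statistics for one column over a run of rows.
/// The same shape is used here for a whole row group and for a single page.
/// (Hypothetical type, not the parquet crate's statistics type.)
#[derive(Debug, Clone, Copy)]
struct Stats {
    min: i64,
    max: i64,
    num_rows: usize,
}

/// Hypothetical predicate: `low <= column <= high`.
struct Between {
    low: i64,
    high: i64,
}

impl Between {
    /// Returns false only when the statistics prove that no row can match.
    fn may_match(&self, stats: &Stats) -> bool {
        stats.max >= self.low && stats.min <= self.high
    }

    /// Exact evaluation on a decoded value.
    fn matches(&self, value: i64) -> bool {
        value >= self.low && value <= self.high
    }
}

/// Level 1: indexes of row groups that may contain matching rows.
/// Row groups not returned are never fetched or decoded.
fn prune_row_groups(pred: &Between, row_groups: &[Stats]) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| pred.may_match(stats))
        .map(|(idx, _)| idx)
        .collect()
}

/// Level 2: row ranges (within one row group) covered by pages that may match.
/// The same ranges tell readers of the other columns which positions to skip.
fn select_page_rows(pred: &Between, pages: &[Stats]) -> Vec<Range<usize>> {
    let mut selected = Vec::new();
    let mut start = 0;
    for page in pages {
        let end = start + page.num_rows;
        if pred.may_match(page) {
            selected.push(start..end);
        }
        start = end;
    }
    selected
}

/// Level 3: evaluate the predicate on decoded values and return the row
/// indexes that pass, so other columns are decoded only at those positions.
fn filter_rows(pred: &Between, values: &[i64]) -> Vec<usize> {
    values
        .iter()
        .enumerate()
        .filter(|(_, value)| pred.matches(**value))
        .map(|(idx, _)| idx)
        .collect()
}
```

Each level is more precise but more expensive than the one before it: the first two only consult metadata and can only rule rows out, while the third must decode the predicate column before it can skip work in the remaining columns.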
Work Items
- Support `RowFilter` in `ParquetExec` #3360
- Add benchmarks for parquet queries with filter pushdown enabled #3457
- Support using offset index in `ParquetRecordBatchStream` when pushing down `RowFilter` #3456
- Implement parquet page-level skipping with column index, using min/max stats #847
- Enable parquet filter pushdown (`filter_pushdown`) by default #3463
- Add metadata_size_hint for optimistic fetching of parquet metadata #2946
- Consider adopting IOx ObjectStore abstraction #2489 / `ParquetRecordBatchStream`
- Implement parquet page-level skipping with column index, using min/ma… #3780
- Enable parquet page level skipping (page index pruning) by default #4085
- Support pushdown multi-columns in PageIndex pruning. #3834
- Support parquet page filtering for more types: String, Binary(Decimal), Int96 #3833
- Add additional testing to parquet predicate pushdown integration tests #4087
- Correctness integration test for parquet filter pushdown #3976
- Write a blog about parquet predicate pushdown #3464
- Support parquet page filtering on min_max for `decimal128` and `string` columns #4255
- Page index pruning fail on complex_expr #4317
Related arrow-rs items:
- Permit parallel fetching of column chunks in `ParquetRecordBatchStream` arrow-rs#2110
- Support peek_next_page() and skip_next_page in serialized_reader. arrow-rs#2044
- Add get_byte_ranges method to AsyncFileReader trait arrow-rs#2115
- Add Parquet RowFilter API arrow-rs#2335
- Add ParquetRecordBatchReaderBuilder (#2427) arrow-rs#2435
- Make Parquet reader filter APIs public (#1792) arrow-rs#2467
- Add Page Row Count Limit arrow-rs#2941
- `parquet::arrow::arrow_writer::ArrowWriter` ignores page size properties arrow-rs#2853