
Wrong results when parquet page index filtering is enabled #4002

@alamb

Describe the bug
When page index filtering is enabled, queries return incorrect results.

Note that page index filtering is not enabled by default (we are still working on it), so this issue is unlikely to affect users.

To Reproduce

  1. Download data from repro.zip
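  2. Run datafusion CLI on script.sql, once with page index filtering disabled and once with it enabled (see the sessions below)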

Expected behavior
The same answer should be produced with and without page index filtering enabled. However, the answers are different.

Without page index filtering, the final query counts 15963 rows:

(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=false datafusion-cli -f script.sql 
DataFusion CLI v13.0.0
0 rows in set. Query took 0.001 seconds.
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.parquet.enable_page_index  | false   |
| datafusion.execution.parquet.pushdown_filters   | false   |
| datafusion.execution.parquet.reorder_filters    | false   |
| datafusion.execution.time_zone                  | UTC     |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.explain.physical_plan_only           | false   |
| datafusion.optimizer.filter_null_join_keys      | false   |
| datafusion.optimizer.max_passes                 | 3       |
| datafusion.optimizer.skip_failed_rules          | true    |
+-------------------------------------------------+---------+
12 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 53819           |
+-----------------+
1 row in set. Query took 0.002 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 15963           |
+-----------------+
1 row in set. Query took 0.002 seconds.

With page index filtering, the same query counts 0 rows 😱

(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=true datafusion-cli -f script.sql 
DataFusion CLI v13.0.0
0 rows in set. Query took 0.001 seconds.
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.parquet.enable_page_index  | true    |
| datafusion.execution.parquet.pushdown_filters   | false   |
| datafusion.execution.parquet.reorder_filters    | false   |
| datafusion.execution.time_zone                  | UTC     |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.explain.physical_plan_only           | false   |
| datafusion.optimizer.filter_null_join_keys      | false   |
| datafusion.optimizer.max_passes                 | 3       |
| datafusion.optimizer.skip_failed_rules          | true    |
+-------------------------------------------------+---------+
12 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 53819           |
+-----------------+
1 row in set. Query took 0.002 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.002 seconds.
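
The comparison above can be automated with a small shell sketch. It assumes datafusion-cli is on the PATH and script.sql is in the current directory, and that the result of interest is the last single-column numeric row in the output. The function names (last_count, run_with_page_index, compare_runs) are hypothetical helpers, not part of DataFusion.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Read datafusion-cli output on stdin and print the value of the last
# single-cell numeric table row, e.g. "| 15963           |" -> 15963.
# Two-column rows (the settings table) and headers are ignored.
last_count() {
  awk -F'|' 'NF == 3 && $2 ~ /^ *[0-9]+ *$/ { gsub(/ /, "", $2); v = $2 } END { print v }'
}

# Run script.sql with page index filtering set to $1 (true or false),
# using the same environment variable as the sessions above.
run_with_page_index() {
  DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX="$1" datafusion-cli -f script.sql
}

# Run both configurations and report whether the final counts agree.
compare_runs() {
  without=$(run_with_page_index false | last_count)
  with=$(run_with_page_index true | last_count)
  if [ "$without" = "$with" ]; then
    echo "OK: both runs report $without"
  else
    echo "MISMATCH: without=$without with=$with"
    return 1
  fi
}
```

Calling compare_runs from the directory that contains the repro files prints a MISMATCH line and returns non-zero while the bug is present.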

Additional context
I found this issue and reproducer while working on the integration test #3976.

I suspect @Ted-Jiang is already working on this issue.

Labels: bug
