[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader #13371
Conversation
Test build #59549 has finished for PR 13371 at commit

retest this please.

Test build #59550 has finished for PR 13371 at commit

also cc @yhuai
```scala
new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId)

// Try to push down filters when filter push-down is enabled.
// Notice: This push-down is RowGroups level, not individual records.
```
Can you provide a link to the doc saying it is row group level?
(it is not obvious to know this is just for row group level)
Also, does parquet support row group level predicate evaluation?
We use org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups in SpecificParquetRecordReaderBase to do the filtering. The implementation of RowGroupFilter is at https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java. From this, it looks like it does filtering at the row-group level.
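For intuition, parquet-mr's RowGroupFilter compares a predicate against per-column statistics (min/max) stored in each row group's footer metadata, dropping whole groups that provably cannot contain a match. The sketch below is a hypothetical toy model of that idea — the names and data shapes are illustrative, not parquet-mr's actual API:

```python
# Hypothetical sketch of row-group level predicate filtering, in the spirit of
# parquet-mr's RowGroupFilter: each row group carries per-column min/max
# statistics, and a whole group is skipped when those statistics prove the
# predicate cannot match any row in it.
from dataclasses import dataclass

@dataclass
class RowGroup:
    num_rows: int
    min_max: dict  # column name -> (min, max) statistics

def might_match_gt(group, column, threshold):
    """Keep the group unless its max proves `column > threshold` is impossible."""
    if column not in group.min_max:
        return True  # no statistics for this column: must read the group
    _, col_max = group.min_max[column]
    return col_max > threshold

def filter_row_groups(groups, column, threshold):
    return [g for g in groups if might_match_gt(g, column, threshold)]

groups = [
    RowGroup(1000, {"id": (0, 499)}),
    RowGroup(1000, {"id": (500, 999)}),
    RowGroup(1000, {"id": (1000, 1499)}),
]
# Predicate `id > 900`: the first group (max id 499) is dropped wholesale,
# without decoding any of its records.
kept = filter_row_groups(groups, "id", 900)
```

Note the asymmetry: statistics can only prove a group irrelevant, never that every row matches, so surviving groups still go through record-level filtering.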
Besides, we use the metadata in the merged schema to figure out whether a field is optional (i.e. not present in all parquet files) before deciding to push down a filter on it, and this info is currently ignored in FileSourceStrategy. Without the fix in this change, the pushed-down row-group level filtering would fail because of fields that don't exist in a given parquet file.
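The merged-schema concern can be sketched as follows: a filter is only safe to push down to a given file if every field it references exists in that file's own schema, since a file cannot evaluate statistics for a column it does not contain. This is a toy model with a made-up expression encoding and helper names, not Spark's actual Filter API:

```python
# Toy model of the check this change cares about: with schema merging, a field
# may be present in the merged (global) schema but absent from some physical
# parquet files. A filter referencing such a field must not be pushed down to
# the files that lack it.

def referenced_fields(expr):
    """Collect column names referenced by a toy filter expression tree."""
    kind = expr[0]
    if kind == "col":
        return {expr[1]}
    if kind == "lit":
        return set()
    # binary nodes: ("and" | "or" | "gt" | "eq", left, right)
    return referenced_fields(expr[1]) | referenced_fields(expr[2])

def can_push_down(expr, file_schema_fields):
    """Push down only if the file itself contains every referenced field."""
    return referenced_fields(expr).issubset(file_schema_fields)

old_file = {"id", "value"}              # written before "new_col" existed
new_file = {"id", "value", "new_col"}   # written after the schema evolved
flt = ("gt", ("col", "new_col"), ("lit", 5))

ok_new = can_push_down(flt, new_file)   # safe to push down
ok_old = can_push_down(flt, old_file)   # must not push down
```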
Can you provide a test case that shows the problem? Also, can you provide benchmark results of the performance benefit?

@yhuai As you can see, this is not fixing a bug, so I think it might be hard to provide a test case for it. I will try to do the benchmark.

BTW, I can't see any reason not to add a row-group level filter for parquet.

It is a good idea to add it if parquet supports it (I have the impression that parquet does not support it, but maybe I am wrong). I think having benchmark results is good practice, so we can avoid hitting any obvious issue.

@yhuai I've run a simple benchmark as follows: Before this patch: After this patch:

ping @yhuai I've addressed the comments. Please take a look again. Thanks!

ping @yhuai again

cc @rxin Can you also take a look at this? It has been sitting for a while too. Thanks!

cc @cloud-fan too.
…-push-down-filter Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
ping @yhuai @rxin @cloud-fan

Test build #60246 has finished for PR 13371 at commit
Is this a bug fix or a performance fix? Sorry, I don't really understand after reading your description.

It is not really a bug fix, because everything still works without this filter push-down. It should be a performance fix. I will modify the description.

retest this please.

The description is updated.

Test build #60256 has finished for PR 13371 at commit
@viirya I took a look at parquet's code. It seems parquet only evaluates row group level filters when generating splits (https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L673). With FileSourceStrategy in Spark, I am not sure we will actually filter out unneeded row groups as expected. Can you take a look? Also, it would be great if you could add a test to make sure that we actually skip unneeded row groups. This test can be created as follows.
@yhuai Parquet also does this filtering in ParquetRecordReader (https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L160) and ParquetReader (https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L147). In Spark, we also do this in SpecificParquetRecordReaderBase (line 103 in f958c1c). I've manually tested it as I mentioned above, but it would be good to have a formal test case for it, as you said. I will try to add it later, maybe when I come back to work in a few days...

@yhuai Your step 3 may not work. We are going to filter the row groups for each parquet file to read in
@yhuai We used to support row group level filter push-down before refactoring. And yes, both do.

This LGTM. Thanks for fixing it! Merging to master and 2.0.
…quet reader

## What changes were proposed in this pull request?

The base class `SpecificParquetRecordReaderBase` used for the vectorized parquet reader will try to get pushed-down filters from the given configuration. These pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch fixes this and sets up the filters for pushing down to the configuration for the reader.

## How was this patch tested?

Existing tests should pass.

Author: Liang-Chi Hsieh <[email protected]>

Closes #13371 from viirya/vectorized-reader-push-down-filter.

(cherry picked from commit bba5d79)
Signed-off-by: Cheng Lian <[email protected]>
I just talked to @liancheng offline. I don't think we should have merged this until we had verified there is no performance regression, and we definitely shouldn't have merged this into 2.0. @liancheng can you revert this from both master and branch-2.0? @viirya can you run some parquet scan benchmarks and make sure this does not result in a perf regression?

To be more clear, please write a proper benchmark that reads data when filter push-down is not useful, to check whether this regresses performance for the non-push-down case. Also make sure the benchmark does not include the time it takes to write the parquet data.
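The shape of such a benchmark — data prepared once outside the timed region, then only the scan timed, with a deliberately non-selective predicate for the non-push-down case — might look like this stand-alone sketch. It is a plain-Python stand-in, not Spark's benchmark harness; the phases, not the APIs, are the point:

```python
# Sketch of the requested methodology: prepare data before any timing starts,
# then time only the scan, once with a selective predicate and once with a
# predicate that matches every row (the case where push-down cannot help).
import time

def prepare_data(n):
    # Stand-in for writing the parquet file; deliberately not timed.
    return list(range(n))

def scan(data, predicate):
    return sum(1 for row in data if predicate(row))

data = prepare_data(1_000_000)  # "write" phase, excluded from all timings

def timed_scan(predicate):
    start = time.perf_counter()
    matched = scan(data, predicate)
    return matched, time.perf_counter() - start

selective, t_selective = timed_scan(lambda r: r > 999_000)   # filter helps
unselective, t_unselective = timed_scan(lambda r: r >= 0)    # filter useless
# Compare each timing against the same run on a baseline build to spot a regression.
```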
And once we have more data, it might make sense to merge this into 2.0!

@rxin One thing that needs to be explained: because we have just one configuration to control filter push-down, it affects both row-based filter push-down and this row-group filter push-down. The benchmark I posted above ran against this patch (so with both push-downs) and the master branch (row-based only, without this patch). It does include the time to write the parquet data; I will change that. I want to confirm whether this kind of benchmark is enough?
Reverted from master and branch-2.0. @viirya For the benchmark, there are two things:

@liancheng Got it.
I reran the benchmark, excluding the time of writing the Parquet file:

After this patch: Before this patch:

Next, I ran the benchmark for the non-pushdown case, using the same benchmark code but with the pushdown configuration disabled.

After this patch: Before this patch:

For the non-pushdown case, from these results, I think this patch doesn't affect the normal code path.
Can you add results showing that there are skipped row groups with this change (and that before this patch all row groups were loaded)? Let's also put those results in the description of the new PR.

@yhuai ok. Do you mean I need to create a new PR for this?
Yea. Since this one was closed by asfgit, I am not sure you can reopen it.
@viirya One problem in your new benchmark code is that… Anyway, the generated Parquet file probably only contains a single row group; I guess that's why the numbers are so close to each other whether or not you enable row-group filter push-down.
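A back-of-envelope sketch of why a single row group defeats the feature: a parquet writer starts a new row group roughly every `parquet.block.size` bytes (128 MB by default in parquet-mr), so a small benchmark dataset lands in one row group and row-group filtering can never skip anything. The helper below is illustrative arithmetic only, not parquet's actual sizing logic (which works on buffered, encoded page sizes):

```python
# Rough estimate of how many row groups a file will contain, assuming the
# writer cuts a new group every block_size bytes of raw data. Real parquet
# sizing depends on encoding and compression, so treat this as a sanity
# check, not a prediction.

def estimated_row_groups(total_bytes, block_size=128 * 1024 * 1024):
    return max(1, -(-total_bytes // block_size))  # ceiling division

small = estimated_row_groups(10 * 1024 * 1024)     # ~10 MB test file
large = estimated_row_groups(1024 * 1024 * 1024)   # ~1 GB file
```

With a single row group there is exactly one set of statistics for the whole file, so the benchmark measures only record-level filtering either way.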
@liancheng Thanks! I didn't notice that. I will rerun the benchmark. I've re-submitted this PR at #13701.
What changes were proposed in this pull request?

The base class `SpecificParquetRecordReaderBase`, used for the vectorized parquet reader, will try to get pushed-down filters from the given configuration. These pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch fixes this and sets up the filters for pushing down to the configuration for the reader.

How was this patch tested?

Existing tests should pass.