[SPARK-15639][SPARK-16321][SQL] Push down filter at RowGroups level for parquet reader #13701
Conversation
…-push-down-filter Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
…-push-down-filter
@yhuai I am not sure the row-group info is exposed. But I've manually output the `totalRowCount` in `SpecificParquetRecordReaderBase`.
…-push-down-filter2 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
Test build #60628 has finished for PR 13701 at commit
Test build #60627 has finished for PR 13701 at commit
@liancheng I've updated the benchmark. Please take a look. Thanks.
Just realized my PR #13728 is related to your PR, especially the description of the two configurations.
@gatorsmile yea.
Based on my understanding, this performance result is highly affected by the number of row groups, the number of pruned groups, the average number of rows per group, and more. Do you have any thoughts on how to control the number of rows per group? Based on the test cases in https://github.com/gatorsmile/spark/blob/9967cc72545324e7a542fcf1b49372d977c0011b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L519-L547, it sounds like each group has only one row... This is strange to me. Do you know how to control it?
@gatorsmile No, I have no idea currently. To make sure the pushed-down filter is working, I've manually checked the `totalRowCount`.
See the comment by @HyukjinKwon: the filter is applied more than once. That means it could be expensive if the filter does not prune a lot of rows. Not sure if my understanding is right.
@rdblue Not sure if you can help us provide more details on how to measure the performance benefits? Thank you very much!
ping @liancheng @yhuai I've updated the benchmark results. Please see if you have other thoughts. Thanks!
@liancheng Is it possible to merge this before the 2.0 release?
@liancheng @yhuai How about this? Is it ready to merge? Thanks!
…-push-down-filter2 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
ping @liancheng @yhuai again...
Test build #61143 has finished for PR 13701 at commit
retest this please.
      case _ => c
    }
  }.getOrElse(c)
}
Do we need this?
I guess a better question is whether it is part of the bug fix?
We use metadata in the merged schema to mark optional fields (those not existing in all partitions), but the metadata is lost after resolving. If we don't add it back, the pushed-down filters will fail with a non-existing-field error.
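For illustration, a minimal sketch of how the marker can be dropped (the metadata key name "optional" is hypothetical, not the constant Spark uses internally):

import org.apache.spark.sql.types._

// Mark a field the way a merged schema can flag one that is missing from
// some partitions (illustrative key name only).
val optionalMeta = new MetadataBuilder().putBoolean("optional", true).build()
val field = StructField("a", IntegerType, nullable = true, metadata = optionalMeta)

// Rebuilding the field without copying metadata (as resolution can do)
// silently drops the marker, which is what this change guards against.
val resolved = StructField(field.name, field.dataType, field.nullable)
assert(!resolved.metadata.contains("optional"))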
@viirya Could you add a regression test for this (it's easy to forget)?
When I added a regression test, I found that this change is not needed now. I am sure it didn't work without this change when I submitted this PR; I'm not sure which recent commit keeps the metadata after resolving. I will still add the regression test.
Thank you for the testing. Can you also test the case where a file contains multiple row groups and we can avoid scanning the unneeded ones? Also, since this is not fixing a critical bug, let's not merge it into branch-2.0.
@yhuai As I mentioned in the description, I am not sure we can manipulate row groups as we want, but I have manually tested it to show the actual number of scanned rows.
Test build #61147 has finished for PR 13701 at commit
Sorry, I am not sure I get it. We can set the row group size to a small value; then it will not be hard to create a Parquet file with multiple row groups.
@yhuai OK, I will give it a try. But it looks like I can only test it manually?
@gatorsmile, sorry for the delay, I was evidently not getting notifications until I changed some settings yesterday. There are a few tests in Parquet that generate files with test data that would be appropriate. The simplest one is TestReadWriteEncodingStats, where you can see how to build a file with at least one column that has stats you can verify filters with. To get it working for row groups, you'd just need to set the row group size to something small, like 8k, and bump up the number of records until you get a few groups. Unfortunately, we don't have any pre-built test cases you can use for a performance baseline, though. I think the best you can do with the data generation approach is validate that row groups are being filtered using an approach like you already have, counting the number of rows.
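For reference, a minimal sketch of the data-generation approach described above (the path, column name, and sizes are assumptions; "spark" is an ambient SparkSession, as in Spark's tests):

// Shrink the Parquet row group size so a modest amount of data spans
// several row groups.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024)

val path = "/tmp/multi-row-group-parquet"
spark.range(0, 100000).toDF("a").write.mode("overwrite").parquet(path)

// No row group contains a < 0, so statistics should prune all of them and
// the count of scanned rows stays at zero.
assert(spark.read.parquet(path).filter("a < 0").count() == 0)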
Test build #63337 has finished for PR 13701 at commit
ping @davies Please review the latest changes. Thanks.
| "scanTime" -> SQLMetrics.createTimingMetric(sparkContext, "scan time")) | ||
| "scanTime" -> SQLMetrics.createTimingMetric(sparkContext, "scan time"), | ||
| // Only for test purpose. | ||
| "numRowGroups" -> SQLMetrics.createMetric(sparkContext, "numRowGroups")) |
Can we create this only for unit tests (i.e., create it manually in the test case)?
Test build #63410 has finished for PR 13701 at commit
Test build #63412 has finished for PR 13701 at commit
LGTM, could you fix the conflict (should be trivial)?
…-push-down-filter2 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@davies Thanks. I've fixed it. Waiting for the Jenkins tests.
Test build #63492 has finished for PR 13701 at commit
@davies The accumulator could be released early by the JVM as an optimization. I made a new change to prevent that.
Test build #63503 has finished for PR 13701 at commit
an unrelated test failed...
retest this please.
Test build #63512 has finished for PR 13701 at commit
@davies Because the accumulators for the vectorized settings "true" and "false" share the same name "numRowGroups", the API that looks up the accumulator may return either of them, making the test unstable. By removing it from
Test build #63532 has finished for PR 13701 at commit
Merging this into master and 2.0, thanks!
… for parquet reader
The base class `SpecificParquetRecordReaderBase`, used by the vectorized Parquet reader, tries to get pushed-down filters from the given configuration. These pushed-down filters are used for RowGroups-level filtering. However, we don't set the filters to push down into the configuration, so the filters are never actually pushed down for RowGroups-level filtering. This patch fixes that by setting up the filters for push-down in the configuration for the reader.
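A minimal sketch (not the patch itself) of what setting up the filters in the configuration means, using Parquet's public filter API:

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

// Translate a source filter such as "_1 < 100" (on an INT32 column) into a
// Parquet FilterPredicate and record it in the Hadoop configuration.
val hadoopConf = new Configuration()
val predicate = FilterApi.lt(FilterApi.intColumn("_1"), Integer.valueOf(100))
ParquetInputFormat.setFilterPredicate(hadoopConf, predicate)

// SpecificParquetRecordReaderBase reads the predicate back from this
// configuration and uses row-group statistics to skip groups that cannot
// match; that is the RowGroups-level filtering this patch enables.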
The benchmark that excludes the time of writing the Parquet file:
test("Benchmark for Parquet") {
val N = 500 << 12
withParquetTable((0 until N).map(i => (101, i)), "t") {
val benchmark = new Benchmark("Parquet reader", N)
benchmark.addCase("reading Parquet file", 10) { iter =>
sql("SELECT _1 FROM t where t._1 < 100").collect()
}
benchmark.run()
}
}
`withParquetTable` by default runs tests for both the vectorized and non-vectorized readers. I only let it run the vectorized reader, as sketched below.
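For reproducibility, a sketch of the relevant settings (the config keys are real; how they are wired into the test harness is assumed):

// Run only the vectorized Parquet reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

// Use a tiny Parquet block size so the written file has many row groups.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 1024)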
When we set the Parquet block size to 1024 so that there are multiple row groups, the benchmark is:
Before this patch:
The retrieved row groups: 8063
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
reading Parquet file 825 / 1233 2.5 402.6 1.0X
After this patch:
The retrieved row groups: 0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
reading Parquet file 306 / 503 6.7 149.6 1.0X
Next, I ran the benchmark for the non-pushdown case using the same benchmark code but with the pushdown configuration disabled. This time the Parquet block size is the default value.
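The non-pushdown run only flips Spark's Parquet push-down switch, roughly:

// Disable Parquet filter push-down for the baseline measurement.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")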
Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
reading Parquet file 136 / 238 15.0 66.5 1.0X
After this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
reading Parquet file 124 / 193 16.5 60.7 1.0X
For the non-pushdown case, the results suggest this patch doesn't affect the normal code path.
I've manually output the `totalRowCount` in `SpecificParquetRecordReaderBase` to see if this patch actually filters the row groups. When running the above benchmark:
After this patch:
`totalRowCount = 0`
Before this patch:
`totalRowCount = 1024000`
Existing tests should pass.
Author: Liang-Chi Hsieh <[email protected]>
Closes #13701 from viirya/vectorized-reader-push-down-filter2.
(cherry picked from commit 19af298)
Signed-off-by: Davies Liu <[email protected]>
accu.register(sparkContext, Some("numRowGroups"))

val df = spark.read.parquet(path).filter("a < 100")
df.foreachPartition(_.foreach(v => accu.add(0)))
what does this test? shouldn't accu always be zero?
In SpecificParquetRecordReaderBase, we look for an accumulator named numRowGroups, if any, and update it with the number of row groups to be read. It is for test purposes only.
Here we force this trivial foreach function to refer to the accumulator without changing it, so the executor side can see it.
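Putting the fragments above together, a sketch of the whole test (the path and data are assumptions; note that AccumulatorV2.register is package-private, so this only compiles inside Spark's own source tree):

import org.apache.spark.util.LongAccumulator

// The reader looks up a registered accumulator named "numRowGroups" and
// adds the number of row groups it actually reads.
val accu = new LongAccumulator
accu.register(spark.sparkContext, Some("numRowGroups"))

val df = spark.read.parquet(path).filter("a < 100")
// accu.add(0) never changes the value; it only makes the closure capture
// the accumulator so it stays reachable and visible on the executor side.
df.foreachPartition(_.foreach(v => accu.add(0)))

// With the filter pushed down, every row group is pruned.
assert(accu.value.longValue == 0)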
…f and docs

## What changes were proposed in this pull request?

Since [SPARK-15639](#13701), `spark.sql.parquet.cacheMetadata` and `PARQUET_CACHE_METADATA` are not used. This PR removes them from SQLConf and docs.

## How was this patch tested?

Pass the existing Jenkins.

Author: Dongjoon Hyun <[email protected]>

Closes #19129 from dongjoon-hyun/SPARK-13656.