[SPARK-20364][SQL] Support Parquet predicate pushdown on columns with dots #17680
Conversation
cc @ash211, @robert3005 and @liancheng. @liancheng, do you mind if I ask you to review this please?

Test build #75932 has finished for PR 17680 at commit
ash211 left a comment
This looks like it fixes the issue I reported (the last test confirms that), but I'm worried it might have caused a regression in pushdown on struct columns.
is there another existing test that checks that pushdown for the `struct.field1` syntax works correctly? I'm not sure how to reference inner fields in a struct column as I don't use them much personally, but I want to make sure that's not broken as a result of this change.
To my knowledge, we don't push down filters with nested columns. Let me check whether we already have the negative case explicitly, and add it if missing.
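For instance, a minimal sketch of such a negative case, assuming the `ParquetFilters.createFilter(schema, filter)` entry point as in Spark 2.x (the call site and placement are illustrative, not the final test):

```scala
import org.apache.spark.sql.sources
import org.apache.spark.sql.types._

// Inside ParquetFilterSuite (same package as the private ParquetFilters object):
// "a.b" refers to field b nested inside struct a, so no Parquet predicate
// should be created for it.
val schema = StructType(Seq(
  StructField("a", StructType(Seq(
    StructField("b", IntegerType))))))
assert(ParquetFilters.createFilter(schema, sources.EqualTo("a.b", 1)).isEmpty)
```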
thanks for the change -- I wasn't sure if predicate pushdown worked on nested columns, and it looks like the added test confirms it does not.
please also add the check for `IS NULL` returning 1 row too, so this is symmetric
Actually, IS NULL is not the problem for users here.

```scala
val path = "/tmp/abc"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NULL").show()
```

```
+--------+
|col.dots|
+--------+
|    null|
+--------+
```

The reason is that Parquet internally produces null permissively if the column does not exist (after we upgraded it to 1.8.2), so the predicate evaluates to true in this case, AFAIK. If this reasoning should be verified, I will look further. But in terms of the output, the issue is not reproduced.
I could add a test for when the record-by-record filter is enabled, though, after stripping the Spark-side filter.
thanks for adding the additional test below
I added a test case so that we make sure it does not push down filters in this case.

cc @davies too.

Test build #75962 has finished for PR 17680 at commit

Test build #75963 has finished for PR 17680 at commit

Test build #75964 has finished for PR 17680 at commit
ash211 left a comment
This looks great!
Now we need someone with merge permissions to review.
nit: pushdown
@ash211, thanks for your approval.

Test build #76006 has finished for PR 17680 at commit
gentle ping @liancheng and @davies.

@liancheng and @davies, if you are not sure about this approach, I could simply avoid pushing down the filters in this case for now. Please let me know.

Any further thoughts on this? It was quite surprising for one of our users, so I wanted to make sure it was fixed in a future Apache release.

gentle ping @liancheng and @davies
friendly ping ... |
|
Are there any comments on this PR or is it ready to be merged? |
|
gentle ping ... |
1 similar comment
|
gentle ping ... |
Instead of duplicating the code, please write a helper function.
We need to explain that the functions in this object `ParquetColumns` are based on the code in `org.apache.parquet.filter2.predicate`. Thus, when upgrading the Parquet version, we need to check whether they are still the same.
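For illustration, a sketch of the kind of note this asks for (the wording and placement are assumed here, not the actual patch):

```scala
package org.apache.spark.sql.execution.datasources.parquet

/**
 * Note that the functions in this object are based on the code in
 * org.apache.parquet.filter2.predicate. Thus, when upgrading the Parquet
 * version, we need to check whether they are still the same.
 */
private[parquet] object ParquetColumns {
  // ... dot-preserving column constructors mirroring FilterApi ...
}
```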
This sounds like it stems from this JIRA: https://issues.apache.org/jira/browse/PARQUET-389 (apache/parquet-java#354), after we upgraded to Parquet 1.8.2. Maybe @rdblue can help us review this PR? Thanks!
@HyukjinKwon This PR only verifies the behavior when column names have dots.
@HyukjinKwon Could we just stop pushing down predicates that involve column names containing dots? It would be a safe/simple fix.
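For illustration, a minimal sketch of that safe fix (the `canMakeFilterOn` helper name is hypothetical, not the existing API): bail out of predicate construction whenever the referenced name contains a dot, so Spark evaluates the filter itself.

```scala
// Hypothetical guard: a dotted name is ambiguous to Parquet's FilterApi
// (it parses as a nested path), so refuse to push such filters down.
def canMakeFilterOn(name: String): Boolean = !name.contains(".")

assert(canMakeFilterOn("coldots"))   // predicate may be pushed down
assert(!canMakeFilterOn("col.dots")) // fall back to Spark-side filtering
```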
Yes, it looks related to that, in particular here: #17680 (comment). To my knowledge, we don't support pushing down filters with nested column access, and we already have this assumption in the existing code. Sure, this PR tries the latter case as described in the PR description. Let me open another one for the former approach (simply avoiding the pushdown) as a simple workaround for now.
Yes. Please open the PR to stop predicate push-down for these corner cases. Will review it when it is done.

Test build #76951 has finished for PR 17680 at commit
…ng dots in the names

## What changes were proposed in this pull request?

This is an alternative workaround that simply avoids predicate pushdown for columns having dots in their names. It is a different approach from #17680. The downside of this PR is that it literally does not push down filters on columns having dots in Parquet files at all (neither at the record level nor at the row-group level), whereas the downside of the approach in that PR is that it does not use Parquet's API properly but in a hacky way to support this case. I assume we prefer a safe way here, using the Parquet API properly, but this does close that PR as we are basically just avoiding the case here. This looks like a simple workaround, and it is probably fine given that the problem is arguably a rather corner case (although it might end up reading whole row groups under the hood, so neither option looks best).

Currently, if there are dots in the column name, predicate pushdown seems to fail in Parquet.

**With dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
+--------+
```

**Without dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```

```
+-------+
|coldots|
+-------+
|      1|
+-------+
```

**After**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
|       1|
+--------+
```

## How was this patch tested?

Unit tests added in `ParquetFilterSuite`.

Author: hyukjinkwon <[email protected]>

Closes #18000 from HyukjinKwon/SPARK-20364-workaround.
…ng dots in the names (cherry picked from commit 8fb3d5c) Signed-off-by: Xiao Li <[email protected]>
@gatorsmile, sorry for not responding, I was on vacation for a few days. Should I still review this even though it is merged?

There's an open PR (#361) to support quoted column names, but the discussion on its merits is ongoing. I don't see a huge benefit to supporting ...

That merged PR is a workaround that simply avoids this case, and we should still deal with this (in Parquet or Spark). I am closing this because I don't think this is going to be merged soon.
This PR enables Parquet predicate pushdown for fields having dots in the names: #27780
## What changes were proposed in this pull request?
Currently, if there are dots in the column name, predicate pushdown seems to fail in Parquet.
**With dots**
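```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
+--------+
```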
**Without dots**
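```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```

```
+-------+
|coldots|
+-------+
|      1|
+-------+
```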
It seems a dot in the column name via `FilterApi` tries to separate the field name by the dot (`ColumnPath` with multiple column paths) whereas the actual column name is `col.dots`. (See FilterApi.java#L71, which calls ColumnPath.java#L44.) I tried to come up with ways to resolve this and came up with the two below (as I could not find a way to keep the dots as they are):
One is simply not to push down filters when there are dots in column names, so that Spark reads everything and filters on the Spark side.
The other way creates Spark's own `FilterApi` for those columns (the Parquet one seems final) so that it always uses a single column path on the Spark side (this seems hacky), as we are not pushing down nested columns currently. So, it looks like we can get a field name via `ColumnPath.get`, not `ColumnPath.fromDotString`, in this way. This PR proposes the latter way because I think we need to be sure that it passes the tests.
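For illustration, a small sketch of that distinction, calling Parquet's `ColumnPath` directly (behavior as I understand it from the sources linked above; assumes parquet-common on the classpath):

```scala
import org.apache.parquet.hadoop.metadata.ColumnPath

// fromDotString splits on dots: a literal column named "col.dots"
// turns into a two-segment, nested-looking path...
val split = ColumnPath.fromDotString("col.dots")
assert(split.toArray.length == 2) // ["col", "dots"]

// ...whereas ColumnPath.get keeps it as a single path segment.
val single = ColumnPath.get("col.dots")
assert(single.toArray.length == 1) // ["col.dots"]
```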
**After**
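```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
|       1|
+--------+
```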
## How was this patch tested?
Existing tests should cover this. Some tests were added in `ParquetFilterSuite.scala`. Manually, I ran the related tests, and Jenkins tests will cover this.