[SPARK-20364][SQL] Disable Parquet predicate pushdown for fields having dots in the names #18000
Conversation
cc @gatorsmile, who reviewed the latter approach in that PR; @cloud-fan, who reviewed some of my previous Parquet-related PRs; @liancheng and @viirya, who I believe are familiar with this code path; @ash211, who found this issue; and @marmbrus, who took action on the JIRA.
Added the link to PARQUET-389 in the JIRA SPARK-20364
How about this?

Parquet does not allow dots in the column name because dots are used as a column path delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts filter predicates with missing columns, so pushing down a filter for a column whose name contains dots can return incorrect results. Thus, we do not push down such filters. See SPARK-20364.
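For reference, a minimal sketch of the guard this comment would document (the helper name is illustrative, not the exact patch):

```scala
// Sketch: refuse to create a Parquet filter predicate for any column whose
// name contains a dot, since Parquet reads "col.dots" as the nested path
// ["col", "dots"] rather than one flat column.
def canMakeFilterOn(name: String): Boolean = !name.contains(".")
```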
Sure, it looks much nicer.
dataTypeOf -> nameTypeMap
private
(this is a nested function)
Now, the fix is much safer for merging to 2.2. I will review the test case later.
Test build #76958 has finished for PR 18000 at commit
In addition, we need to be aware of the limitations on column names in Parquet going forward. See the related PR in Parquet: apache/parquet-java#361
^ Yea, maybe. We need to make sure about Lines 565 to 572 in 3bfb639.
Force-pushed from ae8ea08 to 20adf72.
Test build #76987 has finished for PR 18000 at commit

Test build #76989 has finished for PR 18000 at commit
It seems to me that a missing column is treated like a NULL value. The results will change only for some predicates, e.g., IsNull and IsNotNull. For other predicates, can we still push them down?
Yes, but the problem is that, to my knowledge, Parquet (almost) always evaluates the predicate against NULL when the column name contains dots, because the column path becomes nested (a.b, not `a.b`) in the Parquet filter predicate.

You are right about IsNull. I pointed this out in #17680 (comment): it looks like Parquet (almost) always evaluates it to true on the Parquet side, but the rows are then filtered on the Spark side. So, for input/output it is not an issue in this case, but I believe we should disable pushdown for this case too.
I think this example explains the case:

```scala
val dfs = Seq(
  Seq(Some(1), None).toDF("col.dots"),
  Seq(Some(1L), None).toDF("col.dots"),
  Seq(Some(1.0F), None).toDF("col.dots"),
  Seq(Some(1.0D), None).toDF("col.dots"),
  Seq(true, false).toDF("col.dots"),
  Seq("apple", null).toDF("col.dots"),
  Seq("apple", null).toDF("col.dots")
)
val predicates = Seq(
  "`col.dots` > 0",
  "`col.dots` >= 1L",
  "`col.dots` < 2.0",
  "`col.dots` <= 1.0D",
  "`col.dots` == true",
  "`col.dots` IS NOT NULL",
  "`col.dots` IS NULL"
)
dfs.zip(predicates).zipWithIndex.foreach { case ((df, predicate), i) =>
  val path = s"/tmp/abcd$i"
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path).where(predicate).show()
}
```

```
+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
|    null|
+--------+
```
In any case, I believe we should disable it, because the pushed Parquet filter appears to reference a different column.
Do we need to consider other special characters, e.g., those in apache/parquet-java#361?
I think we should disallow such names when initially loading or writing out, if they are still allowed anywhere.
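If we did disallow such names at write time, the guard could look roughly like this (a sketch only; the helper name and the invalid-character set are assumptions, loosely mirroring Parquet's documented restrictions):

```scala
// Hypothetical write-path guard: reject column names that Parquet
// cannot represent reliably. The character set here is an assumption.
def checkFieldName(name: String): Unit = {
  val invalid = " ,;{}()\n\t="
  require(!name.exists(c => invalid.contains(c)),
    s"Attribute name '$name' contains invalid character(s) among '$invalid'")
}
```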
Don't we need the filterClass check? Why remove it?
"Shouldn't generate filter predicate for $pred"?
We should also check the correctness of the results.
@gatorsmile, @viirya, I addressed your comments. Could you take another look when you have some time?
Test build #77026 has finished for PR 18000 at commit
Instead of checking dots for each predicate recursively in ParquetFilters, we can check Filter.references of the predicate at the top level in ParquetFileFormat, and skip ParquetFilters.createFilter entirely.
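For reference, a sketch of this alternative, using the data source Filter API's references field (which lists every column name a filter touches); the helper name is illustrative, not from the actual patch:

```scala
import org.apache.spark.sql.sources.Filter

// Drop, at the top level, any pushed filter that references a column
// whose name contains a dot, so ParquetFilters.createFilter is never
// called for it.
def pushableFilters(filters: Seq[Filter]): Seq[Filter] =
  filters.filterNot(_.references.exists(_.contains(".")))
```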
Hm, I expect this is a non-critical path that is not executed many times. Also, calling Filter.references -> Filter.findReferences -> Filter.references ... does not look particularly faster. Another downside (maybe nitpicking) is that it would introduce another small code path that returns None for filter creation failure.
Not just for speed, but also for the amount of code that needs to change. But it is OK with me.
We should log something for users, because it might not be obvious to them that predicate pushdown is disabled for columns with dots. Since this is bad for performance, it seems better to let users know what happened.
Yes, it is. However, we don't currently log pushed filters that failed to be created either, AFAIK. Probably we should log those cases across all the sources. If you don't feel strongly about this, I would rather not log here for now.
OK. That makes sense.
Just to make sure: I don't feel strongly about either comment, @viirya. I am willing to fix them if you feel strongly (I already have a fixed version locally). Please let me know.
Sounds OK to me.
Thank you @viirya.
Shall we make $ use UnresolvedAttribute.quotedString?
Yea, actually my initial local version included the change for symbol and $ to match them to Column. It also makes sense per #7969. I believe this is an internal API -

> * considered an internal API to Spark SQL and are subject to change between minor releases.

Nevertheless, I believe some people don't like this change much, and I wanted to avoid such changes here for now (it is only needed in a single place anyway ...).
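For context, a rough sketch of the difference being discussed, assuming Catalyst's UnresolvedAttribute (behavior may differ across versions):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The plain factory splits on dots, treating the name as a nested path...
UnresolvedAttribute("col.dots").nameParts        // Seq("col", "dots")
// ...while `quoted` keeps the whole string as one column name, which is
// what $"col.dots" would need in order to behave like Column("col.dots").
UnresolvedAttribute.quoted("col.dots").nameParts // Seq("col.dots")
```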
A high-level question: is this a Parquet bug, or does Spark not use the Parquet reader correctly?
I would rather say it is a limitation of the Parquet API. It looks like there is no way to properly set column names containing dots in Parquet filters. #17680 suggests a hacky workaround to set this.
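To illustrate the limitation, a sketch against Parquet's public FilterApi (the column name and type are just examples):

```scala
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.Operators.IntColumn

// FilterApi.intColumn splits the given string on dots, so a flat Spark
// column literally named "col.dots" is read as a nested path with two
// segments, col -> dots; there is no public way to escape the dot.
val column: IntColumn = FilterApi.intColumn("col.dots")
// column.getColumnPath now has two path segments ("col", "dots"), so any
// predicate built from it targets a nested field that does not exist in
// the file, and rows never match (or evaluate against NULL).
```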
nit: nameToType
Can we just have a simple end-to-end test? The fix is actually very simple, and it seems not worth such complex tests to verify it.
Sure, I just reverted it and added a simple test.
Force-pushed from c83f1b7 to 1eae64a.
test("SPARK-20364: Disable Parquet predicate pushdown for fields having dots in the names")
Looks much better now.
LGTM
LGTM. Is Parquet going to fix it in the future? Or is there any official way to support filter pushdown for column names with dots?
Based on the discussion in apache/parquet-java#361, it does not sound like Parquet will support it in the short term. We might need to live with this for a long time.
retest this please

test this please

ok to test

Seems Jenkins doesn't work for now.

LGTM pending Jenkins

Test build #77051 has finished for PR 18000 at commit
Thanks! Merging to master/2.2 |
This PR enables Parquet predicate pushdown for fields having dots in the names: #27780
What changes were proposed in this pull request?

This is an alternative workaround that simply avoids predicate pushdown for columns having dots in their names. It is a different approach from #17680. The downside of this PR is that it does not push down filters on such columns in Parquet files at all (neither at the record level nor at the row-group level), whereas the downside of the approach in that PR is that it does not use Parquet's API properly, supporting this case only in a hacky way.

I assume we prefer the safe way here, using the Parquet API properly, but this does close that PR since here we are basically just avoiding the pushdown.

This looks like a simple workaround, and it is probably fine given that the problem is arguably a rather corner case (although it might end up reading whole row groups under the hood, so neither option is ideal).

Currently, if there are dots in the column name, predicate pushdown fails in Parquet.

**With dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
+--------+
```

**Without dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```

```
+-------+
|coldots|
+-------+
|      1|
+-------+
```

**After**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
|       1|
+--------+
```

How was this patch tested?

Unit tests added in ParquetFilterSuite.
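For reference, the end-to-end test could look roughly like this (a sketch assuming ParquetFilterSuite's usual helpers such as withTempPath and checkAnswer; the exact test body is in the PR):

```scala
test("SPARK-20364: Disable Parquet predicate pushdown for fields having dots in the names") {
  withTempPath { path =>
    Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    val readBack = spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL")
    // With pushdown disabled for dotted names, the non-null row survives.
    checkAnswer(readBack, Row(1))
  }
}
```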