Conversation

@HyukjinKwon
Member

## What changes were proposed in this pull request?

This is an alternative workaround that simply avoids predicate pushdown for columns that have dots in their names. It is a different approach from #17680.

The downside of this PR is that it does not push down filters on columns with dots in their names at all (neither at the record level nor at the row-group level), whereas the downside of the approach in that PR is that it uses Parquet's API in a hacky, improper way to support this case.

I assume we prefer the safe way here of using the Parquet API properly, and this effectively closes that PR, since here we are simply avoiding the pushdown.

This looks like a simple workaround, and it is probably fine given that the problem is arguably a corner case (although it might end up reading whole row groups under the hood, so neither option is ideal).

Currently, if there are dots in the column name, predicate pushdown fails in Parquet.

**With dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
+--------+
```

**Without dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```

```
+-------+
|coldots|
+-------+
|      1|
+-------+
```

**After**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
|       1|
+--------+
```
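The fix itself boils down to a simple guard. A minimal sketch (the helper name `canPushDown` is illustrative, not Spark's actual internals): before building a Parquet filter predicate for a column, skip any column whose name contains a dot, since Parquet interprets `a.b` as a nested field path rather than a flat name.

```scala
// Illustrative sketch only; the real check lives inside Spark's ParquetFilters.
// Parquet reads "col.dots" as the nested path col -> dots, so a flat column with
// a dot in its name can never be matched by a pushed-down predicate.
def canPushDown(columnName: String): Boolean = !columnName.contains(".")

println(canPushDown("coldots"))   // true: safe to push down
println(canPushDown("col.dots"))  // false: skip pushdown, evaluate the filter in Spark
```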

## How was this patch tested?

Unit tests added in `ParquetFilterSuite`.

@HyukjinKwon
Member Author

HyukjinKwon commented May 16, 2017

cc @gatorsmile who reviewed the latter way in that PR, @cloud-fan who reviewed some of my previous Parquet related PR, @liancheng and @viirya who I believe are used to this code path, @ash211 who found this issue and @marmbrus who took an action on the JIRA.

Member

Added the link to PARQUET-389 in the JIRA SPARK-20364

Member

@gatorsmile gatorsmile May 16, 2017

How about?

Parquet does not allow dots in the column name because dots are used as a column path delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts filter predicates with missing columns. Incorrect results could be returned from Parquet when we push down filters for a column having dots in its name. Thus, we do not push down such filters. See SPARK-20364.

Member Author

Sure, it looks much nicer.

Member

`dataTypeOf` -> `nameTypeMap`

Member

private

Member Author

(this is a nested function)

@gatorsmile
Member

Now, the fix is much safer for merging to 2.2. Will review the test case later.

@SparkQA

SparkQA commented May 16, 2017

Test build #76958 has finished for PR 18000 at commit e94560e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented May 16, 2017

In addition, we need to know the limitation of column names in Parquet in the future. See the related PR in Parquet: apache/parquet-java#361

@HyukjinKwon
Member Author

HyukjinKwon commented May 16, 2017

^ Yea, maybe. We need to make sure of

```scala
def checkFieldName(name: String): Unit = {
  // ,;{}()\n\t= and space are special characters in Parquet schema
  checkConversionRequirement(
    !name.matches(".*[ ,;{}()\n\t=].*"),
    s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
       |Please use alias to rename it.
     """.stripMargin.split("\n").mkString(" ").trim)
}
```
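Note that a dot is not among the characters this regex rejects, which is exactly why a name like `col.dots` survives schema conversion yet still breaks pushdown. A quick check (plain Scala, no Spark needed; the value name `invalid` is illustrative):

```scala
// The character class from checkFieldName above; dots are conspicuously absent.
val invalid = ".*[ ,;{}()\n\t=].*"

println("col.dots".matches(invalid))  // false: the name passes the check
println("col name".matches(invalid))  // true: a space is rejected
```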

@HyukjinKwon HyukjinKwon reopened this May 16, 2017
@HyukjinKwon HyukjinKwon force-pushed the SPARK-20364-workaround branch 2 times, most recently from ae8ea08 to 20adf72 Compare May 16, 2017 23:19
@SparkQA

SparkQA commented May 17, 2017

Test build #76987 has finished for PR 18000 at commit ae8ea08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 17, 2017

Test build #76989 has finished for PR 18000 at commit 20adf72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

It seems to me that a missing column is treated like a NULL value. The results will change only for some predicates, e.g., IsNull and IsNotNull. For other predicates, can we still push them down?

Member Author

@HyukjinKwon HyukjinKwon May 17, 2017

Yes, but the problem is that it (almost) always evaluates to NULL when a column has dots in its name, because the column path becomes nested (a.b, not `a.b`) in the Parquet filter predicate, to my knowledge.

You are right about IsNull. I pointed this out in #17680 (comment): they (almost) always evaluate it to true on the Parquet side, but it is filtered on the Spark side. So for input/output it is not an issue in this case, but I believe we should disable pushdown for this case too.

I think this example explains the case

```scala
val dfs = Seq(
  Seq(Some(1), None).toDF("col.dots"),
  Seq(Some(1L), None).toDF("col.dots"),
  Seq(Some(1.0F), None).toDF("col.dots"),
  Seq(Some(1.0D), None).toDF("col.dots"),
  Seq(true, false).toDF("col.dots"),
  Seq("apple", null).toDF("col.dots"),
  Seq("apple", null).toDF("col.dots")
)

val predicates = Seq(
  "`col.dots` > 0",
  "`col.dots` >= 1L",
  "`col.dots` < 2.0",
  "`col.dots` <= 1.0D",
  "`col.dots` == true",
  "`col.dots` IS NOT NULL",
  "`col.dots` IS NULL"
)

dfs.zip(predicates).zipWithIndex.foreach { case ((df, predicate), i) =>
  val path = s"/tmp/abcd$i"
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path).where(predicate).show()
}
```

```
+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
+--------+

+--------+
|col.dots|
+--------+
|    null|
+--------+
```

Member Author

In any case, I believe we should disable it, because the pushed Parquet filter appears to refer to another column.

Member

Do we need to consider other special characters, e.g., those in apache/parquet-java#361?

Member Author

I think we should disallow them when initially loading or writing out, if they are still allowed in any way.

Member

Don't we need the check of filterClass? Why remove it?

Member

"Shouldn't generate filter predicate for $pred"?

Member

We should also check correctness of the results.

@HyukjinKwon
Member Author

HyukjinKwon commented May 17, 2017

@gatorsmile, @viirya, I addressed your comments. Could you take another look when you have some time?

@SparkQA

SparkQA commented May 17, 2017

Test build #77026 has finished for PR 18000 at commit c83f1b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Instead of checking dots for each predicate recursively in ParquetFilters, we can check the predicate's Filter.references at the top level in ParquetFileFormat, and skip ParquetFilters.createFilter altogether.
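The alternative suggested above can be sketched as follows. This is a hedged illustration, not Spark's code: `Filter` here is a self-contained stand-in for `org.apache.spark.sql.sources.Filter`, reduced to the one member the idea needs.

```scala
// Minimal stand-in for Spark's source Filter hierarchy.
sealed trait Filter { def references: Array[String] }
case class IsNotNull(attribute: String) extends Filter {
  def references: Array[String] = Array(attribute)
}
case class And(left: Filter, right: Filter) extends Filter {
  def references: Array[String] = left.references ++ right.references
}

// Top-level check in the file format: if any referenced column contains a dot,
// drop the whole filter instead of recursing inside createFilter.
def shouldSkipPushdown(filter: Filter): Boolean =
  filter.references.exists(_.contains("."))

println(shouldSkipPushdown(And(IsNotNull("a"), IsNotNull("col.dots"))))  // true
println(shouldSkipPushdown(IsNotNull("plain")))                          // false
```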

Member Author

Hm, I expect this is a non-critical path and is not executed many times. Also, it does not look particularly faster to call Filter.references -> Filter.findReferences -> Filter.references ... . Another downside (maybe nitpicking) is that this introduces another small code path that returns None on filter-creation failure.

Member

Not just for speed; also for the amount of code that needs to change. But I think it is ok for me.

Member

We should log something for users, because it might not be straightforward for them to know that predicate pushdown is disabled for dot-columns. This is bad for performance, so it seems better to let users know what happened.

Member Author

Yes, it is. However, we don't currently log pushed filters that fail to be created either, AFAIK. Probably we should log those cases across all the sources. If you don't feel strongly about this, I would prefer not to log here for now.

Member

Ok. Sounds reasonable.

@HyukjinKwon
Member Author

HyukjinKwon commented May 18, 2017

Just to make sure: I don't feel strongly about either comment, @viirya. I am willing to fix them if you feel strongly (I already have a fixed version locally). Please let me know.

@viirya
Member

viirya commented May 18, 2017

Sounds ok for me.

@HyukjinKwon
Member Author

Thank you @viirya.

Member

Shall we make `$` use `UnresolvedAttribute.quotedString`?

Member Author

Yea, actually my initial version locally included the change for symbol and $ to match them to Column. It also looks sensible per #7969. I believe this is an internal API -

* considered an internal API to Spark SQL and are subject to change between minor releases.

so I guess it would be fine even if it introduces a behaviour change.

Nevertheless, I believe some people don't like this change much, and I wanted to avoid such changes here for now (it is needed in a single place anyway for now ...).

@cloud-fan
Contributor

A high-level question: is it a Parquet bug, or does Spark not use the Parquet reader correctly?

@HyukjinKwon
Member Author

I would rather say it is a limitation of the Parquet API. It looks like there is no proper way to set column names having dots in Parquet filters. #17680 suggests a hacky workaround to set this.

Contributor

nit: `nameToType`

Contributor

Can we just have a simple end-to-end test? The fix is actually very simple and doesn't seem worth such complex tests to verify it.

Member Author

@HyukjinKwon HyukjinKwon May 18, 2017

Sure, I just reverted it back and made a simple test.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-20364-workaround branch from c83f1b7 to 1eae64a Compare May 18, 2017 06:14
```scala
  }
}

test("SPARK-20364: Disable Parquet predicate pushdown for fields having dots in the names") {
```
Member

Looks much better now.

@viirya
Member

viirya commented May 18, 2017

LGTM

@cloud-fan
Contributor

LGTM. Is Parquet going to fix it in the future? Or is there any official way to support filter pushdown for column names with dots?

@gatorsmile
Member

Based on the discussion in apache/parquet-java#361, it does not sound like Parquet will support it in the short term. We might need to live with it for a long time.

@gatorsmile
Member

retest this please

@gatorsmile
Member

test this please

@gatorsmile
Member

ok to test

@viirya
Member

viirya commented May 18, 2017

Seems Jenkins isn't working right now.

@gatorsmile
Member

LGTM pending Jenkins

@SparkQA

SparkQA commented May 18, 2017

Test build #77051 has finished for PR 18000 at commit 1eae64a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master/2.2

@asfgit asfgit closed this in 8fb3d5c May 18, 2017
asfgit pushed a commit that referenced this pull request May 18, 2017
…ng dots in the names


Author: hyukjinkwon <[email protected]>

Closes #18000 from HyukjinKwon/SPARK-20364-workaround.

(cherry picked from commit 8fb3d5c)
Signed-off-by: Xiao Li <[email protected]>
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
…ng dots in the names

@HyukjinKwon HyukjinKwon deleted the SPARK-20364-workaround branch January 2, 2018 03:42
@dbtsai
Member

dbtsai commented Mar 4, 2020

Parquet predicate pushdown for fields having dots in the names is enabled by #27780.
