@@ -314,6 +314,17 @@ class ParquetFileFormat
SQLConf.CASE_SENSITIVE.key,
sparkSession.sessionState.conf.caseSensitiveAnalysis)

// There are two things to note here.
//
// 1. Dictionary filtering has an issue with predicates on null values. For this reason,
// this filtering is disabled. See SPARK-26677.
//
// 2. We should disable 'parquet.filter.dictionary.enabled', but
// 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
// mistakenly swapped on the Parquet side. Spark should switch to
// 'parquet.filter.dictionary.enabled' when it upgrades Parquet. See PARQUET-1309.
hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
Member

I think we should fix it on the Spark side. Do the swap in Spark. cc @cloud-fan @rxin @rdblue

Member Author

I can do that, but let me wait for other feedback before I do, since it's a Parquet property.

Member

If we want to disable it safely, shall we use set instead of setIfUnset?

Member Author

Yea, I was thinking about that too. I wanted to allow for other possibilities, like users taking that risk and enabling them themselves (partly also since this is going to be backported into Spark 2.4.x).

Both configurations are swapped as of PARQUET-1309, but I was wondering if there might be extreme corner cases where users know about that issue and intentionally enable or disable them. I think both configurations are pretty much for advanced users, and regular users won't set them or hit this issue anyway.
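For readers outside the thread, a minimal sketch of the tradeoff being weighed here, using Hadoop's Configuration API with the key from this PR (the values are illustrative):

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()

// Suppose a user who knows about the PARQUET-1309 swap explicitly opts in.
conf.set("parquet.filter.stats.enabled", "true")

// setIfUnset respects the explicit choice and is a no-op here...
conf.setIfUnset("parquet.filter.stats.enabled", "false")
assert(conf.get("parquet.filter.stats.enabled") == "true")

// ...whereas set would override the user's value unconditionally.
conf.set("parquet.filter.stats.enabled", "false")
assert(conf.get("parquet.filter.stats.enabled") == "false")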

Member Author

WDYT on this @rdblue?

Contributor

I'll make sure the fix for this is in Parquet 1.10.1.

As for fixing this problem, I think that Spark should avoid pushing down these notEquals expressions, or rewrite them as isNull(col) or notEquals(col, "A"). That's going to be much better for performance than disabling dictionary filtering.
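A hedged sketch of the rewrite suggested above, assuming Spark SQL's null-safe equality (<=>) semantics; the column name value comes from this PR's test case, and df stands for any DataFrame read from Parquet:

// NOT (value <=> 'A') keeps exactly the rows where value IS NULL or value <> 'A',
// so it can be rewritten into two simple, individually pushable filters
// (isNull and notEquals) instead of one negated null-safe equality.
val problematic = df.where("NOT (value <=> 'A')")
val rewritten = df.where("value IS NULL OR value <> 'A'") // same result set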

Member Author (@HyukjinKwon, Jan 27, 2019)

Ah, so we're targeting the upgrade to Parquet 1.10.1? Yea, sounds okay to me. Also, that way users can disable parquet.filter.dictionary.enabled explicitly, I guess.

BTW, is it something we should enable by default on the Parquet side, @rdblue? I see there can be a performance improvement, but I was wondering how stable dictionary filtering is.

Contributor (@rdblue, Jan 27, 2019)

Dictionary filtering has been available for more than 2 years and Netflix has been using it by default for almost 3 years without problems. The feature is stable.

Keep in mind how narrow the use case that hits this bug is. First, an entire row group in a file needs to have just one value plus nulls: either the column has just one non-null value for all rows, or a sort or write pattern has clustered null rows with one other value (enough for an entire row group). Next, a query must use null-safe equality and negate it. In my experience, most people don't know what null-safe equality is. Going back to the use cases that produce data like this, the first one (only one non-null value) would probably result in a filter like col IS NULL instead. The second write pattern is the problematic one, but how likely is the intersection of data with that write pattern and a search for all values except one using null-safe equality? I don't think it is likely, and I think that's why we haven't found this until now.
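For anyone unfamiliar with null-safe equality, a minimal Spark SQL illustration (runnable in spark-shell) of why negating it must keep null rows:

spark.sql("SELECT NULL = 'A'").show()           // null: ordinary equality is three-valued
spark.sql("SELECT NULL <=> 'A'").show()         // false: null-safe equality never returns null
spark.sql("SELECT NOT (NULL <=> 'A')").show()   // true: the negation must match null rows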

Member Author

Thanks for all the details. +1 for going to 1.10.1. If that's the plan here, I will close this PR in a couple of days.

Contributor

The vote has started. Feel free to test the new binaries with this repository URI: https://repository.apache.org/content/repositories/orgapacheparquet-1022/
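To summarize the swap discussed in this thread, a sketch using the two constants from parquet-hadoop's ParquetInputFormat; the commented-out call is hypothetical until the PARQUET-1309 fix ships and Spark upgrades:

import org.apache.parquet.hadoop.ParquetInputFormat

// What this PR sets: because of the swap, the "stats" key currently controls
// dictionary filtering on the Parquet side.
hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false") // "parquet.filter.stats.enabled"

// Intended form after the upgrade, per the comment in the diff above:
// hadoopConf.setIfUnset(ParquetInputFormat.DICTIONARY_FILTERING_ENABLED, "false") // "parquet.filter.dictionary.enabled"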


ParquetWriteSupport.setSchema(requiredSchema, hadoopConf)

// Sets flags for `ParquetToSparkSchemaConverter`
@@ -890,6 +890,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
}
}
}

test("SPARK-26677: negated null-safe equality comparison should not filter matched row groups") {
(true :: false :: Nil).foreach { vectorized =>
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
withTempPath { path =>
// Repeated values for dictionary encoding.
Seq(Some("A"), Some("A"), None).toDF.repartition(1)
.write.parquet(path.getAbsolutePath)
val df = spark.read.parquet(path.getAbsolutePath)
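// stripSparkFilter removes Spark's own Filter operator, so the result reflects
// only Parquet-level row-group filtering; no rows should be dropped here.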
checkAnswer(stripSparkFilter(df.where("NOT (value <=> 'A')")), df)
}
}
}
}

}

object TestingUDT {