Conversation

@yucai
Contributor

@yucai yucai commented Aug 23, 2018

What changes were proposed in this pull request?

Currently, filter pushdown does not work if the Parquet schema and the Hive metastore schema differ in letter case, even when spark.sql.caseSensitive is false.

For example:

spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)
spark.range(1, 40 * 1024 * 1024, 1, 1).sortWithinPartitions("id").write.parquet("/tmp/t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/t'")
sql("select * from t where id < 100L").write.csv("/tmp/id")

Although the filter "ID < 100L" is generated by Spark, it is not actually pushed down into Parquet, and Spark still does a full table scan when reading.
This PR provides case-insensitive field resolution to make it work.
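The intended resolution behavior can be sketched standalone (a simplified sketch with plain types, not the actual Spark internals): match the filter's column name against the physical Parquet field names case-insensitively, and refuse ambiguous matches.

  object CaseInsensitiveResolution {
    // Resolve a filter column against the physical Parquet field names.
    def resolve(
        parquetFieldNames: Seq[String],
        filterColumn: String,
        caseSensitive: Boolean): Option[String] = {
      if (caseSensitive) {
        parquetFieldNames.find(_ == filterColumn)
      } else {
        parquetFieldNames.filter(_.equalsIgnoreCase(filterColumn)) match {
          case Seq(unique) => Some(unique) // unambiguous: push down using the physical name
          case _           => None         // missing or ambiguous: skip pushdown
        }
      }
    }

    def main(args: Array[String]): Unit = {
      // "ID" from the Hive metastore schema resolves to the physical field "id".
      println(resolve(Seq("id"), "ID", caseSensitive = false))    // Some(id)
      // Duplicate physical fields "a"/"A" make "A" ambiguous, so pushdown is skipped.
      println(resolve(Seq("a", "A"), "A", caseSensitive = false)) // None
    }
  }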

Before - "ID < 100L" fail to pushedown:
screen shot 2018-08-23 at 10 08 26 pm
After - "ID < 100L" pushedown sucessfully:
screen shot 2018-08-23 at 10 08 40 pm

How was this patch tested?

Added UTs.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95147 has finished for PR 22197 at commit 5902afe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95149 has finished for PR 22197 at commit 2226eae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  // converted (`ParquetFilters.createFilter` returns an `Option`). That's why a `flatMap`
  // is used here.
- .flatMap(parquetFilters.createFilter(parquetSchema, _))
+ .flatMap(parquetFilters.createFilter(parquetSchema, _, isCaseSensitive))
Contributor

can we pass this config when creating ParquetFilters?

Contributor Author

Yes, that way is better.
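A hedged sketch of that direction (the constructor parameters are taken from the signature that appears later in this thread):

  // Pass the case-sensitivity flag once, at construction time, instead of
  // threading it through every createFilter call.
  val parquetFilters = new ParquetFilters(
    conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    conf.parquetFilterPushDownInFilterThreshold, isCaseSensitive)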

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95247 has finished for PR 22197 at commit c76189d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assertResult(None) {
  caseInsensitiveParquetFilters.createFilter(
    dupParquetSchema, sources.EqualTo("CINT", 1000))
}
Member

Can we add one negative test that has duplicate names in case-insensitive mode, for example, cInt and CINT, and check if that throws an exception?

Contributor Author

Added, thanks!


/**
- * Returns a map from name of the column to the data type, if predicate push down applies.
+ * Returns nameMap and typeMap based on different case sensitive mode, if predicate push
+ * down applies.
Contributor

instead of returning 2 maps, can we just add an originalName field to ParquetSchemaType?

Member

+1 for avoiding returning 2 maps if possible.

Contributor Author

Great idea!

@yucai
Contributor Author

yucai commented Aug 25, 2018

@cloud-fan @HyukjinKwon It seems we cannot simply add originalName into ParquetSchemaType, because we need the exact ParquetSchemaType values for type matching, like:

  private val ParquetByteType = ParquetSchemaType(INT_8, INT32, 0, null)
  private val ParquetShortType = ParquetSchemaType(INT_16, INT32, 0, null)
  private val ParquetIntegerType = ParquetSchemaType(null, INT32, 0, null)
  ...
  private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = {
    case ParquetByteType | ParquetShortType | ParquetIntegerType =>

Instead, I use a new case class ParquetField to collapse the two maps into one.

  private case class ParquetField(
      name: String,
      schema: ParquetSchemaType)

Let me know if you are OK with this way.
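For illustration, a standalone sketch of the single-map idea (ParquetSchemaType is stubbed here; the real one wraps Parquet's type objects):

  import java.util.Locale

  case class ParquetSchemaType(originalType: String, primitiveType: String) // stub
  case class ParquetField(name: String, schema: ParquetSchemaType)

  // Key the map by the lowercased name in case-insensitive mode, but keep the
  // original parquet name inside ParquetField so the pushed-down filter uses it.
  def getFieldMap(
      fields: Seq[(String, ParquetSchemaType)],
      caseSensitive: Boolean): Map[String, ParquetField] = {
    fields.map { case (name, tpe) =>
      val key = if (caseSensitive) name else name.toLowerCase(Locale.ROOT)
      key -> ParquetField(name, tpe)
    }.toMap
  }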

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95253 has finished for PR 22197 at commit 10cd89f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
// Don't consider ambiguity here, i.e. when more than one field is matched in
// case-insensitive mode, just skip pushdown for these fields; they will trigger
// an exception when reading. See SPARK-25132.
Member

If we don't need to consider ambiguity, can't we just lowercase f.getName above instead of doing dedup here?

Contributor Author

It is a good question!

Consider the scenario below:

  1. The parquet file has duplicate fields "a INT, A INT".
  2. The user wants to push down "A > 0".

Without dedup, we might push down "a > 0" instead of "A > 0". Although that is wrong, it will still trigger the exception eventually when reading the parquet file, so with or without dedup we get the same result.
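The dedup shape under discussion could look roughly like this (a simplified sketch with plain types, not the merged code):

  import java.util.Locale

  // Group physical fields by lowercased name and keep only unambiguous ones;
  // ambiguous names drop out of the map, so pushdown is silently skipped for them.
  def dedupFields(fields: Seq[(String, String)]): Map[String, String] = {
    fields
      .groupBy { case (name, _) => name.toLowerCase(Locale.ROOT) }
      .collect { case (key, Seq((_, tpe))) => key -> tpe }
  }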

@cloud-fan , @gatorsmile any idea?

Contributor

can we do the dedup before parquet filter pushdown and parquet column pruning? Then we can simplify the code in both cases.

Contributor

ping @yucai

Contributor Author

@cloud-fan, it is a great idea, thanks!
Strictly speaking, it is not a "dedup" before pushdown and pruning; rather, we should clip the parquet schema before pushdown and pruning.
If duplicate fields are detected, throw an exception.
If not, pass the clipped parquet schema to the parquet lib via the Hadoop conf.

    catalystRequestedSchema = {
      val conf = context.getConfiguration
      val schemaString = conf.get(ParquetReadSupport.SPARK_ROW_REQUESTED_SCHEMA)
      assert(schemaString != null, "Parquet requested schema not set.")
      StructType.fromString(schemaString)
    }

    val caseSensitive = context.getConfiguration.getBoolean(SQLConf.CASE_SENSITIVE.key,
      SQLConf.CASE_SENSITIVE.defaultValue.get)
    val parquetRequestedSchema = ParquetReadSupport.clipParquetSchema(
      context.getFileSchema, catalystRequestedSchema, caseSensitive)

I am trying this way, will update soon.

val caseInsensitiveParquetFilters =
  new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
Member

nit: add a method like:

def createParquetFilter(caseSensitive: Boolean) = {
  new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    conf.parquetFilterPushDownInFilterThreshold, caseSensitive = caseSensitive)
}

Contributor Author

Good idea, thanks!

testCaseInsensitiveResolution(
  schema,
  FilterApi.gtEq(intColumn("cint"), 1000: Integer),
  sources.GreaterThanOrEqual("CINT", 1000))
Member

nit: maybe we don't need to test against so many predicates. We just want to make sure case-insensitive resolution works.

Contributor Author

Each test corresponds to one line of code changed in createFilter. For example:

      case sources.IsNull(name) if canMakeFilterOn(name, null) =>
        makeEq.lift(fieldMap(name).schema).map(_(fieldMap(name).name, null))

All tests together cover all my changes in createFilter.

}

test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet" +
" - exception when duplicate fields in case-insensitive mode") {
Member

nit: We can have just "exception when duplicate fields in case-insensitive mode" as the test title. The original one is too verbose.

caseSensitive: Boolean) {

private case class ParquetField(
name: String,
Member

resolvedName? This name and the name in the schema look confusing in the following code.

@gatorsmile
Member

@dongjoon-hyun Do you think we face the same issue in ORC?

 */
def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
- val nameToType = getFieldMap(schema)
+ val nameToParquet = getFieldMap(schema)
Member

-> nameToParquetField

@gatorsmile
Member

This PR is basically trying to resolve case sensitivity when the logical schema and the physical schema do not match. This sounds like a general issue in all the data sources. Could any of you do us a favor and check whether all the built-in data sources respect the conf spark.sql.caseSensitive in this case?

pushDownInFilterThreshold: Int,
caseSensitive: Boolean) {

private case class ParquetField(
Member

Add a description of these two fields? It is confusing for a future code maintainer what resolvedName is.

@yucai
Contributor Author

yucai commented Aug 26, 2018

@gatorsmile I can help check spark.sql.caseSensitive for all the built-in data sources.

@SparkQA

SparkQA commented Aug 26, 2018

Test build #95256 has finished for PR 22197 at commit 1ea94cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2018

Test build #95258 has finished for PR 22197 at commit 10c437e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2018

Test build #95257 has finished for PR 22197 at commit 90b8717.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai
Contributor Author

yucai commented Aug 26, 2018

retest this please

@SparkQA

SparkQA commented Aug 26, 2018

Test build #95264 has finished for PR 22197 at commit 10c437e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 27, 2018

Thanks, @yucai . Could you rebase your code to master branch and update the PR description? Also please update SPARK-25207 together.

According to your example, this issue is a general regression introduced in Spark 2.4; it is not specific to the schema-mismatch case. For example, in the following schema-matched case, the input size was less than or equal to 8.0 MB in Spark 2.3.1, but current master seems to show the following.

spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)
spark.range(1, 40 * 1024 * 1024, 1, 1).sortWithinPartitions("id").write.mode("overwrite").parquet("/tmp/t")
sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/t'")
// It should be less than or equal to 8MB.
sql("select * from t where id < 100L").show()
// It's already less than or equal to 8MB.
sql("select * from t where id < 100L").write.mode("overwrite").csv("/tmp/id")

(screenshot of the input size metrics in current master)

Also, if you don't mind, could you update the PR description? This PR doesn't generate new filters; it only changes the field resolution logic. With and without this PR, the filters exist and filter push-down occurs. If I'm wrong, please correct me.

- No filter will be pushed down.
+ Wrong filters will be pushed down.

}

test("SPARK-25207: exception when duplicate fields in case-insensitive mode") {
withTempDir { dir =>
Member

nit: withTempPath

@dongjoon-hyun
Member

@gatorsmile . I don't think we have this regression in the ORC data source.
However, there was another JIRA report, SPARK-25175, 5 days ago. There were not many details in it, so I'm still monitoring that issue.

@yucai
Contributor Author

yucai commented Aug 27, 2018

@dongjoon-hyun In the schema-matched case you listed, it is the expected behavior in current master.

spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)
spark.range(1, 40 * 1024 * 1024, 1, 1).sortWithinPartitions("id").write.mode("overwrite").parquet("/tmp/t")
sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/t'")

// master and 2.3 have different plans for the top limit (see below); that's why 28.4 MB is read in master
sql("select * from t where id < 100L").show()

This difference was probably introduced by #21573. @cloud-fan, current master reads more data than 2.3 for the top limit, as in #22197 (comment); is it a regression or not?

Master: (screenshot of the query plan)

2.3 branch: (screenshot of the query plan)

@gatorsmile
Member

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name c1 in the table definition.
	at org.apache.hadoop.hive.ql.metadata.Table.validateColumns(Table.java:952)
	at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:216)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:495)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
	... 88 more

@cloud-fan
Contributor

Is it acceptable?

apparently not...

OK, let's just check duplicated field names twice: once in filter pushdown and once in column pruning. And clean it up in follow-up PRs.
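A shared helper for that duplicate check might look like this (a hypothetical sketch, not the PR's code):

  import java.util.Locale

  // Fail fast when two physical fields collide under case-insensitive resolution;
  // both the pushdown path and the pruning path could call this.
  def assertNoDuplicateFields(names: Seq[String], caseSensitive: Boolean): Unit = {
    val keys = if (caseSensitive) names else names.map(_.toLowerCase(Locale.ROOT))
    val dups = keys.groupBy(identity).collect { case (k, vs) if vs.size > 1 => k }
    require(dups.isEmpty,
      s"Found duplicate field(s) ${dups.mkString(", ")} in case-insensitive mode")
  }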

@yucai
Contributor Author

yucai commented Aug 30, 2018

@dongjoon-hyun Sorry for the late response; the description has been changed to:

Although the filter "ID < 100L" is generated by Spark, it is not actually pushed down into Parquet, and Spark still does a full table scan when reading.
This PR provides case-insensitive field resolution to make it work.

Let me know if you have any suggestion :).

@yucai
Contributor Author

yucai commented Aug 30, 2018

@cloud-fan I reverted to the previous version.

@SparkQA

SparkQA commented Aug 30, 2018

Test build #95454 has finished for PR 22197 at commit 04b88c5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2018

Test build #95449 has finished for PR 22197 at commit cb03fb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai
Contributor Author

yucai commented Aug 30, 2018

retest this please


withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
  val e = intercept[SparkException] {
    sql(s"select a from $tableName where b > 0").collect()
Contributor

can we read this table in case-sensitive mode?

Contributor Author

Yes, we can, see below.

val tableName = "test"
val tableDir = "/tmp/data"
spark.conf.set("spark.sql.caseSensitive", true)
spark.range(10).selectExpr("id as A", "2 * id as B", "3 * id as b").write.mode("overwrite").parquet(tableDir)
sql(s"DROP TABLE $tableName")
sql(s"CREATE TABLE $tableName (A LONG, B LONG) USING PARQUET LOCATION '$tableDir'")
scala> sql("select A from test where B > 0").show
+---+
|  A|
+---+
|  7|
|  8|
|  9|
|  2|
|  3|
|  4|
|  5|
|  6|
|  1|
+---+

Let me add one test case.
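A hedged sketch of that test, reusing this suite's helpers (withSQLConf, withTempPath, withTable, checkAnswer, and Row are assumed to come from the Spark test base classes and imports):

  withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
    withTempPath { dir =>
      val path = dir.getCanonicalPath
      spark.range(10).selectExpr("id as A", "2 * id as B", "3 * id as b")
        .write.parquet(path)
      withTable("test") {
        sql(s"CREATE TABLE test (A LONG, B LONG) USING PARQUET LOCATION '$path'")
        // In case-sensitive mode the filter column B matches the physical field B
        // exactly, so the read succeeds even with the duplicate column b present.
        checkAnswer(sql("select A from test where B > 0"), (1L until 10L).map(Row(_)))
      }
    }
  }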

Member

nit: to be consistent with the following query, I'd make this query select A from $tableName where B > 0 too.

@SparkQA

SparkQA commented Aug 30, 2018

Test build #95455 has finished for PR 22197 at commit 04b88c5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2018

Test build #95456 has finished for PR 22197 at commit 41a7b83.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Aug 30, 2018

Test build #95462 has finished for PR 22197 at commit 41a7b83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai
Contributor Author

yucai commented Aug 31, 2018

@cloud-fan, tests have passed. I will use a follow-up PR to make it cleaner.

caseSensitive: Boolean) {

private case class ParquetField(
// field name in parquet file
Member

I'd just move those into the doc for this case class above, for instance,

/**
 * blabla
 * @param blabla
 */
private case class ParquetField
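Filled in, that doc might read like this (wording assumed, not the merged text):

  /**
   * Holds a single field's resolution result.
   *
   * @param name   the field name as it appears in the parquet file; pushed-down
   *               filters must be built with this physical name
   * @param schema the field's ParquetSchemaType, used to select the filter factory
   */
  private case class ParquetField(
      name: String,
      schema: ParquetSchemaType)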

@HyukjinKwon
Member

Seems fine to me too.

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95517 has finished for PR 22197 at commit e0d6196.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@viirya
Member

viirya commented Aug 31, 2018

One minor comment that can be addressed in a follow-up PR. LGTM.

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95524 has finished for PR 22197 at commit e0d6196.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
