[SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output #38397
Conversation
rednaxelafx
left a comment
LGTM, thanks a lot for finding and fixing this!
...core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
dongjoon-hyun
left a comment
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
```
options.get(FileFormat.OPTION_RETURNING_BATCH)
  .getOrElse {
    throw new IllegalArgumentException(
      "OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
```
nit. Add one space at the end?
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat. " +
```
.getOrElse {
  throw new IllegalArgumentException(
    "OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
    "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false.")
```
Is this a correct recommendation? Why not recommend setting OPTION_RETURNING_BATCH?
@dongjoon-hyun passing OPTION_RETURNING_BATCH is something the developer of the code that calls this without setting the option can do. For an end user who hits this issue through a code path that doesn't set it, the workaround is to disable this config. Hence it's called a "workaround", not a "fix".
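For reference, a minimal sketch of that end-user workaround; the ORC config is the one named in the error message above, and the analogous Parquet setting is an assumption for the Parquet-side message:
```
// End-user workaround sketch: disable the vectorized reader so the scan
// falls back to row-based output. The ORC config is quoted in the error
// message above; the Parquet one is the analogous setting (assumed here).
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
```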
```
options.get(FileFormat.OPTION_RETURNING_BATCH)
  .getOrElse {
    throw new IllegalArgumentException(
      "OPTION_RETURNING_BATCH should always be set for ParquetFileFormat." +
```
Ditto. nit. Add one more space at the end of the message.
Given that OrcFileFormat has no issue like _metadata columns, I'm wondering why the title implies there is an issue in Orc. I didn't find a proper explanation of the ORC issue in the PR description either.
Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
Could you elaborate more on the ORC case with `!WholeStageCodegenExec.isTooManyFields(conf, schema)` in the PR description, @juliuszsompolski?
@dongjoon-hyun I think OrcFileFormat has exactly the same issue as ParquetFileFormat, as @cloud-fan pointed out? When there was a column like
dongjoon-hyun
left a comment
Ah, you are right. I misread the context. Thank you, @juliuszsompolski.
Thanks. I updated the title after fixing Orc, but forgot to update the description, which was still describing only Parquet.
thanks, merging to master!
unfortunately it conflicts with 3.3, @juliuszsompolski could you open a backport PR? thanks!
[SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
### What changes were proposed in this pull request?
We move the WSCG-based decision about producing columnar output one level up, from ParquetFileFormat / OrcFileFormat to FileSourceScanExec, and pass it down as a new required option for ParquetFileFormat / OrcFileFormat. The semantics are now as follows:
* `ParquetFileFormat.supportBatch` and `OrcFileFormat.supportBatch` return whether the format **can** produce columnar output, not that it necessarily **will**.
* To get columnar output, the option `FileFormat.OPTION_RETURNING_BATCH` needs to be passed to `buildReaderWithPartitionValues` in these two file formats. It should only be set to `true` if `supportBatch` is also `true`, but it can be set to `false` if we don't want columnar output regardless. This way `FileSourceScanExec` can set it to `false` when there are more than 100 columns for WSCG, and `ParquetFileFormat` / `OrcFileFormat` don't have to concern themselves with WSCG limits (see the sketch after this list).
* To avoid it not being passed by accident, this option is made required. Making it required means updating a few call sites, but the error that results from forgetting it is very obscure; it's better to fail early and explicitly here.
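For illustration, a minimal sketch of how a caller such as `FileSourceScanExec` can pass the new option. The surrounding variable names (`options`, `pushedDownFilters`, `requiredSchema`, `hadoopConf`) are the ones used in the snippets later in this description; the exact wiring in Spark may differ slightly:
```
// Sketch of the caller side (assumed wiring): the scan decides once, based on
// its full output schema including metadata columns, and hands the decision
// to the file format via the now-required option.
val readerOptions: Map[String, String] =
  options + (FileFormat.OPTION_RETURNING_BATCH -> supportsColumnar.toString)

relation.fileFormat.buildReaderWithPartitionValues(
  sparkSession = relation.sparkSession,
  dataSchema = relation.dataSchema,
  partitionSchema = relation.partitionSchema,
  requiredSchema = requiredSchema,
  filters = pushedDownFilters,
  options = readerOptions,
  hadoopConf = hadoopConf)
```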
### Why are the changes needed?
The following explains the issue for `ParquetFileFormat`; `OrcFileFormat` had exactly the same issue.
`java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow` was being thrown because ParquetReader was outputting columnar batches, while FileSourceScanExec expected row output.
The mismatch comes from the fact that `ParquetFileFormat.supportBatch` depends on `WholeStageCodegenExec.isTooManyFields(conf, schema)`, where the threshold is 100 fields.
When this is used in `FileSourceScanExec`:
```
override lazy val supportsColumnar: Boolean = {
relation.fileFormat.supportBatch(relation.sparkSession, schema)
}
```
the `schema` comes from output attributes, which includes extra metadata attributes.
However, inside `ParquetFileFormat.buildReaderWithPartitionValues` it was calculated again as
```
relation.fileFormat.buildReaderWithPartitionValues(
  sparkSession = relation.sparkSession,
  dataSchema = relation.dataSchema,
  partitionSchema = relation.partitionSchema,
  requiredSchema = requiredSchema,
  filters = pushedDownFilters,
  options = options,
  hadoopConf = hadoopConf)
...
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
...
val returningBatch = supportBatch(sparkSession, resultSchema)
```
Where `requiredSchema` and `partitionSchema` wouldn't include the metadata columns:
```
FileSourceScanExec: output: List(c1#4608L, c2#4609L, ..., c100#4707L, file_path#6388)
FileSourceScanExec: dataSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
FileSourceScanExec: partitionSchema: StructType()
FileSourceScanExec: requiredSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
```
Columns like `file_path#6388` are added by the scan and contain metadata produced by the scan, not by the file reader, which concerns itself only with what is inside the file.
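For illustration, a hedged reproduction sketch of the failure mode before this fix, modeled on the log output above (the table layout is an assumption, not taken from the PR): 100 data columns keep `requiredSchema` within the WSCG field limit, so the reader chose batches, while the extra `_metadata.file_path` column pushes the scan's output to 101 fields, so `FileSourceScanExec` expected rows:
```
// Assumed reproduction sketch (not taken verbatim from the PR):
// 100 data columns plus the hidden _metadata.file_path column.
import org.apache.spark.sql.functions.col

val df = spark.range(10).select((1 to 100).map(i => col("id").as(s"c$i")): _*)
df.write.mode("overwrite").parquet("/tmp/spark40918")

spark.read.parquet("/tmp/spark40918")
  .select(col("*"), col("_metadata.file_path"))
  .collect()  // before the fix: ClassCastException, ColumnarBatch cannot be cast to InternalRow
```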
### Does this PR introduce _any_ user-facing change?
Not a public API change, but it is now required to pass `FileFormat.OPTION_RETURNING_BATCH` in `options` to `ParquetFileFormat.buildReaderWithPartitionValues`. The only user of this API in Apache Spark is `FileSourceScanExec`.
### How was this patch tested?
Tests added
Closes apache#38397 from juliuszsompolski/SPARK-40918.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
3.3 PR: #38431
[SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
Backports #38397 from juliuszsompolski/SPARK-40918.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Closes #38431 from juliuszsompolski/SPARK-40918-3.3.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>