Conversation

@juliuszsompolski (Contributor) commented Oct 28, 2022

### What changes were proposed in this pull request?

We move the decision about supporting columnar output based on WSCG one level up, from `ParquetFileFormat` / `OrcFileFormat` to `FileSourceScanExec`, and pass it back down as a new required option for `ParquetFileFormat` / `OrcFileFormat`. The semantics are now as follows:

* `ParquetFileFormat.supportsBatch` and `OrcFileFormat.supportsBatch` return whether the format **can** produce columnar output, not whether it necessarily **will**.
* To get columnar output, the option `FileFormat.OPTION_RETURNING_BATCH` needs to be passed to `buildReaderWithPartitionValues` in these two file formats. It should only be set to `true` if `supportsBatch` is also `true`, but it can be set to `false` if columnar output is not wanted anyway; this way `FileSourceScanExec` can set it to `false` when there are more than 100 columns for WSCG, and `ParquetFileFormat` / `OrcFileFormat` don't have to concern themselves with WSCG limits (see the sketch after this list).
* To avoid the option being omitted by accident, it is made required. That means updating the few places that call these readers, but the error that results from forgetting to pass it is very obscure, so it is better to fail early and explicitly.
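
To make the new contract concrete, here is a minimal sketch of the scan-side flow, assuming the helper name and surrounding plumbing; only `FileFormat.OPTION_RETURNING_BATCH`, `supportBatch` and the `buildReaderWithPartitionValues` signature come from this PR and the excerpts below, the rest is illustrative:

```
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{FileFormat, HadoopFsRelation, PartitionedFile}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Illustrative sketch, not the literal FileSourceScanExec change: the scan
// decides once whether it wants columnar batches and passes that decision
// down explicitly, so the file format never re-derives it from a narrower schema.
def buildReader(
    relation: HadoopFsRelation,
    scanOutputSchema: StructType,          // built from the scan's output attributes
    requiredSchema: StructType,
    pushedDownFilters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
  // Whether the format *can* return batches for what the scan will actually output.
  val wantsBatches =
    relation.fileFormat.supportBatch(relation.sparkSession, scanOutputSchema)
  relation.fileFormat.buildReaderWithPartitionValues(
    sparkSession = relation.sparkSession,
    dataSchema = relation.dataSchema,
    partitionSchema = relation.partitionSchema,
    requiredSchema = requiredSchema,
    filters = pushedDownFilters,
    // The new, required option: the reader no longer derives this itself.
    options = options + (FileFormat.OPTION_RETURNING_BATCH -> wantsBatches.toString),
    hadoopConf = hadoopConf)
}
```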

### Why are the changes needed?

The following explains the issue for `ParquetFileFormat`; `OrcFileFormat` had exactly the same problem.

`java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow` was being thrown because the Parquet reader was returning columnar batches, while `FileSourceScanExec` expected row output.

The mismatch comes from the fact that `ParquetFileFormat.supportBatch` depends on `WholeStageCodegenExec.isTooManyFields(conf, schema)`, where the threshold is 100 fields.
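
For orientation, here is a rough approximation of that check; this is not the actual `isTooManyFields` implementation (which also counts fields nested in structs, arrays and maps), only the `spark.sql.codegen.maxFields` key and its default of 100 are real:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Rough, illustrative approximation of the WSCG field limit that supportBatch
// relies on; spark.sql.codegen.maxFields defaults to 100.
def tooManyFieldsForWSCG(spark: SparkSession, schema: StructType): Boolean = {
  val maxFields = spark.conf.get("spark.sql.codegen.maxFields", "100").toInt
  schema.fields.length > maxFields
}
```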

When this is used in `FileSourceScanExec`:

```
  override lazy val supportsColumnar: Boolean = {
      relation.fileFormat.supportBatch(relation.sparkSession, schema)
  }
```

the `schema` comes from the output attributes, which include the extra metadata attributes.

However, inside `ParquetFileFormat.buildReaderWithPartitionValues` it was calculated again as

```
      relation.fileFormat.buildReaderWithPartitionValues(
        sparkSession = relation.sparkSession,
        dataSchema = relation.dataSchema,
        partitionSchema = relation.partitionSchema,
        requiredSchema = requiredSchema,
        filters = pushedDownFilters,
        options = options,
        hadoopConf = hadoopConf
...
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
...
val returningBatch = supportBatch(sparkSession, resultSchema)
```

Where `requiredSchema` and `partitionSchema` wouldn't include the metadata columns:

```
FileSourceScanExec: output: List(c1#4608L, c2#4609L, ..., c100#4707L, file_path#6388)
FileSourceScanExec: dataSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
FileSourceScanExec: partitionSchema: StructType()
FileSourceScanExec: requiredSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
```

Columns like `file_path#6388` are added by the scan and contain metadata produced by the scan itself, not by the file reader, which only concerns itself with what is inside the file.
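
As context, a hedged repro sketch of the failing scenario; the column count, the temp path and the use of the hidden `_metadata.file_path` column are assumptions based on the attribute dump above, not the test added by this PR:

```
import org.apache.spark.sql.SparkSession

// Hypothetical repro: 100 data columns keep the reader-side schema exactly at
// the WSCG limit, while selecting the hidden _metadata.file_path column pushes
// the scan-side output to 101 attributes, so the two sides used to disagree
// about columnar output and the collect() below threw the ClassCastException.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val cols = (1 to 100).map(i => $"id".as(s"c$i"))
spark.range(10).select(cols: _*)
  .write.mode("overwrite").parquet("/tmp/spark40918_repro")

spark.read.parquet("/tmp/spark40918_repro")
  .select($"*", $"_metadata.file_path")  // metadata column added by the scan, not the file reader
  .collect()
```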

### Does this PR introduce _any_ user-facing change?

Not a public API change, but it is now required to pass `FileFormat.OPTION_RETURNING_BATCH` in `options` to `ParquetFileFormat.buildReaderWithPartitionValues`. The only user of this API in Apache Spark is `FileSourceScanExec`.
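
For any hypothetical external caller of this internal API, the required change is just an extra entry in the options map; passing `"false"` keeps row output even when the format could produce batches:

```
// Hypothetical caller-side change: the options map must now carry the key,
// otherwise ParquetFileFormat.buildReaderWithPartitionValues fails early.
val readerOptions: Map[String, String] =
  options + (FileFormat.OPTION_RETURNING_BATCH -> "false")
```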

### How was this patch tested?

Tests added

Backports #38397 from juliuszsompolski/SPARK-40918.

Authored-by: Juliusz Sompolski [email protected]
Signed-off-by: Wenchen Fan [email protected]

@juliuszsompolski (Contributor, author) commented:

@cloud-fan

@github-actions github-actions bot added the SQL label Oct 28, 2022
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output [SPARK-40918][SQL][3.3] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output Oct 28, 2022
@cloud-fan (Contributor) commented:

thanks, merging to 3.3!

cloud-fan pushed a commit that referenced this pull request Oct 31, 2022
[SPARK-40918][SQL][3.3] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

Closes #38431 from juliuszsompolski/SPARK-40918-3.3.

Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan cloud-fan closed this Oct 31, 2022
@juliuszsompolski (Contributor, author) commented:

> all tests passed actually: https://github.com/juliuszsompolski/apache-spark/runs/9189880361

Yeah, they did in all three runs, but three times in a row it didn't update the GitHub status...
