[SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output #38397
Conversation
rednaxelafx
left a comment
LGTM, thanks a lot for finding and fixing this!
...core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
dongjoon-hyun
left a comment
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
```
options.get(FileFormat.OPTION_RETURNING_BATCH)
  .getOrElse {
    throw new IllegalArgumentException(
      "OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
```
nit. Add one space at the end?
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat. " +
```
.getOrElse {
  throw new IllegalArgumentException(
    "OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
    "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false.")
```
Is this a correct recommendation? Why not recommend setting OPTION_RETURNING_BATCH?
@dongjoon-hyun passing OPTION_RETURNING_BATCH is something the developer of the code that calls this without setting the option can do. For an end user who hits this issue through a code path that doesn't set it, the workaround is to disable this config. Hence it's called a "workaround", not a "fix".
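For reference, a minimal sketch of that end-user workaround; the ORC config is the one named in the error message above, and the analogous Parquet setting is an assumption for the Parquet-side message:
```
// End-user workaround sketch: disable the vectorized reader so the scan
// falls back to row-based output. The ORC config is quoted in the error
// message above; the Parquet one is the analogous setting (assumed here).
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
```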
```
options.get(FileFormat.OPTION_RETURNING_BATCH)
  .getOrElse {
    throw new IllegalArgumentException(
      "OPTION_RETURNING_BATCH should always be set for ParquetFileFormat." +
```
Ditto. nit. Add one more space at the end of the message.
Given that OrcFileFormat has no issue like _metadata columns, I'm wondering why the title implies there is an issue in Orc. I didn't find a proper explanation of the ORC issue in the PR description either.
Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
Could you elaborate more on the ORC case with `!WholeStageCodegenExec.isTooManyFields(conf, schema)` in the PR description, @juliuszsompolski?
@dongjoon-hyun I think OrcFileFormat has exactly the same issue as ParquetFileFormat, as @cloud-fan pointed out? When there was a column like
dongjoon-hyun
left a comment
Ah, you are right. I misread the context. Thank you, @juliuszsompolski.
Thanks. I updated the title after fixing Orc, but forgot to update the description, which was still describing only Parquet.
thanks, merging to master!
unfortunately it conflicts with 3.3, @juliuszsompolski could you open a backport PR? thanks!
[SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
### What changes were proposed in this pull request?
We move the WSCG-based decision about producing columnar output one level up, from ParquetFileFormat / OrcFileFormat to FileSourceScanExec, and pass it down as a new required option for ParquetFileFormat / OrcFileFormat. The semantics are now as follows:
* `ParquetFileFormat.supportBatch` and `OrcFileFormat.supportBatch` return whether the format **can** produce columnar output, not that it necessarily **will**.
* To get columnar output, the option `FileFormat.OPTION_RETURNING_BATCH` needs to be passed to `buildReaderWithPartitionValues` in these two file formats. It should only be set to `true` if `supportBatch` is also `true`, but it can be set to `false` if we don't want columnar output regardless. This way `FileSourceScanExec` can set it to `false` when there are more than 100 columns for WSCG, and `ParquetFileFormat` / `OrcFileFormat` don't have to concern themselves with WSCG limits (see the sketch after this list).
* To avoid it not being passed by accident, this option is made required. Making it required means updating a few call sites, but the error that results from forgetting it is very obscure; it's better to fail early and explicitly here.
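For illustration, a minimal sketch of how a caller such as `FileSourceScanExec` can pass the new option. The surrounding variable names (`options`, `pushedDownFilters`, `requiredSchema`, `hadoopConf`) are the ones used in the snippets later in this description; the exact wiring in Spark may differ slightly:
```
// Sketch of the caller side (assumed wiring): the scan decides once, based on
// its full output schema including metadata columns, and hands the decision
// to the file format via the now-required option.
val readerOptions: Map[String, String] =
  options + (FileFormat.OPTION_RETURNING_BATCH -> supportsColumnar.toString)

relation.fileFormat.buildReaderWithPartitionValues(
  sparkSession = relation.sparkSession,
  dataSchema = relation.dataSchema,
  partitionSchema = relation.partitionSchema,
  requiredSchema = requiredSchema,
  filters = pushedDownFilters,
  options = readerOptions,
  hadoopConf = hadoopConf)
```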
### Why are the changes needed?
The following explains the issue for `ParquetFileFormat`; `OrcFileFormat` had exactly the same issue.
`java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow` was being thrown because ParquetReader was outputting columnar batches, while FileSourceScanExec expected row output.
The mismatch comes from the fact that `ParquetFileFormat.supportBatch` depends on `WholeStageCodegenExec.isTooManyFields(conf, schema)`, where the threshold is 100 fields.
When this is used in `FileSourceScanExec`:
```
override lazy val supportsColumnar: Boolean = {
relation.fileFormat.supportBatch(relation.sparkSession, schema)
}
```
the `schema` comes from output attributes, which includes extra metadata attributes.
However, inside `ParquetFileFormat.buildReaderWithPartitionValues` it was calculated again as
```
relation.fileFormat.buildReaderWithPartitionValues(
  sparkSession = relation.sparkSession,
  dataSchema = relation.dataSchema,
  partitionSchema = relation.partitionSchema,
  requiredSchema = requiredSchema,
  filters = pushedDownFilters,
  options = options,
  hadoopConf = hadoopConf)
...
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
...
val returningBatch = supportBatch(sparkSession, resultSchema)
```
Where `requiredSchema` and `partitionSchema` wouldn't include the metadata columns:
```
FileSourceScanExec: output: List(c1#4608L, c2#4609L, ..., c100#4707L, file_path#6388)
FileSourceScanExec: dataSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
FileSourceScanExec: partitionSchema: StructType()
FileSourceScanExec: requiredSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
```
Columns like `file_path#6388` are added by the scan and contain metadata produced by the scan, not by the file reader, which concerns itself only with what is inside the file.
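For illustration, a hedged reproduction sketch of the failure mode before this fix, modeled on the log output above (the table layout is an assumption, not taken from the PR): 100 data columns keep `requiredSchema` within the WSCG field limit, so the reader chose batches, while the extra `_metadata.file_path` column pushes the scan's output to 101 fields, so `FileSourceScanExec` expected rows:
```
// Assumed reproduction sketch (not taken verbatim from the PR):
// 100 data columns plus the hidden _metadata.file_path column.
import org.apache.spark.sql.functions.col

val df = spark.range(10).select((1 to 100).map(i => col("id").as(s"c$i")): _*)
df.write.mode("overwrite").parquet("/tmp/spark40918")

spark.read.parquet("/tmp/spark40918")
  .select(col("*"), col("_metadata.file_path"))
  .collect()  // before the fix: ClassCastException, ColumnarBatch cannot be cast to InternalRow
```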
### Does this PR introduce _any_ user-facing change?
Not a public API change, but it is now required to pass `FileFormat.OPTION_RETURNING_BATCH` in `options` to `ParquetFileFormat.buildReaderWithPartitionValues`. The only user of this API in Apache Spark is `FileSourceScanExec`.
### How was this patch tested?
Tests added
Closes apache#38397 from juliuszsompolski/SPARK-40918.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
3.3 PR: #38431
[SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
Backports #38397 from juliuszsompolski/SPARK-40918.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Closes #38431 from juliuszsompolski/SPARK-40918-3.3.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>