Commit fb2bdea
[SPARK-40918][SQL][3.3] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
### What changes were proposed in this pull request?
We move the decision about supporting columnar output based on WSCG one level from ParquetFileFormat / OrcFileFormat up to FileSourceScanExec, and pass it as a new required option for ParquetFileFormat / OrcFileFormat. Now the semantics is as follows:
* `ParquetFileFormat.supportsBatch` and `OrcFileFormat.supportsBatch` returns whether it **can**, not necessarily **will** return columnar output.
* To return columnar output, an option `FileFormat.OPTION_RETURNING_BATCH` needs to be passed to `buildReaderWithPartitionValues` in these two file formats. It should only be set to `true` if `supportsBatch` is also `true`, but it can be set to `false` if we don't want columnar output nevertheless - this way, `FileSourceScanExec` can set it to false when there are more than 100 columsn for WSCG, and `ParquetFileFormat` / `OrcFileFormat` doesn't have to concern itself about WSCG limits.
* To avoid not passing it by accident, this option is made required. Making it required requires updating a few places that use it, but an error resulting from this is very obscure. It's better to fail early and explicitly here.
### Why are the changes needed?
This explains it for `ParquetFileFormat`. `OrcFileFormat` had exactly the same issue.
`java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow` was being thrown because ParquetReader was outputting columnar batches, while FileSourceScanExec expected row output.
The mismatch comes from the fact that `ParquetFileFormat.supportBatch` depends on `WholeStageCodegenExec.isTooManyFields(conf, schema)`, where the threshold is 100 fields.
When this is used in `FileSourceScanExec`:
```
override lazy val supportsColumnar: Boolean = {
relation.fileFormat.supportBatch(relation.sparkSession, schema)
}
```
the `schema` comes from output attributes, which includes extra metadata attributes.
However, inside `ParquetFileFormat.buildReaderWithPartitionValues` it was calculated again as
```
relation.fileFormat.buildReaderWithPartitionValues(
sparkSession = relation.sparkSession,
dataSchema = relation.dataSchema,
partitionSchema = relation.partitionSchema,
requiredSchema = requiredSchema,
filters = pushedDownFilters,
options = options,
hadoopConf = hadoopConf
...
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
...
val returningBatch = supportBatch(sparkSession, resultSchema)
```
Where `requiredSchema` and `partitionSchema` wouldn't include the metadata columns:
```
FileSourceScanExec: output: List(c1#4608L, c2#4609L, ..., c100#4707L, file_path#6388)
FileSourceScanExec: dataSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
FileSourceScanExec: partitionSchema: StructType()
FileSourceScanExec: requiredSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
```
Column like `file_path#6388` are added by the scan, and contain metadata added by the scan, not by the file reader which concerns itself with what is within the file.
### Does this PR introduce _any_ user-facing change?
Not a public API change, but it is now required to pass `FileFormat.OPTION_RETURNING_BATCH` in `options` to `ParquetFileFormat.buildReaderWithPartitionValues`. The only user of this API in Apache Spark is `FileSourceScanExec`.
### How was this patch tested?
Tests added
Backports apache#38397 from juliuszsompolski/SPARK-40918.
Authored-by: Juliusz Sompolski <julekdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>
Closes apache#38431 from juliuszsompolski/SPARK-40918-3.3.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>1 parent c7ef560 commit fb2bdea
File tree
5 files changed
+107
-12
lines changed- sql/core/src
- main/scala/org/apache/spark/sql/execution
- datasources
- orc
- parquet
- test/scala/org/apache/spark/sql/execution/datasources
5 files changed
+107
-12
lines changedLines changed: 9 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
211 | 211 | | |
212 | 212 | | |
213 | 213 | | |
214 | | - | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
215 | 220 | | |
216 | 221 | | |
217 | 222 | | |
| |||
447 | 452 | | |
448 | 453 | | |
449 | 454 | | |
| 455 | + | |
| 456 | + | |
450 | 457 | | |
451 | 458 | | |
452 | 459 | | |
453 | 460 | | |
454 | 461 | | |
455 | 462 | | |
456 | 463 | | |
457 | | - | |
| 464 | + | |
458 | 465 | | |
459 | 466 | | |
460 | 467 | | |
| |||
Lines changed: 13 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
60 | 60 | | |
61 | 61 | | |
62 | 62 | | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
63 | 68 | | |
64 | 69 | | |
65 | 70 | | |
| |||
184 | 189 | | |
185 | 190 | | |
186 | 191 | | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
187 | 200 | | |
188 | 201 | | |
189 | 202 | | |
| |||
Lines changed: 29 additions & 4 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
39 | | - | |
40 | 39 | | |
41 | 40 | | |
42 | 41 | | |
| |||
100 | 99 | | |
101 | 100 | | |
102 | 101 | | |
103 | | - | |
104 | | - | |
| 102 | + | |
105 | 103 | | |
106 | 104 | | |
107 | 105 | | |
| |||
113 | 111 | | |
114 | 112 | | |
115 | 113 | | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
116 | 126 | | |
117 | 127 | | |
118 | 128 | | |
| |||
124 | 134 | | |
125 | 135 | | |
126 | 136 | | |
127 | | - | |
128 | 137 | | |
129 | 138 | | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
130 | 155 | | |
131 | 156 | | |
132 | 157 | | |
| |||
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
Lines changed: 30 additions & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
45 | 45 | | |
46 | 46 | | |
47 | 47 | | |
48 | | - | |
49 | 48 | | |
50 | 49 | | |
51 | 50 | | |
| |||
169 | 168 | | |
170 | 169 | | |
171 | 170 | | |
172 | | - | |
| 171 | + | |
173 | 172 | | |
174 | 173 | | |
175 | 174 | | |
176 | | - | |
177 | | - | |
| 175 | + | |
178 | 176 | | |
179 | 177 | | |
180 | 178 | | |
| |||
197 | 195 | | |
198 | 196 | | |
199 | 197 | | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
200 | 210 | | |
201 | 211 | | |
202 | 212 | | |
| |||
245 | 255 | | |
246 | 256 | | |
247 | 257 | | |
248 | | - | |
249 | | - | |
250 | 258 | | |
251 | 259 | | |
252 | 260 | | |
| |||
257 | 265 | | |
258 | 266 | | |
259 | 267 | | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
| 274 | + | |
| 275 | + | |
| 276 | + | |
| 277 | + | |
| 278 | + | |
| 279 | + | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
260 | 284 | | |
261 | 285 | | |
262 | 286 | | |
| |||
Lines changed: 26 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
622 | 622 | | |
623 | 623 | | |
624 | 624 | | |
| 625 | + | |
| 626 | + | |
| 627 | + | |
| 628 | + | |
| 629 | + | |
| 630 | + | |
| 631 | + | |
| 632 | + | |
| 633 | + | |
| 634 | + | |
| 635 | + | |
| 636 | + | |
| 637 | + | |
| 638 | + | |
| 639 | + | |
| 640 | + | |
| 641 | + | |
| 642 | + | |
| 643 | + | |
| 644 | + | |
| 645 | + | |
| 646 | + | |
| 647 | + | |
| 648 | + | |
| 649 | + | |
| 650 | + | |
625 | 651 | | |
0 commit comments