Commit c4b0639

HyukjinKwon authored and dongjoon-hyun committed
[SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding
### What changes were proposed in this pull request?

This PR proposes to use the text datasource in CSV's schema inference. It shares the same rationale as SPARK-18362, SPARK-19885 and SPARK-19918: we currently use a Hadoop RDD when the encoding is not UTF-8, which is unnecessary. This PR completes SPARK-18362 and addresses the comment at #15813 (comment).

We had also better keep the code paths consistent with the existing CSV and JSON datasources; currently, only CSV schema inference with an encoding other than UTF-8 takes a different path. In addition, this PR may amount to a bug fix: Spark session configurations, such as Hadoop configurations, are not respected during CSV schema inference when the encoding is different, since they have to be set on the Spark context instead.

### Why are the changes needed?

For consistency, potentially better performance, and to fix a potential corner-case bug.

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Existing tests should cover this.

Closes #29063 from HyukjinKwon/SPARK-32270.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
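For reference (not part of the patch), the changed path is exercised when CSV schema inference runs with a non-UTF-8 encoding. A minimal sketch, using a hypothetical ISO-8859-1 encoded file path:

```scala
// Hypothetical example that hits the changed code path: schema inference
// over a CSV file with an explicit non-UTF-8 encoding. The path is made up.
val df = spark.read
  .option("encoding", "ISO-8859-1") // non-UTF-8: previously read via Hadoop RDD
  .option("inferSchema", "true")    // triggers the schema-inference scan
  .option("header", "true")
  .csv("/tmp/people-latin1.csv")
```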
1 parent c56c84a commit c4b0639

1 file changed (+14 −12 lines)


sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

Lines changed: 14 additions & 12 deletions
```diff
@@ -150,21 +150,23 @@ object TextInputCSVDataSource extends CSVDataSource {
       inputPaths: Seq[FileStatus],
       options: CSVOptions): Dataset[String] = {
     val paths = inputPaths.map(_.getPath.toString)
+    val df = sparkSession.baseRelationToDataFrame(
+      DataSource.apply(
+        sparkSession,
+        paths = paths,
+        className = classOf[TextFileFormat].getName,
+        options = options.parameters
+      ).resolveRelation(checkFilesExist = false))
+      .select("value").as[String](Encoders.STRING)
+
     if (Charset.forName(options.charset) == StandardCharsets.UTF_8) {
-      sparkSession.baseRelationToDataFrame(
-        DataSource.apply(
-          sparkSession,
-          paths = paths,
-          className = classOf[TextFileFormat].getName,
-          options = options.parameters
-        ).resolveRelation(checkFilesExist = false))
-        .select("value").as[String](Encoders.STRING)
+      df
     } else {
       val charset = options.charset
-      val rdd = sparkSession.sparkContext
-        .hadoopFile[LongWritable, Text, TextInputFormat](paths.mkString(","))
-        .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
-      sparkSession.createDataset(rdd)(Encoders.STRING)
+      sparkSession.createDataset(df.queryExecution.toRdd.map { row =>
+        val bytes = row.getBinary(0)
+        new String(bytes, 0, bytes.length, charset)
+      })(Encoders.STRING)
     }
   }
 }
```
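As a side note on the new branch (a standalone sketch, not code from the patch): the text datasource keeps each line's raw bytes, so the non-UTF-8 branch can decode them with the user-specified charset instead of assuming UTF-8. For example:

```scala
import java.nio.charset.Charset

// Standalone illustration of the decoding done in the new branch above,
// i.e. `new String(bytes, 0, bytes.length, charset)`.
// The bytes 0xC9 0x74 0xE9 spell "Été" in ISO-8859-1 but are not valid UTF-8.
val bytes = Array(0xC9.toByte, 0x74.toByte, 0xE9.toByte)
val decoded = new String(bytes, 0, bytes.length, Charset.forName("ISO-8859-1"))
println(decoded) // Été
```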

0 commit comments