Closed
@@ -78,7 +78,7 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
       }

       override def getFileExtension(context: TaskAttemptContext): String = {
-        ".csv" + CodecStreams.getCompressionExtension(context)
+        csvOptions.fileExtension + CodecStreams.getCompressionExtension(context)
       }
     }
   }
@@ -140,6 +140,8 @@ class CSVOptions(

   val inputBufferSize = 128

+  val fileExtension = parameters.getOrElse("fileExtension", ".csv")
+
   val isCommentSet = this.comment != '\u0000'

   def asWriterSettings: CsvWriterSettings = {
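
For illustration only, here is a Spark-free sketch of how the two hunks above appear to combine: the option lookup from `CSVOptions` supplies the base extension, and the compression extension is appended to it. The object name `FileExtensionSketch` and both helper methods are hypothetical. The plain `Map` and the `compressionExtension` argument are assumptions standing in for the real `parameters` map and for `CodecStreams.getCompressionExtension(context)`.

```scala
// Hypothetical stand-in for the behavior proposed in this PR; not Spark code.
object FileExtensionSketch {

  // Mirrors `parameters.getOrElse("fileExtension", ".csv")` from the diff.
  // Assumption: a plain, case-sensitive Map is used here for simplicity.
  def fileExtension(parameters: Map[String, String]): String =
    parameters.getOrElse("fileExtension", ".csv")

  // `compressionExtension` stands in for the value returned by
  // CodecStreams.getCompressionExtension(context), e.g. ".gz" or "".
  def outputExtension(
      parameters: Map[String, String],
      compressionExtension: String): String =
    fileExtension(parameters) + compressionExtension

  def main(args: Array[String]): Unit = {
    // Default: the previously hard-coded ".csv" is kept.
    println(outputExtension(Map.empty, ""))
    // Custom extension combined with a compression suffix.
    println(outputExtension(Map("fileExtension" -> ".tsv"), ".gz"))
    // Empty string omits the extension entirely.
    println(outputExtension(Map("fileExtension" -> ""), ""))
  }
}
```

Under these assumptions, an empty `fileExtension` yields no extension at all, and nothing ties the chosen extension to the configured delimiter.
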
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
Member

When delimiter is set to \t, is it still a CSV? : )

Ref: https://en.wikipedia.org/wiki/Tab-separated_values

Author

Sorry, I am having a really hard time understanding what you are asking here. Could you be more specific?

Member

I guess the whole API and implementation is still called "CSV". Of course, you can already read and write files with different names and different delimiters. Does this matter enough to add a new option? What if I delimit with, eh, null characters?

Author

@srowen and @gatorsmile, the reason I pushed this pull request is actually to omit the file extension completely. We can discuss the semantics of different delimiters and file formats, but the whole point of this pull request was to give users the option to change a hard-coded value.

Member

I would suggest leaving this out for now unless there is a better reason. The downside is that it allows an arbitrary name and does not guarantee that the extension is, say, tsv when the delimiter is a tab. It is purely up to the user.

I added those extensions long ago, and one of the motivations was auto-detection of the data source, as Hadoop does (which we have not added yet because of the cost of listing files, etc.).

Member

What is the reason why Hive introduced the conf hive.output.file.extension?

Member

Also, what is your usage scenario? It sounds like you want to omit the extension?

Author

@gatorsmile, Yes, my use case is to omit the extension, but I decided to make the implementation flexible i.e. .option("fileExtension", "")

Member

Just curious: why do you need to omit the extension?

Author

@gatorsmile, we provide data files to our clients and specify the file format as TAB-separated. I want to avoid any confusion where someone receives just the dataset and assumes the data is COMMA-separated.
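
For readers with the same use case, one possible workaround that does not depend on the option proposed here is to rename the part files after the write finishes. The sketch below uses only `java.nio`, so it assumes the output directory is on a local filesystem (not HDFS or an object store) and that the files are uncompressed. `RenameExtensionSketch` and `replaceExtension` are hypothetical names, not part of Spark.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.collection.JavaConverters._

// Hypothetical post-write helper; not part of Spark or of this PR.
object RenameExtensionSketch {

  // Renames every regular file in `dir` whose name ends with `from`
  // so that it ends with `to` instead. Passing "" for `to` drops the
  // extension. Returns the new paths.
  def replaceExtension(dir: Path, from: String, to: String): Seq[Path] = {
    val stream = Files.list(dir)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && p.getFileName.toString.endsWith(from))
        .toList // materialize the listing before moving any files
        .map { p =>
          val newName = p.getFileName.toString.stripSuffix(from) + to
          Files.move(p, p.resolveSibling(newName), StandardCopyOption.REPLACE_EXISTING)
        }
    } finally {
      stream.close()
    }
  }
}
```

For example, `RenameExtensionSketch.replaceExtension(java.nio.file.Paths.get(csvDir), ".csv", ".tsv")` would rename the part files written by a plain `.csv(csvDir)` call, and marker files such as `_SUCCESS` are left alone because their names do not end with `.csv`.
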

+        .csv(csvDir)
+
+      val tsvFiles = new File(csvDir).listFiles()
+      assert(tsvFiles.exists(_.getName.endsWith(".tsv")))
+
+      val carsCopy = spark.read
+        .option("header", "true")
+        .option("delimiter", "\t")
+        .csv(csvDir)
+
+      verifyCars(carsCopy, withHeader = true)
+    }
+  }
   test("SPARK-13543 Write the output as uncompressed via option()") {
     val extraOptions = Map(
       "mapreduce.output.fileoutputformat.compress" -> "true",