Skip to content

Conversation

@mikkokupsu
Copy link

What changes were proposed in this pull request?

Add new option to CSV Data Source to make changing hard coded ".csv" file extension to something else.
https://issues.apache.org/jira/browse/SPARK-20731

How was this patch tested?

Unit tests.

@mikkokupsu mikkokupsu changed the title SPARK-20731][SQL] Add ability to change or omit .csv file extension in CSV Data Source [SPARK-20731][SQL] Add ability to change or omit .csv file extension in CSV Data Source May 13, 2017
@gatorsmile
Copy link
Member

ok to test

cars.coalesce(1).write
.option("header", "true")
.option("fileExtension", ".tsv")
.option("delimiter", "\t")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When delimiter is set to \t, is it still a CSV? : )

Ref: https://en.wikipedia.org/wiki/Tab-separated_values

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have really hard time understanding what you asking here. Could you be more specific?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the whole API and implementation is called "CSV" still. Of course, you can already read/write files with different names and different delimiters. Does this matter enough to make a new option? what if I delimit with, eh, null characters?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen and @gatorsmile, the reason why I pushed this pull request is actually to omit the file extension completely. I guess we can discuss the semantics of different delimiters and file formats but the whole point of this pull request was to give users the option to change a hard coded value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user.

I added those extentions long ago and one of the motivation was auto detection of datasource like Haddop does (which we ended up with not adding it yet due to the cost of listing files and etc).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason why Hive introduced the conf hive.output.file.extension?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what is your usage scenario? It sounds like you want to omit the extension?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile, Yes, my use case is to omit the extension, but I decided to make the implementation flexible i.e. .option("fileExtension", "")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious what is the reason you need to omit the extension?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile, We provide data files to our clients and specify the file format to TAB separated. I want to avoid all confusion where someone receives just the dataset and confuses the data to be COMMA separated.

@SparkQA
Copy link

SparkQA commented May 13, 2017

Test build #76896 has finished for PR 17973 at commit 3839cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class SerializationStream extends Closeable
  • abstract class DeserializationStream extends Closeable
  • case class UnresolvedMapObjects(
  • case class OrderedJoin(
  • case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
  • case class NumericRange(min: Decimal, max: Decimal) extends Range

@gatorsmile
Copy link
Member

@mikkokupsu We do not plan to support this after an offline discussion with the other Committers. Thanks for your contribution. Could you close this PR please?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@asfgit asfgit closed this in 1a4fda8 Jul 19, 2018
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Dec 18, 2024
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Dec 20, 2024
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Dec 20, 2024
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Dec 20, 2024
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
jabbaugh pushed a commit to jabbaugh/spark that referenced this pull request Jan 9, 2025
What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request apache#17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

Does this PR introduce any user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?
One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?
No
dongjoon-hyun pushed a commit that referenced this pull request Jan 10, 2025
### What changes were proposed in this pull request?
The existing CSV DataSource allows one to set the delimiter/separator but does not allow the changing of the file extension. This means that a file can have values separated by tabs but me marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab separated value file).

### Why are the changes needed?
This PR adds an additional option to set the fileExtension. The end result is that when a separator is set that is not a comma that the output file has a file extension that matches the separator (e.g. file.tsv, file.psv, etc...).

Notes on Previous Pull Request #17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char then why not let the user also set the file extension to specify the separator that the file uses (e.g. tsv, psv, etc...). This addition keeps the "csv" file extension as the default and has the benefit of allowing other separators to match the file extension.

### Does this PR introduce _any_ user-facing change?
Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

### How was this patch tested?
One unit test was added to validate a file is written with the new extension.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49233 from jabbaugh/jbaugh-add-csv-file-ext.

Authored-by: Jim Baugh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants