[SPARK-13425][SQL] Documentation for CSV datasource options #12817
Test build #57463 has finished for PR 12817 at commit
* You can set the following CSV-specific options to deal with CSV files:
* <li>`sep` or `delimiter` (default `,`): sets the single character as a delimiter for each
* field and value.</li>
* <li>`encoding` or `charset` (default `UTF-8`): decodes the CSV files by the given encoding
Aliases were added for `codec`, `charset` and `delimiter`, whose primary names are `compression`, `encoding` and `sep`.
I remember it was decided that `codec` is not documented but is still supported for backward compatibility, because it does not look like a good name.
As I thought that `sep` and `delimiter`, as well as `encoding` and `charset`, are all good names, I just added both, but I am happy to get rid of them if anyone thinks otherwise.
can we remove all the aliases? it'd be great to just have the primary one here.
BTW a really good reference for these documentation options is R: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
Sure. Thank you. Let me correct them.
Also, I will add a document as soon as R csv API is added.
I tried to add the R API in #11457 but had to ask someone to do this as I am not really familiar with R.
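For readers following the alias discussion above, here is a minimal sketch of how a primary option name with backward-compatible aliases could be resolved. It is an illustration only, not Spark's implementation: the helper `get_option` and the sample `opts` dictionary are hypothetical, and only the option names (`sep`/`delimiter`, `encoding`/`charset`) and their defaults come from the diff excerpt.

```python
from typing import Any, Mapping, Sequence


def get_option(
    options: Mapping[str, Any],
    primary: str,
    aliases: Sequence[str] = (),
    default: Any = None,
) -> Any:
    """Hypothetical helper: look up `primary` first, then each alias, then fall back to `default`."""
    for key in (primary, *aliases):
        if key in options:
            return options[key]
    return default


# A caller that still uses the older alias names (sample data, not from the PR).
opts = {"delimiter": "\t", "charset": "ISO-8859-1"}

sep = get_option(opts, "sep", aliases=("delimiter",), default=",")
encoding = get_option(opts, "encoding", aliases=("charset",), default="UTF-8")
```

With this shape, only the primary name needs to appear in the documentation, while existing callers that pass an alias keep working.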
Test build #57464 has finished for PR 12817 at commit
* the delimiter can be part of the value.</li>
* <li>`escape` (default `\`): sets the single character used for escaping quotes inside
* an already quoted value.</li>
* <li>`comment` (default empty string): sets the single character used for skipping lines beginning
I am less sure that empty string is okay.
cc @rxin
Test build #57466 has finished for PR 12817 at commit
* ``escape`` (default ``\``): sets the single character used for escaping quotes \
inside an already quoted value.
* ``header`` (default ``false``): writes the names of columns as the first line.
* ``nullValue`` (default empty string): sets the string representation of a null value.
I am less sure that empty string is okay.
Test build #57467 has finished for PR 12817 at commit
Test build #57471 has finished for PR 12817 at commit
Test build #57479 has finished for PR 12817 at commit
* through the entire data once, specify the schema explicitly using [[schema]].
*
* You can set the following CSV-specific options to deal with CSV files:
* <li>`delimiter` (default `,`): sets the single character as a delimiter for each
sep is the main one isn't it?
Thanks - this is really close. Let's fix the minor issues and then we can merge.
@rxin I am so sorry, I think I totally misunderstood your earlier comments. I just addressed your latest comments and corrected them. Thank you.
The scala changes lgtm. One thing I just realized -- for Python, I think we should turn those into named arguments, rather than just options. Can you do that in a separate pr?
Sure. Thank you. Do you want me to remove Python documentation here for now?
Don't worry about it. We can just build it on top of this (we should still document them, just better as function arguments).
@rxin I see. Thank you.
Test build #57491 has finished for PR 12817 at commit
Thanks - merging in master and branch-2.0.
I also created a follow-up ticket for moving the options to function arguments: https://issues.apache.org/jira/browse/SPARK-15050
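To illustrate the follow-up idea of exposing options as function arguments in Python, here is a rough sketch. The function `csv_options` is hypothetical and is not the signature that was eventually adopted in PySpark; the argument names are simply the option names that appear in this review (`sep`, `encoding`, `escape`, `comment`, `header`, `nullValue`).

```python
from typing import Any, Dict, Optional


def csv_options(
    sep: Optional[str] = None,
    encoding: Optional[str] = None,
    escape: Optional[str] = None,
    comment: Optional[str] = None,
    header: Optional[bool] = None,
    nullValue: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect only the explicitly set named arguments into an options dict."""
    candidates = {
        "sep": sep,
        "encoding": encoding,
        "escape": escape,
        "comment": comment,
        "header": header,
        "nullValue": nullValue,
    }
    # Arguments left as None are omitted so the reader's own defaults still apply.
    return {name: value for name, value in candidates.items() if value is not None}


opts = csv_options(sep="|", header=True)
```

Named arguments make each option discoverable through the function signature and docstring, whereas free-form string options are only discoverable through prose documentation. A dictionary like `opts` could then be forwarded to a reader, for example through `DataFrameReader.options(**opts)`.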
## What changes were proposed in this pull request?
This PR adds the explanation and documentation for CSV options for reading and writing.
## How was this patch tested?
Style tests with `./dev/run_tests` for documentation style.
Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>
Closes #12817 from HyukjinKwon/SPARK-13425.
(cherry picked from commit a832cef)
Signed-off-by: Reynold Xin <[email protected]>
What changes were proposed in this pull request?
This PR adds the explanation and documentation for CSV options for reading and writing.
How was this patch tested?
Style tests with `./dev/run_tests` for documentation style.