
Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented May 1, 2016

What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

How was this patch tested?

Style tests with ./dev/run_tests for documentation style.

@SparkQA

SparkQA commented May 1, 2016

Test build #57463 has finished for PR 12817 at commit 4471e2a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* You can set the following CSV-specific options to deal with CSV files:
* <li>`sep` or `delimiter` (default `,`): sets the single character as a delimiter for each
* field and value.</li>
* <li>`encoding` or `charset` (default `UTF-8`): decodes the CSV files by the given encoding
Member Author


Aliases were added for codec, charset and delimiter as compression, charset and sep.
I remember it was decided that `codec` should be kept for backward compatibility but left undocumented, since it does not look like a good name.
Since both `sep`/`delimiter` and `encoding`/`charset` seemed like good names to me, I just added them, but I am happy to get rid of them if anyone thinks otherwise.
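For context, these two options control low-level parsing. A minimal sketch of what `sep`/`delimiter` and `encoding`/`charset` mean, using Python's stdlib `csv` module rather than Spark itself (the sample data is made up):

```python
import csv
import io

# A semicolon-delimited file, as `sep`/`delimiter` would describe it.
raw = "name;age\nAlice;30\nBob;25\n"

# csv.reader's `delimiter` plays the same role as Spark's `sep` option.
rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
print(rows)  # [['name', 'age'], ['Alice', '30'], ['Bob', '25']]

# `encoding`/`charset` decides how raw bytes are decoded before parsing.
data = "café;1\n".encode("latin-1")
decoded = data.decode("latin-1")  # wrong charset here would garble "café"
```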

Contributor


can we remove all the aliases? it'd be great to just have the primary one here.

Contributor


BTW a really good reference for these documentation options is R: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

Member Author

@HyukjinKwon HyukjinKwon May 1, 2016


Sure. Thank you. Let me correct them.
Also, I will add documentation as soon as the R CSV API is added.
I tried to add the R API in #11457 but had to ask someone else to do it, as I am not really familiar with R.

@SparkQA

SparkQA commented May 1, 2016

Test build #57464 has finished for PR 12817 at commit 52efce1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented May 1, 2016

Hm.. the Python style test passes for me locally. Ah, I didn't have Sphinx installed locally.

* the delimiter can be part of the value.</li>
* <li>`escape` (default `\`): sets the single character used for escaping quotes inside
* an already quoted value.</li>
* <li>`comment` (default empty string): sets the single character used for skipping lines beginning
Member Author

@HyukjinKwon HyukjinKwon May 1, 2016


I am less sure that empty string is okay.
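To illustrate what the empty-string default for `comment` means in practice, here is a small sketch using Python's stdlib `csv` module rather than Spark (the helper `skip_comments` and the sample data are hypothetical):

```python
import csv
import io

raw = "# generated file\nname,age\nAlice,30\n"

def skip_comments(lines, comment_char):
    # An empty comment character (the documented default) disables
    # skipping entirely; otherwise lines starting with it are dropped.
    for line in lines:
        if comment_char and line.startswith(comment_char):
            continue
        yield line

rows = list(csv.reader(skip_comments(io.StringIO(raw), "#")))
print(rows)  # [['name', 'age'], ['Alice', '30']]

rows_all = list(csv.reader(skip_comments(io.StringIO(raw), "")))
print(len(rows_all))  # 3 -- nothing skipped when the char is empty
```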

@HyukjinKwon
Member Author

cc @rxin

@SparkQA

SparkQA commented May 1, 2016

Test build #57466 has finished for PR 12817 at commit 34b52fa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented May 1, 2016

@rxin BTW, I found two TODOs, `TODO: Remove this one in Spark 2.0.`, in DataFrameReader and DataFrameWriter, added in #9945. Do you mind if I submit another PR to remove them? (Or I might be able to do that in this PR if it does not break any tests.)

* ``escape`` (default ``\``): sets the single character used for escaping quotes \
inside an already quoted value.
* ``header`` (default ``false``): writes the names of columns as the first line.
* ``nullValue`` (default empty string): sets the string representation of a null value.
Member Author


I am less sure that empty string is okay.
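For context, a sketch of what an empty-string `nullValue` implies on read, using Python's stdlib `csv` module rather than Spark (the helper `read_with_nulls` is hypothetical):

```python
import csv
import io

NULL_VALUE = ""  # the documented default for `nullValue`

def read_with_nulls(text, null_value=NULL_VALUE):
    # Any field equal to `null_value` is surfaced as Python None,
    # roughly analogous to SQL NULL in Spark.
    reader = csv.reader(io.StringIO(text))
    return [[None if f == null_value else f for f in row] for row in reader]

rows = read_with_nulls("Alice,30\nBob,\n")
print(rows)  # [['Alice', '30'], ['Bob', None]]
```

The concern above is real: with this default, an empty field and a true null become indistinguishable on a round trip.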

@SparkQA

SparkQA commented May 1, 2016

Test build #57467 has finished for PR 12817 at commit b9aeac1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 1, 2016

Test build #57471 has finished for PR 12817 at commit 8201f23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 1, 2016

Test build #57479 has finished for PR 12817 at commit 54f58d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* through the entire data once, specify the schema explicitly using [[schema]].
*
* You can set the following CSV-specific options to deal with CSV files:
* <li>`delimiter` (default `,`): sets the single character as a delimiter for each
Contributor


sep is the main one, isn't it?

@rxin
Contributor

rxin commented May 1, 2016

Thanks - this is really close. Let's fix the minor issues and then we can merge.

@HyukjinKwon
Member Author

HyukjinKwon commented May 2, 2016

@rxin I am so sorry, I think I totally misunderstood your earlier comments. I have now addressed your latest comments and corrected them. Thank you.

@rxin
Contributor

rxin commented May 2, 2016

The scala changes lgtm.

One thing I just realized -- for Python, I think we should turn those into named arguments, rather than just options. Can you do that in a separate pr?
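To illustrate the suggestion, a minimal sketch of how such a keyword-argument wrapper might collect options into a dict (a hypothetical signature for illustration, not the eventual PySpark API):

```python
def csv(path, sep=",", encoding="UTF-8", header=False, null_value=""):
    """Hypothetical sketch: accept CSV options as named arguments and
    collect them into an options dict, instead of requiring callers to
    chain generic .option(key, value) calls."""
    options = {
        "sep": sep,
        "encoding": encoding,
        "header": str(header).lower(),  # Spark-style lowercase booleans
        "nullValue": null_value,
    }
    return path, options

path, opts = csv("people.csv", sep=";", header=True)
print(opts["sep"], opts["header"])  # ; true
```

Named arguments make the supported options discoverable from the function signature and docstring, which is the point of the follow-up.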

@HyukjinKwon
Member Author

HyukjinKwon commented May 2, 2016

Sure. Thank you. Do you want me to remove Python documentation here for now?

@rxin
Contributor

rxin commented May 2, 2016

Don't worry about it. We can just build it on top of this (we should still document them, just better as function arguments).

@HyukjinKwon
Member Author

@rxin I see. Thank you.

@SparkQA

SparkQA commented May 2, 2016

Test build #57491 has finished for PR 12817 at commit ab70b6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 2, 2016

Thanks - merging in master and branch-2.0.

@rxin
Contributor

rxin commented May 2, 2016

I also created a follow-up ticket for moving the options to function arguments: https://issues.apache.org/jira/browse/SPARK-15050

asfgit pushed a commit that referenced this pull request May 2, 2016
## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes #12817 from HyukjinKwon/SPARK-13425.

(cherry picked from commit a832cef)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in a832cef May 2, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-13425 branch January 2, 2018 03:40