
Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented May 1, 2016

What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

How was this patch tested?

Style tests with ./dev/run_tests for documentation style.

@SparkQA

SparkQA commented May 1, 2016

Test build #57463 has finished for PR 12817 at commit 4471e2a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* You can set the following CSV-specific options to deal with CSV files:
* <li>`sep` or `delimiter` (default `,`): sets the single character as a delimiter for each
* field and value.</li>
* <li>`encoding` or `charset` (default `UTF-8`): decodes the CSV files by the given encoding
Member Author


Aliases were added for codec, charset and delimiter as compression, charset and sep.
I remember it was decided that `codec` should be kept for backward compatibility but left undocumented, since it does not look like a good name.
Since both `sep`/`delimiter` and `encoding`/`charset` seemed like good names to me, I just added them, but I am happy to get rid of them if anyone thinks otherwise.
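For context, these two options control low-level parsing. A minimal sketch of what `sep`/`delimiter` and `encoding`/`charset` mean, using Python's stdlib `csv` module rather than Spark itself (the sample data is made up):

```python
import csv
import io

# A semicolon-delimited file, as `sep`/`delimiter` would describe it.
raw = "name;age\nAlice;30\nBob;25\n"

# csv.reader's `delimiter` plays the same role as Spark's `sep` option.
rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
print(rows)  # [['name', 'age'], ['Alice', '30'], ['Bob', '25']]

# `encoding`/`charset` decides how raw bytes are decoded before parsing.
data = "café;1\n".encode("latin-1")
decoded = data.decode("latin-1")  # wrong charset here would garble "café"
```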

Contributor


can we remove all the aliases? it'd be great to just have the primary one here.

Contributor


BTW a really good reference for these documentation options is R: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

Member Author

@HyukjinKwon HyukjinKwon May 1, 2016


Sure. Thank you. Let me correct them.
Also, I will add documentation as soon as the R CSV API is added.
I tried to add the R API in #11457 but had to ask someone else to do it, as I am not really familiar with R.

@SparkQA

SparkQA commented May 1, 2016

Test build #57464 has finished for PR 12817 at commit 52efce1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented May 1, 2016

Hm.. the Python style test passes for me locally. Ah, I didn't have Sphinx installed locally.

* the delimiter can be part of the value.</li>
* <li>`escape` (default `\`): sets the single character used for escaping quotes inside
* an already quoted value.</li>
* <li>`comment` (default empty string): sets the single character used for skipping lines beginning
Member Author

@HyukjinKwon HyukjinKwon May 1, 2016


I am less sure that empty string is okay.
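To illustrate what the empty-string default for `comment` means in practice, here is a small sketch using Python's stdlib `csv` module rather than Spark (the helper `skip_comments` and the sample data are hypothetical):

```python
import csv
import io

raw = "# generated file\nname,age\nAlice,30\n"

def skip_comments(lines, comment_char):
    # An empty comment character (the documented default) disables
    # skipping entirely; otherwise lines starting with it are dropped.
    for line in lines:
        if comment_char and line.startswith(comment_char):
            continue
        yield line

rows = list(csv.reader(skip_comments(io.StringIO(raw), "#")))
print(rows)  # [['name', 'age'], ['Alice', '30']]

rows_all = list(csv.reader(skip_comments(io.StringIO(raw), "")))
print(len(rows_all))  # 3 -- nothing skipped when the char is empty
```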

@HyukjinKwon
Member Author

cc @rxin

@SparkQA

SparkQA commented May 1, 2016

Test build #57466 has finished for PR 12817 at commit 34b52fa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented May 1, 2016

@rxin BTW, I found two TODOs, `TODO: Remove this one in Spark 2.0.`, in DataFrameReader and DataFrameWriter, added in #9945. Do you mind if I submit another PR to remove them? (Or I might be able to do that in this PR if it does not break any tests.)

* ``escape`` (default ``\``): sets the single character used for escaping quotes \
inside an already quoted value.
* ``header`` (default ``false``): writes the names of columns as the first line.
* ``nullValue`` (default empty string): sets the string representation of a null value.
Member Author


I am less sure that empty string is okay.
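For context, a sketch of what an empty-string `nullValue` implies on read, using Python's stdlib `csv` module rather than Spark (the helper `read_with_nulls` is hypothetical):

```python
import csv
import io

NULL_VALUE = ""  # the documented default for `nullValue`

def read_with_nulls(text, null_value=NULL_VALUE):
    # Any field equal to `null_value` is surfaced as Python None,
    # roughly analogous to SQL NULL in Spark.
    reader = csv.reader(io.StringIO(text))
    return [[None if f == null_value else f for f in row] for row in reader]

rows = read_with_nulls("Alice,30\nBob,\n")
print(rows)  # [['Alice', '30'], ['Bob', None]]
```

The concern above is real: with this default, an empty field and a true null become indistinguishable on a round trip.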

@SparkQA

SparkQA commented May 1, 2016

Test build #57467 has finished for PR 12817 at commit b9aeac1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 1, 2016

Test build #57471 has finished for PR 12817 at commit 8201f23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 1, 2016

Test build #57479 has finished for PR 12817 at commit 54f58d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* through the entire data once, specify the schema explicitly using [[schema]].
*
* You can set the following CSV-specific options to deal with CSV files:
* <li>`delimiter` (default `,`): sets the single character as a delimiter for each
Contributor


sep is the main one, isn't it?

@rxin
Contributor

rxin commented May 1, 2016

Thanks - this is really close. Let's fix the minor issues and then we can merge.

@HyukjinKwon
Member Author

HyukjinKwon commented May 2, 2016

@rxin I am so sorry, I think I totally misunderstood your earlier comments. I have now addressed your latest comments and corrected them. Thank you.

@rxin
Contributor

rxin commented May 2, 2016

The scala changes lgtm.

One thing I just realized -- for Python, I think we should turn those into named arguments, rather than just options. Can you do that in a separate pr?
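To illustrate the suggestion, a minimal sketch of how such a keyword-argument wrapper might collect options into a dict (a hypothetical signature for illustration, not the eventual PySpark API):

```python
def csv(path, sep=",", encoding="UTF-8", header=False, null_value=""):
    """Hypothetical sketch: accept CSV options as named arguments and
    collect them into an options dict, instead of requiring callers to
    chain generic .option(key, value) calls."""
    options = {
        "sep": sep,
        "encoding": encoding,
        "header": str(header).lower(),  # Spark-style lowercase booleans
        "nullValue": null_value,
    }
    return path, options

path, opts = csv("people.csv", sep=";", header=True)
print(opts["sep"], opts["header"])  # ; true
```

Named arguments make the supported options discoverable from the function signature and docstring, which is the point of the follow-up.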

@HyukjinKwon
Member Author

HyukjinKwon commented May 2, 2016

Sure. Thank you. Do you want me to remove Python documentation here for now?

@rxin
Contributor

rxin commented May 2, 2016

Don't worry about it. We can just build it on top of this (we should still document them, just better as function arguments).

@HyukjinKwon
Member Author

@rxin I see. Thank you.

@SparkQA

SparkQA commented May 2, 2016

Test build #57491 has finished for PR 12817 at commit ab70b6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 2, 2016

Thanks - merging in master and branch-2.0.

@rxin
Contributor

rxin commented May 2, 2016

I also created a follow-up ticket for moving the options to function arguments: https://issues.apache.org/jira/browse/SPARK-15050

asfgit pushed a commit that referenced this pull request May 2, 2016
## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes #12817 from HyukjinKwon/SPARK-13425.

(cherry picked from commit a832cef)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in a832cef May 2, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-13425 branch January 2, 2018 03:40