@falaki
Contributor

falaki commented Jul 10, 2014

Implements RFC 4180 for parsing comma-separated value (CSV) files into SchemaRDDs, which includes:

  • Optional header line
  • Handling quoted fields
  • Handling new lines inside quoted fields
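
For readers unfamiliar with these RFC 4180 cases, here is a minimal sketch using Python's standard `csv` module rather than the code in this patch. The sample data is made up; it contains a header line, a quoted field with an embedded comma, and a quoted field with an embedded line break.

```python
import csv
import io

# Made-up sample: header line, a quoted field containing a comma,
# and a quoted field containing a line break.
data = 'name,comment\r\n"Smith, Jane","first line\nsecond line"\r\nBob,plain\r\n'

# newline="" leaves line endings untouched so the csv module can handle them.
reader = csv.reader(io.StringIO(data, newline=""))

header = next(reader)  # the optional header line
rows = list(reader)    # the remaining records, one list of strings each

print(header)
for row in rows:
    print(row)
```

Note that the record with the embedded line break occupies two physical lines but comes back as a single row.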

Adds two new methods to SQLContext:

  • csvFile: Takes a path and returns a SchemaRDD
  • csvRDD: Takes an RDD[String] and returns a SchemaRDD

TODO:

  • Every field is currently assumed to be of type String. We can either infer types (using a sample of the data) or let the user specify the column types.
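
One possible shape for the sample-based inference mentioned in the TODO is sketched below in plain Python. This is only an illustration of the idea, not something in this patch: the helper names `infer_type` and `infer_schema` are hypothetical, and only int, float, and str are considered.

```python
def infer_type(values):
    """Return the narrowest of int, float, str that parses every value."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate
        except ValueError:
            # At least one value does not parse; try the next wider type.
            continue
    return str


def infer_schema(sample_rows):
    """Infer one type per column from a small sample of already-split rows."""
    columns = zip(*sample_rows)
    return [infer_type(column) for column in columns]


# Made-up sample of string fields, as produced by a CSV tokenizer.
sample = [["1", "2.5", "a"], ["2", "3", "b"]]
schema = infer_schema(sample)
print(schema)
```

A sample can be misleading (a later row may contain a value that does not fit the inferred type), which is one argument for also letting the user specify types explicitly.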

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 10, 2014

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16487/consoleFull

@SparkQA

SparkQA commented Jul 10, 2014

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16487/consoleFull

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16487/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 10, 2014

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16490/consoleFull

Contributor

Is it ever possible for "quote" to be longer than one character?

Contributor Author

It is. If you look at CsvTokenizer, you will see that I have assumed the general case (as with the delimiter). I will add unit tests with multi-character quotes to demonstrate it.

Contributor

What are some examples of quotes that are more than one character long?

Contributor Author

A common one is '' (two single-quote characters).

Contributor

Do people actually use this?

@SparkQA

SparkQA commented Jul 10, 2014

QA results for PR 1351:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16490/consoleFull

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16490/

Contributor

textFile creates an RDD of lines. This doesn't work if a record contains newlines (inside quotes).

Contributor

To do this properly, we would need either a new input format that handles CSV line splits or a way to reassemble the lines from textFile.

Contributor Author

CsvTokenizer takes an Iterator[String] and implements Iterator[Array[Any]]. Its next() may end up reading two or more lines if it needs to (e.g., when a quoted field spans multiple lines). This assumes quoted fields are not split between partitions (noted in the API documentation).
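
To make the idea concrete, here is a rough Python sketch of an iterator that wraps a line iterator and pulls extra lines whenever a quoted field is still open. It is not the CsvTokenizer from this patch: the names `csv_records` and `split_record` are hypothetical, fields come back as strings, and it relies on the same assumption that a record is never split across partitions.

```python
def split_record(record, delimiter, quote):
    """Split one complete record into fields, honoring quoted sections."""
    fields, buf, in_quotes, i = [], [], False, 0
    while i < len(record):
        if record.startswith(quote, i):
            if in_quotes and record.startswith(quote, i + len(quote)):
                buf.append(quote)  # a doubled quote is a literal quote
                i += 2 * len(quote)
            else:
                in_quotes = not in_quotes  # opening or closing quote
                i += len(quote)
        elif not in_quotes and record.startswith(delimiter, i):
            fields.append("".join(buf))
            buf = []
            i += len(delimiter)
        else:
            buf.append(record[i])
            i += 1
    fields.append("".join(buf))
    return fields


def csv_records(lines, delimiter=",", quote='"'):
    """Yield one list of fields per record from an iterator of lines."""
    lines = iter(lines)
    for line in lines:
        # An odd number of quotes means a quoted field is still open,
        # so keep consuming lines until the record is complete.
        while line.count(quote) % 2 == 1:
            extra = next(lines, None)
            if extra is None:
                raise ValueError("unterminated quoted field")
            line = line + "\n" + extra
        yield split_record(line, delimiter, quote)


# Three physical lines, two logical records.
records = list(csv_records(['a,"x', 'y",b', "c,d,e"]))
print(records)
```

If the line that closes a quoted field lives in a different partition, the iterator above has no way to find it, which is the limitation raised in the comments above.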

@SparkQA

SparkQA commented Jul 10, 2014

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16525/consoleFull

@SparkQA

SparkQA commented Jul 15, 2014

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16668/consoleFull

@SparkQA

SparkQA commented Jul 15, 2014

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16668/consoleFull

@marmbrus
Contributor

Can you add [SQL] to the title please?

@falaki changed the title from "[WIP] SPARK-2360: CSV import to SchemaRDDs" to "[WIP][SQL] SPARK-2360: CSV import to SchemaRDDs" on Jul 15, 2014

@SparkQA

SparkQA commented Jul 15, 2014

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16692/consoleFull

@SparkQA

SparkQA commented Jul 15, 2014

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16692/consoleFull

@SparkQA

SparkQA commented Jul 24, 2014

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17077/consoleFull

@SparkQA

SparkQA commented Jul 24, 2014

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17077/consoleFull

@SparkQA

SparkQA commented Jul 24, 2014

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17090/consoleFull

@SparkQA

SparkQA commented Jul 24, 2014

QA results for PR 1351:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17090/consoleFull

@erikerlandson
Contributor

@falaki the reading of the CSV header might be made lazy, using something similar to:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/

@marmbrus
Contributor

@erikerlandson, making more things lazy would be great. Do you plan to contribute PromiseRDD to Spark core?

@erikerlandson
Contributor

Hi @marmbrus, I submitted PromiseRDD as part of PR #1839, which is awaiting review.

There is another, somewhat different approach that I used for sortByKey in PR #1689. My intuition is that using something like PromiseRDD is preferable when possible, because it respects the RDD computation model. (I do not think it is possible for sortByKey, because in that case the sampling job is required for initializing the partitions themselves.)

These are under an umbrella ticket:
https://issues.apache.org/jira/browse/SPARK-2992

Contributor

You could use '"' here without escaping.

Contributor

Also, there should be no space between an argument name and its default value, e.g. header=False.
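
As an illustration of the two style points in these review comments, a Python signature following them might look like the sketch below. The function name `csv_file` and its parameters are hypothetical stand-ins, not the actual code under review; the function only echoes its options so the example can be run.

```python
def csv_file(path, delimiter=",", quote='"', header=False):
    """Hypothetical signature; returns the resolved options for illustration."""
    # '"' needs no backslash because the string is delimited by single quotes,
    # and keyword defaults are written without spaces around "=" (PEP 8).
    return {"path": path, "delimiter": delimiter, "quote": quote, "header": header}


options = csv_file("people.csv", header=True)
print(options)
```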

@JoshRosen
Contributor

@marmbrus commented on this on JIRA:

Hey Hossein, I'm going to close this since I think we have decided this feature would work best as a separate library using the new Data Source API.

As a result, do you mind closing this PR? Thanks!

@falaki closed this Nov 11, 2014