[WIP][SQL] SPARK-2360: CSV import to SchemaRDDs #1351

falaki · 2014-07-10T02:33:29Z

Implements RFC 4180 for parsing comma-separated value (CSV) files into SchemaRDDs, which includes:

Optional header line
Handling quoted fields
Handling new lines inside quoted fields

Adds two new methods to SQLContext

csvFile: Takes a path and returns a SchemaRDD
csvRDD: Takes an RDD[String] and returns a SchemaRDD

TODO:

Every field is assumed to be of type String. We can either infer types (using a sample of the data) or let the user specify the column types.
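The sample-based inference mentioned in the TODO could look roughly like the sketch below. It is not code from this PR: it is plain Python, the names `infer_type` and `infer_schema` are hypothetical, and it only distinguishes int, float, and str. It assumes the sample rows are already tokenized and all have the same number of fields.

```python
from typing import List, Sequence


def infer_type(values: Sequence[str]) -> type:
    """Return the narrowest of int, float, str that parses every sampled value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            # At least one value does not parse; try the next, wider type.
            continue
    return str


def infer_schema(sample_rows: Sequence[Sequence[str]]) -> List[type]:
    """Infer one type per column from a small sample of tokenized rows."""
    columns = list(zip(*sample_rows))  # transpose rows into columns
    return [infer_type(column) for column in columns]


sample = [["1", "2.5", "a"], ["3", "4", "b"]]
schema = infer_schema(sample)
```

A column falls back to str as soon as one sampled value fails to parse, so a sample that misses a malformed value would still yield a schema that fails later on the full data.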

AmplabJenkins · 2014-07-10T02:36:14Z

Merged build triggered.

AmplabJenkins · 2014-07-10T02:36:21Z

Merged build started.

SparkQA · 2014-07-10T02:37:40Z

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16487/consoleFull

SparkQA · 2014-07-10T02:39:28Z

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16487/consoleFull

AmplabJenkins · 2014-07-10T02:39:33Z

Merged build finished.

AmplabJenkins · 2014-07-10T02:39:33Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16487/

AmplabJenkins · 2014-07-10T06:11:13Z

Merged build triggered.

AmplabJenkins · 2014-07-10T06:11:22Z

Merged build started.

SparkQA · 2014-07-10T06:12:30Z

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16490/consoleFull

rxin · 2014-07-10T06:36:54Z

sql/core/src/main/scala/org/apache/spark/sql/csv/CsvRDD.scala

is it ever possible for "quote" to be something that's longer than 1 char?

It is. If you look into CsvTokenizer you see I have assumed the general case (similar to delimiter). I will add unit tests with more than one character quotes to show it.

What are the examples of quotes that are more than 1 char long?

One common one is '' (two single quote characters)

Do people actually use this?

SparkQA · 2014-07-10T07:50:35Z

QA results for PR 1351:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16490/consoleFull

AmplabJenkins · 2014-07-10T07:50:40Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-07-10T07:50:41Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16490/

rxin · 2014-07-10T07:54:21Z

sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala

textFile creates an RDD of lines. This actually doesn't work if a record contains new lines (inside quotes).

To do this properly, we would either need a new input format that handles CSV line splits, or assemble the lines back from textFile.

CsvTokenizer takes an Iterator[String] and implements Iterator[Array[Any]]. Its next() may end up reading two or more lines if it needs to (e.g., when a quoted field spans multiple lines). This assumes quoted fields are not split between partitions (noted in API documentation).
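The idea of a tokenizer that pulls extra lines from the underlying iterator can be sketched as follows. This is an illustrative Python version, not the PR's Scala CsvTokenizer: the function name `csv_records` and its parameters are assumptions, it treats a doubled quote inside a quoted field as an escaped quote (as in RFC 4180), and it accepts multi-character delimiters and quotes. Like the approach described above, it only works when a whole record is available from a single iterator, i.e. quoted fields are not split between partitions.

```python
from typing import Iterable, Iterator, List


def csv_records(lines: Iterable[str], delimiter: str = ",",
                quote: str = '"') -> Iterator[List[str]]:
    """Yield one list of fields per record, joining lines inside quoted fields."""
    it = iter(lines)
    for line in it:
        fields: List[str] = []
        buf: List[str] = []
        in_quotes = False
        i = 0
        while True:
            if i >= len(line):
                if not in_quotes:
                    break  # end of line outside quotes: the record is complete
                # The quoted field continues, so consume the next line too.
                nxt = next(it, None)
                if nxt is None:
                    raise ValueError("unterminated quoted field")
                buf.append("\n")
                line, i = nxt, 0
                continue
            if in_quotes:
                if line.startswith(quote + quote, i):
                    buf.append(quote)  # doubled quote is an escaped quote
                    i += 2 * len(quote)
                elif line.startswith(quote, i):
                    in_quotes = False  # closing quote
                    i += len(quote)
                else:
                    buf.append(line[i])
                    i += 1
            elif line.startswith(quote, i):
                in_quotes = True  # opening quote
                i += len(quote)
            elif line.startswith(delimiter, i):
                fields.append("".join(buf))
                buf = []
                i += len(delimiter)
            else:
                buf.append(line[i])
                i += 1
        fields.append("".join(buf))
        yield fields


rows = list(csv_records(['name,comment', 'alice,"hello', 'world"', 'bob,"say ""hi"""']))
```

Here four input lines produce three records, because the second and third lines belong to one quoted field. Passing `quote="''"` exercises the multi-character quote case discussed earlier in the review.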

SparkQA · 2014-07-10T21:32:49Z

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16525/consoleFull

SparkQA · 2014-07-15T07:42:51Z

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16668/consoleFull

SparkQA · 2014-07-15T09:13:35Z

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16668/consoleFull

marmbrus · 2014-07-15T18:39:45Z

Can you add [SQL] to the title please?

SparkQA · 2014-07-15T20:17:59Z

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16692/consoleFull

SparkQA · 2014-07-15T21:57:20Z

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16692/consoleFull

SparkQA · 2014-07-24T00:53:26Z

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17077/consoleFull

SparkQA · 2014-07-24T02:31:20Z

QA results for PR 1351:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17077/consoleFull

SparkQA · 2014-07-24T03:03:30Z

QA tests have started for PR 1351. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17090/consoleFull

SparkQA · 2014-07-24T04:44:38Z

QA results for PR 1351:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17090/consoleFull

erikerlandson · 2014-08-22T20:16:29Z

@falaki the reading of the CSV header might be made lazy, using something similar to:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/

marmbrus · 2014-08-25T18:20:37Z

@erikerlandson, making more things lazy would be great. Do you plan to contribute PromiseRDD to Spark core?

erikerlandson · 2014-08-25T18:44:23Z

Hi @marmbrus, I submitted PromiseRDD as part of PR #1839, which is awaiting review.

There is another somewhat different approach I used for sortByKey on PR #1689. My intuition is that using something like PromiseRDD is preferable when possible, because it respects the RDD computation model. (I do not think it is possible for sortByKey, because in that case the sampling job is required for initializing the partitions themselves)

These are under an umbrella ticket:
https://issues.apache.org/jira/browse/SPARK-2992

davies · 2014-09-05T21:37:58Z

python/pyspark/sql.py

you could use '"' here without escape

Also no space between name of argument and default value, such as header=False
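For illustration, the two style points might look like this in Python. The function `csv_options` and its parameters are hypothetical and are not the signature used in python/pyspark/sql.py; the sketch only shows the quoting and spacing conventions.

```python
def csv_options(path, delimiter=',', quote='"', header=False):
    """Hypothetical helper: collect CSV options into a dict.

    A double quote inside a single-quoted string needs no backslash,
    and keyword defaults are written without spaces around '='.
    """
    return {"path": path, "delimiter": delimiter, "quote": quote, "header": header}


options = csv_options("people.csv")
```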

JoshRosen · 2014-11-07T22:13:47Z

@marmbrus commented on this on JIRA:

Hey Hossein, I'm going to close this since I think we have decided this feature would work best as a separate library using the new Data Source API.

As a result, do you mind closing this PR? Thanks!

falaki added 6 commits July 7, 2014 19:12

Basic version of csv parsing

177eb06

RFC 4180 compatible tokenizer

510df2e

Added API documentation

30a5ae5

Added unit tests

ac95fcb

Organized imports

95a7a1a

Style cleanup

65f7e95

falaki added 2 commits July 9, 2014 23:08

Added Java API

b5eae31

Style

44fe059

rxin reviewed Jul 10, 2014

falaki added 2 commits July 10, 2014 14:28

Added python bindings

70b6018

Applied style comments

1409e44

Updating tests

7d89e5e

falaki changed the title ~~[WIP] SPARK-2360: CSV import to SchemaRDDs~~ [WIP][SQL] SPARK-2360: CSV import to SchemaRDDs Jul 15, 2014

Fixed python test

143bfc1

falaki added 3 commits July 23, 2014 15:26

Merge branch 'master' into csv

05f5089

Using option for schema

6a2487b

Overloaded methods

829a5af

Fixed python test

11e6422

chutium mentioned this pull request Jul 30, 2014

[SPARK-2729] [SQL] Forgot to match Timestamp type in ColumnBuilder #1636

Closed

marmbrus mentioned this pull request Aug 25, 2014

[SPARK-3205] add EscapedTextInputFormat #2118

Closed

quasiben mentioned this pull request Sep 4, 2014

CSV Headers with Spark blaze/blaze#593

Closed

davies reviewed Sep 5, 2014

falaki closed this Nov 11, 2014
