[WIP][SQL] SPARK-2360: CSV import to SchemaRDDs #1351
Merged build triggered.
Merged build started.
QA tests have started for PR 1351. This patch merges cleanly.
QA results for PR 1351:
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16487/
Is it ever possible for "quote" to be something longer than one character?
It is. If you look into CsvTokenizer you'll see I have assumed the general case (similar to delimiter). I will add unit tests with quotes longer than one character to show it.
What are the examples of quotes that are more than 1 char long?
One common one is '' (two single quote characters)
Do people actually use this?
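As an illustration of the case being discussed, here is a minimal, hypothetical sketch (not the PR's actual CsvTokenizer) of splitting a single line when the quote marker is a String such as '' rather than a single Char:

```scala
// Hypothetical sketch: tokenizing one CSV line where both the delimiter
// and the quote marker are Strings that may be longer than one character.
object MultiCharQuote {
  def tokenize(line: String,
               delimiter: String = ",",
               quote: String = "''"): Seq[String] = {
    val tokens = scala.collection.mutable.ArrayBuffer.empty[String]
    val sb = new StringBuilder
    var i = 0
    var inQuoted = false
    while (i < line.length) {
      if (line.startsWith(quote, i)) {
        inQuoted = !inQuoted          // toggle quoted state on the full marker
        i += quote.length
      } else if (!inQuoted && line.startsWith(delimiter, i)) {
        tokens += sb.toString; sb.clear()   // end of field
        i += delimiter.length
      } else {
        sb += line.charAt(i); i += 1
      }
    }
    tokens += sb.toString
    tokens.toSeq
  }
}
```

The only change relative to a single-Char tokenizer is matching the quote with `startsWith` and advancing by `quote.length`.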
Merged build finished. All automated tests passed.
textFile creates an RDD of lines. This actually doesn't work if a record contains newlines (inside quotes).
To do this properly, we would either need a new input format that handles CSV line splits, or assemble the lines back from textFile.
CsvTokenizer takes an Iterator[String] and implements Iterator[Array[Any]]. Its next() may end up reading two or more lines if it needs to (e.g., when a quoted field spans multiple lines). This assumes quoted fields are not split between partitions (noted in the API documentation).
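The multi-line behavior described here can be sketched in plain Scala (this is an illustration of the idea, not the PR's actual CsvTokenizer) as an iterator that keeps consuming underlying lines while a single-character quote is left unclosed:

```scala
// Hypothetical sketch: reassemble logical CSV records from an iterator of
// physical lines. An odd number of quote characters seen so far means a
// quoted field is still open, so the record continues on the next line.
class RecordIterator(lines: Iterator[String], quote: Char = '"')
    extends Iterator[String] {
  def hasNext: Boolean = lines.hasNext
  def next(): String = {
    val sb = new StringBuilder(lines.next())
    while (sb.count(_ == quote) % 2 == 1 && lines.hasNext) {
      sb.append('\n').append(lines.next())   // field spans a line break
    }
    sb.toString
  }
}
```

This naive quote-counting does not handle escaped quotes inside fields, but it shows why next() may pull several lines from the underlying iterator.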
Can you add [SQL] to the title please?
@falaki the reading of the CSV header might be made lazy, using something similar to:

@erikerlandson, making more things lazy would be great. Do you plan to contribute PromiseRDD to Spark core?

Hi @marmbrus, I submitted PromiseRDD as part of PR #1839, which is awaiting review. There is another, somewhat different approach I used for sortByKey in PR #1689. My intuition is that using something like PromiseRDD is preferable when possible, because it respects the RDD computation model. (I do not think it is possible for sortByKey, because in that case the sampling job is required for initializing the partitions themselves.) These are under an umbrella ticket:
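For illustration only, the "lazy header" idea can be sketched in plain Scala with a lazy val; here readFirstLine is a hypothetical stand-in for whatever job actually fetches the first record of the file:

```scala
// Hypothetical sketch: defer reading the CSV header until the schema is
// actually needed. No I/O (here, no call to readFirstLine) happens until
// `header` is first accessed.
class CsvRelation(readFirstLine: () => String, delimiter: String = ",") {
  lazy val header: Array[String] = readFirstLine().split(delimiter)
}
```

The same deferral is what PromiseRDD aims to provide inside the RDD computation model itself, rather than via eager driver-side reads.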
you could use '"' here without escape
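For reference, in Scala a double-quote character literal is valid without an escape; only the escaped form is redundant, not wrong:

```scala
// Both literals denote the same Char; the backslash in the second is redundant.
object QuoteLiteral {
  val q1: Char = '"'   // no escape needed inside single quotes
  val q2: Char = '\"'  // equivalent escaped form
}
```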
Also, there should be no space between the argument name and its default value, e.g. header=False.
Implements RFC 4180 for parsing comma-separated value files into SchemaRDDs, which includes:
Adds two new methods to SQLContext.
TODO: