[SPARK-23724][SQL] Record delimiter for the json datasource reader #1
```diff
@@ -32,7 +32,10 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
  * in that file.
  */
 class HadoopFileLinesReader(
-    file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
+    file: PartitionedFile,
+    conf: Configuration,
+    recordDelimiter: Option[Array[Byte]] = None
+  ) extends Iterator[Text] with Closeable {
   private val iterator = {
     val fileSplit = new FileSplit(
       new Path(new URI(file.filePath)),
```
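With the new optional parameter, callers can pass a byte-level record delimiter. The following is a rough sketch of a possible call site, not code from this PR: the helper name, the UTF-8 assumption, and the `userSeparator` option are all hypothetical.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.execution.datasources.{HadoopFileLinesReader, PartitionedFile}

// Hypothetical call site: turn a user-supplied record separator, if any,
// into the byte array the reader now accepts.
def openLinesReader(
    file: PartitionedFile,
    conf: Configuration,
    userSeparator: Option[String]): HadoopFileLinesReader = {
  // Assumption for this sketch: the separator is encoded as UTF-8; a real
  // data source would use whatever charset option the user supplied.
  val delimiterBytes = userSeparator.map(_.getBytes(StandardCharsets.UTF_8))
  new HadoopFileLinesReader(file, conf, delimiterBytes)
}
```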
```diff
@@ -42,7 +45,10 @@ class HadoopFileLinesReader(
       Array.empty)
     val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
     val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
-    val reader = new LineRecordReader()
+    val reader = recordDelimiter match {
+      case Some(delim) => new LineRecordReader(delim)
+      case _ => new LineRecordReader()
```
The problem here is, we can't cover only when … BTW, I am leaving all these comments because I walked through an almost identical approach a while ago locally, and I decided not to propose that change.

One more note: this was the end of my attempt. I just gave up with the conclusion that we need a complete rewrite of …
Collaborator
How about we force users to provide a custom `recordDelimiter` if a charset is specified?

I believe, arguably, the more common case is when …
Collaborator
How does CSV solve this problem?

CSV has the same issue; that option is incomplete. So I believe we currently support only ASCII-compatible encodings in CSV.

For example, see: … So I was thinking we should rather deprecate this option in CSV.

I had an offline discussion with @cloud-fan about #1 (comment). I am fine with that. There seems to be at least no hole.
```diff
+    }
     reader.initialize(fileSplit, hadoopAttemptContext)
     new RecordReaderIterator(reader)
   }
```
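To make the encoding concern in the thread above concrete, here is a small standalone sketch. It shows only standard Java charset behaviour, not code from this PR: the byte pattern of a newline depends on the charset, which is why the default `LineRecordReader` delimiter (a single `0x0A` byte) only matches ASCII-compatible encodings.

```scala
import java.nio.charset.Charset

// "\n" encoded under a few charsets. The default LineRecordReader splits on the
// single byte 0x0A, which only lines up with ASCII-compatible encodings.
Seq("UTF-8", "UTF-16LE", "UTF-16").foreach { name =>
  val bytes = "\n".getBytes(Charset.forName(name))
  val hex = bytes.map(b => f"$b%02X").mkString(" ")
  println(s"$name -> $hex")
}
// UTF-8    -> 0A
// UTF-16LE -> 0A 00
// UTF-16   -> FE FF 00 0A   (the encoder prepends a byte order mark)
```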
It's weird to assume the data is in the first column of the input row. How about …
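For context, a minimal hypothetical illustration of the pattern being questioned (not code from this PR): reading the payload from column 0 of every input row hard-codes an assumption about where the data lives in the schema.

```scala
import org.apache.spark.sql.Row

// Hypothetical sketch only: this silently assumes the text payload is always
// the first column of the incoming row.
def extractPayload(row: Row): String = row.getString(0)
```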