[SPARK-18352][SQL] Support parsing multiline json files #16386
Conversation
Hello recent JacksonGenerator.scala committers, please take a look.
Test build #70531 has finished for PR 16386 at commit
python/pyspark/sql/readwriter.py
Outdated
we need to add this to the end; otherwise it breaks compatibility for positional arguments.
I am worried about changing the behaviour. I understand why it had to be done like this here, as described in the description, but we need to document this.
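The concern is language-agnostic; a sketch in Scala (hypothetical signature, not the PR's code) of why new parameters must be appended:

```scala
// Existing callers may pass arguments positionally, e.g. json("/data", mySchema).
// Appending wholeFile after schema keeps those call sites working;
// inserting it before schema would break them.
def json(
    path: String,
    schema: StructType = null,
    wholeFile: Boolean = false): DataFrame = ???
```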
It seems this changes existing behaviour (not allowing empty schema).
Yes, it was a regression that caused a test failure.
+1
What does `>: Null` mean?
It states that `R` must be a nullable type. This enables `null: R` to compile and is preferable to the runtime cast `null.asInstanceOf[R]` because it is verified at compile time.
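A minimal sketch (hypothetical method, not from this PR) of what the lower bound buys:

```scala
// R >: Null restricts R to types that can hold null (reference types),
// so returning null type-checks at compile time; no asInstanceOf needed.
def firstOrNull[R >: Null](xs: Seq[R]): R =
  xs.headOption.getOrElse(null)

firstOrNull(Seq("a", "b"))      // "a"
firstOrNull(Seq.empty[String])  // null
```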
Yes, I said +1 because it explicitly expresses that the type should be nullable. I also assumed (without checking the bytecode myself, so I might be wrong) that it gives a hint to the compiler, since `Null` is a nullable type; I remember spending several days with references on this while investigating another null-related issue.
I am not confident enough to propose such changes in my own PRs, but I am comfortable enough to give a +1 to support the idea.
I think I disagree with passing the whole `SparkSession`, because apparently we only need `SQLConf` or the value of `spark.sql.columnNameOfCorruptRecord`.
I just removed the method entirely, since all it did was fetch the value of `columnNameOfCorruptRecord`.
Why do we need this at all? Just use …
@srowen It is functionally the same as what you're suggesting. The question is how (or if) it should be first class in the … This PR just pushes that boundary a little further and lets the inference and parser code work over more types, not just …
Test build #70644 has finished for PR 16386 at commit
@HyukjinKwon I agree that overloading the corrupt record column is undesirable, and … The question then is what to put in the corrupt record column, if one is defined, when in …
The tests failed for an unrelated reason; it looks like SBT is running out of heap space somewhere.
Regarding the comment #16386 (comment): I have a similar (or rather, combined) idea, that we optionally provide another option, such as the corrupt file name (meaning the column appears only when the user explicitly sets it, for backwards compatibility), and don't add a column by …
@HyukjinKwon I just pushed a change that makes the corrupt record handling consistent: if a corrupt record column is defined, it will always get the JSON text for failed records. If … I think more discussion is needed to figure out the best way to handle corrupt records and exceptions; perhaps it can be shelved for now and we can pick it up later under another issue?
Test build #70730 has finished for PR 16386 at commit
Jenkins, retest this please
Test build #71147 has finished for PR 16386 at commit
Any other comments?
cc @gatorsmile can you please take a look too?
There is no `@Since` tag in the other methods of this class.
Can we just make `lazy val conf` not private?
This is a public class, so I thought adding a `@Since` tag would benefit the documentation. If it's not desired I can certainly remove it.
As for making the lazy val public vs. private: I'm following the style already used in the class. There are public get methods for each private field. I'm not partial to either approach but prefer to keep it consistent.
SGTM. Can you take a look at the other public methods in this class and add a `@Since` tag for them? Otherwise it looks weird that only one method has one...
Done, pushed in f71a465cf07fb9c043b2ccd86fa57e8e8ea9dc00
Why do we need this?
Previously the `JSONOptions` instance was always passed around with a `columnNameOfCorruptRecord` value. This just makes it a field in `JSONOptions` instead, to put all options in one place. Since it's a required option, it made more sense to use a field instead of making an entry in the `CaseInsensitiveMap`.
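A minimal sketch (simplified shape, not the PR's exact code) of the refactoring described:

```scala
// The required option becomes a constructor field; optional settings still
// come from the case-insensitive parameter map.
class JSONOptions(
    parameters: Map[String, String],
    val columnNameOfCorruptRecord: String) {

  val wholeFile: Boolean =
    parameters.get("wholeFile").exists(_.toBoolean)
}
```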
Can we focus on supporting multiline JSON in this PR? We can leave the improvements for new PRs; otherwise this PR is kind of hard to review.
Sorry, I missed the ping. Will review it tonight.
Hi @NathanHowell, do you have time to work on it? Thanks!
Will it be more consistent if we return `ByteBuffer.wrap(getBytes)` here?
It will allocate an extra object but would simplify the calling code... Since it would be a short-lived allocation, it's probably fine to do this.
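A sketch (hypothetical accessor name) of the trade-off under discussion:

```scala
import java.nio.ByteBuffer

// ByteBuffer.wrap does not copy the array: it allocates only a small,
// short-lived wrapper, so callers get a uniform type almost for free.
def getByteBuffer(bytes: Array[Byte]): ByteBuffer =
  ByteBuffer.wrap(bytes)
```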
I don't think this makes the code more readable...
This is needed to satisfy the type checker. The other approach is to explicitly specify the type in two locations: `Try[java.lang.Long](...).getOrElse[java.lang.Long](...)`. I found explicit boxing to be more readable than the alternative.
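For illustration, a sketch (hypothetical input) of the two spellings:

```scala
import scala.util.Try

val text = "123"

// Box once explicitly and let inference handle the rest:
val a = Try(java.lang.Long.valueOf(text.toLong)).getOrElse(null)

// The alternative: repeat the boxed type at both call sites.
val b = Try[java.lang.Long](text.toLong).getOrElse[java.lang.Long](null)
```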
Why call it `createBaseRddConf` instead of `createBaseRdd`?
Habit from working with languages that don't support overloading; I'll change this.
Looks like you need `withTempPath`.
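For readers unfamiliar with the helper, a sketch of the suggested pattern (assuming the usual semantics of Spark's `SQLTestUtils.withTempPath`):

```scala
// withTempPath hands the test a fresh path and deletes it afterwards,
// so the test does not leak files between runs.
withTempPath { dir =>
  val path = dir.getCanonicalPath
  spark.range(3).write.json(path)
  assert(spark.read.json(path).count() === 3)
}
```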
Can we write the JSON string literal to a text file? It's hard to understand what's going on here...
sure
Force-pushed 9dc084d to 6f8b0c3.
Force-pushed b0a5cc8 to 7296f7e.
Test build #72975 has finished for PR 16386 at commit
python/pyspark/sql/readwriter.py
Outdated
:param path: string represents path to the JSON dataset,
    or RDD of Strings storing JSON objects.
:param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
:param wholeFile: parse one record, which may span multiple lines, per file. If None is
The parameter docs should follow the same order as the parameter list; let's move the `wholeFile` doc to the end.
python/pyspark/sql/streaming.py
Outdated
:param path: string represents path to the JSON dataset,
    or RDD of Strings storing JSON objects.
:param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
:param wholeFile: parse one record, which may span multiple lines, per file. If None is
same here
*
* You can set the following JSON-specific options to deal with non-standard JSON files:
* <ul>
* <li>`wholeFile` (default `false`): parse one record, which may span multiple lines,
please move it to the end
val columnNameOfCorruptRecord =
  parsedOptions.columnNameOfCorruptRecord
    .getOrElse(sparkSession.sessionState.conf.columnNameOfCorruptRecord)
val parsedOptions = new JSONOptions(extraOptions.toMap,
nit: the style should be
new XXX(
para1,
para2,
para3)
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`wholeFile` (default `false`): parse one record, which may span multiple lines,
same here
LGTM if the tests pass. It would be good if you can also address #16386 (comment).
assert(jsonCopy.count === jsonDF.count)
val jsonCopySome = jsonCopy.selectExpr("string", "long", "boolean")
val jsonDFSome = jsonDF.selectExpr("string", "long", "boolean")
Actually, it only covers three columns.
root
|-- bigInteger: decimal(20,0) (nullable = true)
|-- boolean: boolean (nullable = true)
|-- double: double (nullable = true)
|-- integer: long (nullable = true)
|-- long: long (nullable = true)
|-- null: string (nullable = true)
|-- string: string (nullable = true)
root
|-- bigInteger: decimal(20,0) (nullable = true)
|-- boolean: boolean (nullable = true)
|-- double: double (nullable = true)
|-- integer: long (nullable = true)
|-- long: long (nullable = true)
|-- string: string (nullable = true)
@cloud-fan When implementing tests for the other modes I've uncovered an existing bug in schema inference in …
Test build #73029 has finished for PR 16386 at commit
Test build #73030 has finished for PR 16386 at commit
Test build #73032 has finished for PR 16386 at commit
cloud-fan left a comment:
I left some more minor comments; please address them in your next PR.
def getPath(): String = path

@Since("2.2.0")
def getConfiguration: Configuration = conf
nit: we should rename it to `getConf`; `getConfiguration` is too verbose.
logWarning(
  s"""Enabling wholeFile mode and defining columnNameOfCorruptRecord may result
     |in very large allocations or OutOfMemoryExceptions being raised.
     |
nit: unnecessary line
def parse[T](
    record: T,
    createParser: (JsonFactory, T) => JsonParser,
    recordLiteral: T => UTF8String): Seq[InternalRow] = {
nit: recordToString?
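To illustrate the shape of this API, a sketch (hypothetical call sites, simplified from the excerpt above) of why `parse` is generic in `T`:

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

val factory = new JsonFactory()

// Per-line mode: each record is a String.
def stringParser(f: JsonFactory, record: String): JsonParser =
  f.createParser(record)

// Whole-file mode: records arrive as bytes and are streamed by Jackson,
// avoiding an up-front conversion to String just for parsing.
def bytesParser(f: JsonFactory, record: Array[Byte]): JsonParser =
  f.createParser(record)
```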
val path = dir.getCanonicalPath
primitiveFieldAndType
  .toDF("value")
  .write
Shall we call `.coalesce(1)` to make sure we only write to a single file?
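A sketch of how the suggestion would apply to the excerpt above (assuming the write ends in a text sink):

```scala
// coalesce(1) collapses the DataFrame to one partition so the writer
// produces a single output file, which a whole-file test expects.
primitiveFieldAndType
  .toDF("value")
  .coalesce(1)
  .write
  .text(path)
```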
val path = dir.getCanonicalPath
primitiveFieldAndType
  .toDF("value")
  .write
same here
sparkSession.sessionState.conf.sessionLocalTimeZone,
sparkSession.sessionState.conf.columnNameOfCorruptRecord)
JsonDataSource(parsedOptions).infer(
  sparkSession, files, parsedOptions)
we can merge it into the previous line
classOf[PortableDataStream],
conf,
sparkSession.sparkContext.defaultMinPartitions)
  .setName(s"JsonFile: $name")
nit:
new XXX(
...,
...).setName....
classOf[TextInputFormat],
classOf[LongWritable],
classOf[Text])
  .setName(s"JsonLines: $name")
nit:
newAPIHadoopRDD(
...,
...).setName....
F.count($"dummy").as("valid"),
F.count($"_corrupt_record").as("corrupt"),
F.count("*").as("count"))
checkAnswer(counts, Row(1, 4, 6))
Why is `count(*)` 6?
test("SPARK-18352: Handle multi-line corrupt documents (PERMISSIVE)") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    val corruptRecordCount = additionalCorruptRecords.count().toInt
The name is misleading; we do have a good record in this dataset, don't we?
thanks, merging to master!
@NathanHowell It sounds like we can also provide multi-line support for JSON too. For example, in a single JSON file:

{"a": 1,
"b": 1.1}
{"a": 2, "b": 1.1}
{"a": 3, "b": 1.1}

When using the `wholeFile` mode, we only parse the first JSON record {"a": 1, "b": 1.1} but ignore the following records. It sounds like we should also parse them too, and rename `wholeFile` to `multiLine`?
Yep, should be doable without too much effort.
What changes were proposed in this pull request?
If a new option `wholeFile` is set to `true`, the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming each row will fit in memory.

Because the file is not buffered in memory, the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.

These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` has been added (alongside `String` and `InputFormat`) and these no longer require a conversion to `String` just for parsing.

I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one.
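For illustration, a usage sketch of the option as described (path hypothetical; later discussion in this thread proposes renaming `wholeFile` to `multiLine`):

```scala
// Each file under the path is parsed as a single JSON value,
// so a record may span multiple lines.
val df = spark.read
  .option("wholeFile", true)
  .json("/data/events")

df.show()
```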
How was this patch tested?
New and existing unit tests. No performance or load tests have been run.