Conversation

@NathanHowell NathanHowell commented Dec 23, 2016

What changes were proposed in this pull request?

If the new option wholeFile is set to true, the JSON reader will parse each file (instead of each line) as a single value. This is done with Jackson streaming, so it should be capable of parsing very large documents, assuming the resulting row fits in memory.
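
For illustration, a minimal usage sketch (the option name is as proposed here; paths are placeholders):

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// With wholeFile enabled, each file under the path is parsed as one JSON value.
val wholeFileDF = spark.read
  .option("wholeFile", "true")
  .json("/path/to/json/files")

// Leaving the option unset (or false) keeps the existing line-per-record behaviour.
val lineDF = spark.read.json("/path/to/json-lines")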

Because the file is not buffered in memory, the corrupt record handling is also slightly different when wholeFile is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.

These changes also allow types other than String to be parsed. Support for UTF8String and Text has been added (alongside String and InputFormat), and these no longer require a conversion to String just for parsing.

I've also included a few other changes that generate slightly better bytecode and (IMO) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one.

How was this patch tested?

New and existing unit tests. No performance or load tests have been run.

@NathanHowell
Author

Hello recent JacksonGenerator.scala committers, please take a look.

cc/ @rxin @hvanhovell @clockfly @HyukjinKwon @cloud-fan

SparkQA commented Dec 23, 2016

Test build #70531 has finished for PR 16386 at commit 7ad5d5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

we need to add this to the end; otherwise it breaks compatibility for positional arguments.

Member

HyukjinKwon commented Dec 23, 2016

the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure

I am worried about changing the behaviour. I understand why it had to be done this way here, as described in the description, but we have the input_file_name function for this. Also, I would not expect, at least, to find file names in _corrupt_record.

If this is acceptable, we need to document it around spark.sql.columnNameOfCorruptRecord in SQLConf and around columnNameOfCorruptRecord in the Python and Scala readers/writers.

Member

It seems this changes the existing behaviour (not allowing an empty schema).

Author

Yes, it was a regression that caused a test failure.

Member

+1

Contributor

what does >: Null mean?

Author

It states that R must be a nullable type. This enables null: R to compile and is preferable to the runtime cast null.asInstanceOf[R] because it is verified at compile time.
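
For illustration, a minimal sketch (not the PR's code) of the lower bound in action:

// R >: Null means R must be a supertype of Null, i.e. a nullable reference type,
// so `null: R` type-checks instead of needing the runtime cast null.asInstanceOf[R].
def firstOrNull[R >: Null](values: Seq[R]): R =
  values.headOption.getOrElse(null: R)

firstOrNull(Seq("a", "b"))       // "a"
firstOrNull(Seq.empty[String])   // null
// firstOrNull[Int](Seq(1, 2))   // does not compile: Int does not conform to R >: Null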

Member

@HyukjinKwon HyukjinKwon Feb 9, 2017

Yes, I said +1 because it explicitly expresses that the type should be nullable, and I assumed (I did not check the bytecode myself, so I might be wrong) that it gives a hint to the compiler because Null is nullable (I remember googling and playing with some references for several days when I was investigating another null-related issue).

I am not confident enough to propose such changes in my own PRs, but I am confident enough to say +1, at least as an expression of support for this idea.

Member

I think I disagree with passing the whole SparkSession, because apparently we only need SQLConf or the value of spark.sql.columnNameOfCorruptRecord.

Author

I just removed the method entirely since all it did was fetch the value of columnNameOfCorruptRecord.

Member

srowen commented Dec 23, 2016

Why do we need this at all? Just use wholeTextFiles and parse the files as JSON.

@NathanHowell
Author

@srowen It is functionally the same as what you're suggesting. The question is how (or if) it should be first class in the DataFrameReader API. If we agree that it should be exposed, either via a new FileFormat or an option to JsonFileFormat, some abstraction is necessary to support reading from different RDD classes.

This PR just pushes that boundary a little further and lets the inference and parser code work over more types, not just String. This may make parsing more efficient in the line-oriented codepath by avoiding a conversion from Text and UTF8String (in JsonToStruct) to String, and it also lets us parse an InputStream without requiring all of the data to be in memory. For small files it's not likely to have a benefit (if the file is smaller than 4k it will be read entirely anyway), but as the file size increases this reduces the amount of memory required for parsing, is friendlier (in theory) on the GC and lets us consume files larger than 2GB.
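
To sketch the input-type point (illustrative only, not this PR's code): Jackson's JsonFactory can build a streaming parser directly from a String or an InputStream, so the whole-file path can stream from something like a PortableDataStream instead of materializing the file as a String first.

import java.io.InputStream
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

val factory = new JsonFactory()

// The line-oriented path parses an in-memory String; the whole-file path can hand
// Jackson an InputStream and let it stream through the document.
def parserFor(input: Either[String, InputStream]): JsonParser = input match {
  case Left(text)    => factory.createParser(text)
  case Right(stream) => factory.createParser(stream)
}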

SparkQA commented Dec 27, 2016

Test build #70644 has finished for PR 16386 at commit aa5a6db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell
Author

@HyukjinKwon I agree that overloading the corrupt record column is undesirable and F.input_file_name is a better way to fetch the filename. It would be nice to extend this concept further and provide new functions (like F.json_exception) to retrieve exceptions and their locations, and this would work for the base case (parsing a string) as well as wholeFile. Plumbing this type of change through appears to require thread-local storage (unfortunately) but otherwise doesn't look too bad.

The question then is what to put in the corrupt record column, if one is defined, when in wholeFile mode. To retain consistency with the string paths we should really put the entire file in the column. This is problematic for large files (>2GB) since Spark SQL doesn't have blob support... so the allocations will fail (along with the task) and there is no way for the end user to work around this limitation. Functions like substr are applied to byte arrays and not file streams. Perhaps it's good enough to issue a warning (along the lines of "don't define a corrupt record column in wholeFile mode") and hope for the best?
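
For example (a sketch assuming the default _corrupt_record column name and the wholeFile option from this PR), the filename can already be recovered without overloading the corrupt record column:

import org.apache.spark.sql.functions.{col, input_file_name}

val df = spark.read
  .option("wholeFile", "true")
  .json("/path/to/json/files")

// Files whose records failed to parse, without storing the filename in _corrupt_record.
val corruptFiles = df
  .filter(col("_corrupt_record").isNotNull)
  .select(input_file_name().as("file"))
  .distinct()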

@NathanHowell
Author

The tests failed for an unrelated reason, looks to be running out of heap space in SBT somewhere.

@HyukjinKwon
Member

Regarding only the comment #16386 (comment), I have a similar (rather combined) idea: provide another option, such as an optional corrupt file name column (meaning the column appears only when the user explicitly asks for it, for backwards compatibility), do not add a column via columnNameOfCorruptRecord in wholeFile mode (with proper documentation), and issue a warning message if columnNameOfCorruptRecord is set by the user in wholeFile mode. This is a bit of a complicated idea that might confuse users, though. I am not sure it is the best idea.

Author

NathanHowell commented Dec 29, 2016

@HyukjinKwon I just pushed a change that makes the corrupt record handling consistent: if a corrupt record column is defined it will always get the JSON text for failed records. If wholeFile is enabled a warning is emitted.

I think more discussion is needed to figure out the best way to handle corrupt records and exceptions, perhaps it can be shelved for now and we can pick it up later under another issue?

SparkQA commented Dec 30, 2016

Test build #70730 has finished for PR 16386 at commit 9dc084d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell
Author

Jenkins, retest this please

SparkQA commented Jan 10, 2017

Test build #71147 has finished for PR 16386 at commit 9dc084d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell
Author

Any other comments?

Member

sameeragarwal commented Feb 1, 2017

cc @gatorsmile can you please take a look too?

Contributor

There is no since tag on the other methods of this class.

Contributor

can we just make lazy val conf not private?

Author

This is a public class so I thought adding a since tag would benefit the documentation. If it's not desired I can certainly remove it.

As for making the lazy val public vs private: I'm following the style used already in the class. There are public get methods for each private field. I'm not partial to either approach but prefer to keep it consistent.

Contributor

SGTM, can you take a look at the other public methods in this class and add since tags for them? Otherwise it looks weird that only one method has a since tag...

Author

Done, pushed in f71a465cf07fb9c043b2ccd86fa57e8e8ea9dc00

Contributor

Why do we need this?

Author

Previously the JSONOptions instance was always passed around with a columnNameOfCorruptRecord value. This just makes it a field in JSONOptions instead, to put all options in one place. Since it's a required option, it made more sense to use a field instead of making an entry in the CaseInsensitiveMap.
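
To illustrate the shape of the change (an illustrative sketch, not the actual Spark class): a required setting becomes a constructor field, while optional settings still come out of the options map.

// Illustrative only.
class JsonOptionsSketch(
    parameters: Map[String, String],
    val columnNameOfCorruptRecord: String) {   // required, so it is a plain field

  val wholeFile: Boolean = parameters.get("wholeFile").exists(_.toBoolean)
}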

@cloud-fan
Contributor

Can we focus on supporting multi-line JSON in this PR? We can leave the improvements to new PRs; otherwise this PR is kind of hard to review.

@gatorsmile
Member

Sorry, I missed the ping. Will review it tonight.

@cloud-fan
Contributor

Hi @NathanHowell , do you have time to work on it? thanks!

Contributor

will it be more consistent if we return ByteBuffer.wrap(getBytes) here?

Author

It will allocate an extra object but would simplify the calling code... since it would be a short-lived allocation it's probably fine to do this.

Contributor

I don't think this makes the code more readable...

Author

This is needed to satisfy the type checker. The other approach is to explicitly specify the type in two locations: Try[java.lang.Long](...).getOrElse[java.lang.Long](...). I found explicit boxing to be more readable than the alternative.
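
For example (with hypothetical stand-ins for the real parsing expressions), the two alternatives look like this:

import scala.util.Try

def primaryParse(s: String): Long = s.toLong
def fallbackParse(s: String): Long = s.trim.toLong

// Spelling out the boxed type in both positions:
def parsedA(s: String): java.lang.Long =
  Try[java.lang.Long](primaryParse(s)).getOrElse[java.lang.Long](fallbackParse(s))

// Boxing explicitly and letting inference handle the rest:
def parsedB(s: String): java.lang.Long =
  Try(java.lang.Long.valueOf(primaryParse(s)))
    .getOrElse(java.lang.Long.valueOf(fallbackParse(s)))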

Contributor

why call it createBaseRddConf instead of createBaseRdd?

Author

Habit from working with languages that don't support overloading; I'll change this.

Contributor

looks like you need withTempPath

Contributor

Can we write the JSON string literal to a text file? It's hard to understand what's going on here...

Author

sure

SparkQA commented Feb 16, 2017

Test build #72975 has finished for PR 16386 at commit 7296f7e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

:param path: string represents path to the JSON dataset,
or RDD of Strings storing JSON objects.
:param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
:param wholeFile: parse one record, which may span multiple lines, per file. If None is
Contributor

The parameter docs come in the same order as the parameter list; let's move the wholeFile doc to the end.

:param path: string represents path to the JSON dataset,
or RDD of Strings storing JSON objects.
:param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
:param wholeFile: parse one record, which may span multiple lines, per file. If None is
Contributor

same here

*
* You can set the following JSON-specific options to deal with non-standard JSON files:
* <ul>
* <li>`wholeFile` (default `false`): parse one record, which may span multiple lines,
Contributor

please move it to the end

val columnNameOfCorruptRecord =
parsedOptions.columnNameOfCorruptRecord
.getOrElse(sparkSession.sessionState.conf.columnNameOfCorruptRecord)
val parsedOptions = new JSONOptions(extraOptions.toMap,
Contributor

nit: the style should be

new XXX(
  para1,
  para2,
  para3)

* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`wholeFile` (default `false`): parse one record, which may span multiple lines,
Contributor

same here

@cloud-fan
Contributor

LGTM if the tests pass. It would be good if you could also address #16386 (comment).


assert(jsonCopy.count === jsonDF.count)
val jsonCopySome = jsonCopy.selectExpr("string", "long", "boolean")
val jsonDFSome = jsonDF.selectExpr("string", "long", "boolean")
Member

Actually, it only covers three columns.

root
 |-- bigInteger: decimal(20,0) (nullable = true)
 |-- boolean: boolean (nullable = true)
 |-- double: double (nullable = true)
 |-- integer: long (nullable = true)
 |-- long: long (nullable = true)
 |-- null: string (nullable = true)
 |-- string: string (nullable = true)

root
 |-- bigInteger: decimal(20,0) (nullable = true)
 |-- boolean: boolean (nullable = true)
 |-- double: double (nullable = true)
 |-- integer: long (nullable = true)
 |-- long: long (nullable = true)
 |-- string: string (nullable = true)

@NathanHowell
Author

@cloud-fan When implementing tests for the other modes I've uncovered an existing bug in schema inference in DROPMALFORMED mode: https://issues.apache.org/jira/browse/SPARK-19641. Since it is not introduced in this set of patches I will open a new pull request once this one is merged. You can inspect the fix here: NathanHowell@e233fd0

SparkQA commented Feb 17, 2017

Test build #73029 has finished for PR 16386 at commit e323317.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 17, 2017

Test build #73030 has finished for PR 16386 at commit 58118f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 17, 2017

Test build #73032 has finished for PR 16386 at commit b801ab0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@cloud-fan cloud-fan left a comment

I left some more minor comments; please address them in your next PR.

def getPath(): String = path

@Since("2.2.0")
def getConfiguration: Configuration = conf
Contributor

nit: we should rename it to getConf, getConfiguration is too verbose.

logWarning(
s"""Enabling wholeFile mode and defining columnNameOfCorruptRecord may result
|in very large allocations or OutOfMemoryExceptions being raised.
|
Contributor

nit: unnecessary line

def parse[T](
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
Contributor

nit: recordToString?

val path = dir.getCanonicalPath
primitiveFieldAndType
.toDF("value")
.write
Contributor

shall we call .coalesce(1) to make sure we only write to a single file?

val path = dir.getCanonicalPath
primitiveFieldAndType
.toDF("value")
.write
Contributor

same here

sparkSession.sessionState.conf.sessionLocalTimeZone,
sparkSession.sessionState.conf.columnNameOfCorruptRecord)
JsonDataSource(parsedOptions).infer(
sparkSession, files, parsedOptions)
Contributor

we can merge it into the previous line

classOf[PortableDataStream],
conf,
sparkSession.sparkContext.defaultMinPartitions)
.setName(s"JsonFile: $name")
Contributor

nit:

new XXX(
  ...,
  ...).setName....

classOf[TextInputFormat],
classOf[LongWritable],
classOf[Text])
.setName(s"JsonLines: $name")
Contributor

nit:

newAPIHadoopRDD(
  ...,
  ...).setName....

F.count($"dummy").as("valid"),
F.count($"_corrupt_record").as("corrupt"),
F.count("*").as("count"))
checkAnswer(counts, Row(1, 4, 6))
Contributor

Why is count(*) 6?

test("SPARK-18352: Handle multi-line corrupt documents (PERMISSIVE)") {
withTempPath { dir =>
val path = dir.getCanonicalPath
val corruptRecordCount = additionalCorruptRecords.count().toInt
Contributor

The name is misleading; we do have a good record in this dataset, don't we?

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 21fde57 Feb 17, 2017
@NathanHowell NathanHowell deleted the SPARK-18352 branch February 17, 2017 05:23
Member

gatorsmile commented Jun 5, 2017

@NathanHowell It sounds like we can also provide multi-line support for JSON. For example, in a single JSON file:

{"a": 1,
"b": 1.1}
{"a": 2, "b": 1.1}
{"a": 3, "b": 1.1}

When using wholeFile mode, we only parse the first JSON record {"a": 1, "b": 1.1} and ignore the following records. It sounds like we should parse them too and rename wholeFile to multiLine?
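
For reference, a minimal sketch (not this PR's code) of how Jackson streaming can consume every top-level value in such a file instead of stopping after the first one:

import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

val parser = new JsonFactory().createParser(
  """{"a": 1,
    |"b": 1.1}
    |{"a": 2, "b": 1.1}
    |{"a": 3, "b": 1.1}""".stripMargin)

// nextToken() returns null at end of input; each START_OBJECT begins a new record.
var token = parser.nextToken()
var records = 0
while (token != null) {
  if (token == JsonToken.START_OBJECT) {
    parser.skipChildren()
    records += 1
  }
  token = parser.nextToken()
}
println(s"top-level records: $records") // 3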

Author

NathanHowell commented Jun 7, 2017 via email
