
Conversation

@MaxGekk MaxGekk commented Mar 17, 2018

What changes were proposed in this pull request?

I propose a new option for the JSON datasource which allows specifying the charset of input and output files. Here is an example of using the option:

spark.read.schema(schema)
  .option("multiline", "true")
  .option("charset", "UTF-16LE")
  .json(fileName)

If the option is not specified, the charset auto-detection mechanism is used by default.

The option can also be used for saving datasets to JSON files. Currently Spark is able to save datasets into JSON files in the UTF-8 charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output JSON via the charset option, for example .option("charset", "UTF-16"). By default the output charset is still UTF-8 to keep backward compatibility.
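A minimal write-side sketch based on the description above (df and outputPath are placeholders):

df.write
  .option("charset", "UTF-16BE")
  .json(outputPath)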

How was this patch tested?

I added the following tests:

  • reads a JSON file in the UTF-16 charset with BOM
  • reads a JSON file using charset auto-detection (UTF-32BE with BOM)
  • reads a JSON file using a user-specified charset (UTF-16LE)
  • saves in UTF-32BE and reads the result back with the standard library (not with Spark)
  • checks that the default charset is UTF-8
  • handles a wrong (unsupported) charset

MaxGekk added 30 commits March 17, 2018 12:39
* per file</li>
* <li>`charset` (by default it is not set): allows to forcibly set one of standard basic
* or extended charsets for input jsons. For example UTF-8, UTF-16BE, UTF-32. If the charset
* is not specified (by default), the charset is detected automatically.</li>
Member

Should we document this on the write side too?

test("json in UTF-16 with BOM") {
val fileName = "json-tests/utf16WithBOM.json"
val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
val jsonDF = spark.read.schema(schema)
Member

Does schema inference work when multiLine is disabled?

Member Author

No, because of the many empty strings produced by Hadoop's LineRecordReader. It will be fixed in separate PRs for SPARK-23725 and/or SPARK-23724. For now you have to specify a schema or use multiline mode as a temporary workaround.
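As a sketch of the workaround (path and schema are placeholders):

// either provide the schema explicitly in per-line mode
spark.read.schema(schema)
  .option("charset", "UTF-16LE")
  .json(path)

// or enable multiline mode, where schema inference works
spark.read
  .option("multiline", "true")
  .option("charset", "UTF-16LE")
  .json(path)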

Member

I think you should have explained this in the PR description.

SparkQA commented Mar 17, 2018

Test build #88341 has finished for PR 20849 at commit 961b482.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Rec(f1: String, f2: Int)

@HyukjinKwon
Member

Shall we add non-ASCII characters to the test resource files?

@HyukjinKwon
Member

Does charset work with newlines?

@gatorsmile
Member

@MaxGekk @HyukjinKwon What is the status of this PR?

HyukjinKwon commented Mar 26, 2018

I am against this, mainly because of MaxGekk#1 (comment), if there isn't a better way than rewriting it.
Also, I think we should support a charset option for the text datasource first, since the current option is incomplete (JSON's schema inference path depends on the text datasource).

allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
multiLine=None, allowUnquotedControlChars=None):
multiLine=None, allowUnquotedControlChars=None, charset=None):
Member

Shall we use encoding to be consistent with CSV? charset had encoding as an alias to look after Pandas and R.

MaxGekk commented Mar 27, 2018

@HyukjinKwon I am working on a PR which includes the changes of this PR, recordDelimiter (flexible format), plus forcing a user to set the recordDelimiter option if charset is specified, as @cloud-fan suggested. Does that work for you?

HyukjinKwon commented Mar 27, 2018

I think the flexible format needs more feedback and review. How about we go this way with separate PRs?

  1. [SPARK-23765][SQL] Supports custom line separator for json datasource #20877 to support line separator in json datasource
  2. json datasource with encoding option (forcing lineSep)
  3. flexible format PR with another review

MaxGekk commented Mar 28, 2018

@HyukjinKwon

How about we go this way with separate PRs?

I agree with that only to unblock #20849, because it solves a real problem: reading a folder with many JSON files in UTF-16BE (without BOM) in multiline mode. In this case, recordDelimiter (lineSep) is not required.

#20877 to support line separator in json datasource

The PR doesn't solve any practical use cases because it doesn't address JSON Streaming and #20877 (comment). Also it is useless for reading JSON in a charset different from UTF-8 in per-line mode without this PR: #20849. I don't know what practical problem it actually solves. In your tests you check these delimiters: https://github.com/apache/spark/pull/20877/files#diff-fde14032b0e6ef8086461edf79a27c5dR2112 . Are those delimiters from real JSON files?

json datasource with encoding option (forcing lineSep)

encoding? Only as an alias for charset. We are already using charset in our public release: https://docs.azuredatabricks.net/spark/latest/data-sources/read-json.html#charset-auto-detection . I will insist on the charset name for the option.

flexible format PR with another review

OK, it could come as a separate PR. The flexible format just leaves room for future extensions, nothing more. I would definitely discuss how you are going to extend lineSep in your PR #20877 in the future, to support JSON Streaming for example. If you don't have such a vision, I would prefer to block your PR.

/cc @gatorsmile @cloud-fan @hvanhovell @rxin

HyukjinKwon commented Mar 28, 2018

The PR doesn't solve any practical use cases

It does. It allows many workarounds. For example, we can intentionally add a custom delimiter so that it can support multiple-line-ish JSONs as they are, without extra parsing to make them inlined:

{
  "a": 1
}
|^|
{
  "b": 2
}
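With the lineSep option from #20877, such a file could then be read with something like this (a sketch; the exact separator value, including the surrounding newlines, is an assumption):

spark.read
  .option("lineSep", "\n|^|\n")
  .json(path)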

Go and google CSV's case too.

encoding? Only as an alias for charset.

Yes, encoding. This has higher priority over charset. See CSVOptions. Also, that's what we use in PySpark's CSV, isn't it?

def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None,
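For reference, the lookup in CSVOptions is roughly the following (paraphrased from memory, not a verbatim quote; parameters is the option map passed to the datasource):

import java.nio.charset.StandardCharsets

// encoding wins over charset; UTF-8 is the fallback
val charset = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))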

Shall we expose encoding and add charset as an alias?

I would definitely discuss how you are going to extend lineSep in your PR #20877 in the future, to support JSON Streaming for example. If you don't have such a vision, I would prefer to block your PR.

Why are you dragging an orthogonal thing into #20877? I don't think we will fail to make a decision on the flexible option; I guess we have enough time until 2.4.0.

Even if we fail to make a decision on the flexible option, we can expose another option that supports that flexibility and forces unsetting lineSep, can't we?

Is this flexible option also a part of your public release?

MaxGekk commented Mar 28, 2018

Shall we expose encoding and add charset as an alias?

It works for me too.

Is this flexible option also a part of your public release?

No, it is not. Only charset was exposed.

To summarize, let's merge your PR #20877. I will prepare a PR on top of your changes: remove the flexible format of lineSep + force users to set the line separator if charset is specified + encoding with charset as an alias + tests for non-UTF-8 lineSep. The flexible format of lineSep for text, CSV and JSON will come as a separate PR. @HyukjinKwon does that work for you?

MaxGekk commented Mar 28, 2018

While trying to remove the flexible format for lineSep (recordDelimiter), I ran into a problem. I cannot fix the test: https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2071-L2081

There is no combination of charset and lineSep that allows me to read the file. Here is the structure of the file:

BOM json_record1 delimiter json_record2 delimiter

The delimiter in hex is x0d 00 0a 00; basically it is \r\n in UTF-16LE. If I set:

.option("charset", "UTF-16LE").option("lineSep", "\r\n")

The first record is ignored because it contains a BOM, which UTF-16LE must not contain. As a result, I get only the second record. If I set UTF-16, I get the first record because it contains a BOM (as a UTF-16 string should), but the second is rejected.

How does it work in the case of .option("recordDelimiter", "x0d 00 0a 00") when charset is not specified? The answer is the charset auto-detection of jackson-json. Hadoop's LineRecordReader just splits the JSON by the delimiter and we have:

Seq("BOM json_record1", "json_record2")

The first string is detected according to its BOM, and the BOM is removed from the result by Jackson. The second string is detected according to its chars as UTF-16LE, and we get the correct result.

So, if we don't support a lineSep format in which the sequence of bytes is expressed explicitly, we cannot read Unicode JSON with a BOM in per-line mode.

MaxGekk commented Mar 28, 2018

Ironically, this file came from a customer: https://issues.apache.org/jira/browse/SPARK-23410 . And that's why we reverted Jackson's charset auto-detection: 129fd45. After all the changes (without lineSep in hex) we are not able to read it properly.

HyukjinKwon commented Mar 29, 2018

Please give me a few days to check your comments. I happen to be super busy for a while for a personal reason.

MaxGekk commented Mar 29, 2018

Please take a look at #20937.

@MaxGekk MaxGekk closed this Mar 29, 2018
@cloud-fan
Contributor

@MaxGekk are you talking about a malformed json file which has multiple encodings inside it?

MaxGekk commented Mar 30, 2018

@cloud-fan It is a regular file in UTF-16 with BOM = 0xFF 0xFE, which indicates little-endian byte order. When we slice the file into lines, the first line is still UTF-16 with a BOM, while the rest of the lines become UTF-16LE. To read the lines with the same Jackson settings, I used the charset auto-detection mechanism of the Jackson library. To do so I didn't specify any charset for the input stream, but after removing the hexadecimal representation of lineSep I must set a charset for the lineSep (\r\n or \u000d\u000a), otherwise it would not be possible to convert it to the byte array needed by Hadoop's LineReader.

This way, if I set UTF-16, I am able to read only the first line, but if I set UTF-16LE, the first line cannot be read because it contains a BOM (a UTF-16LE string must not contain any BOM).

So, the problem is that the lineSep option doesn't define the actual delimiter required to split the input text into lines. It just defines a string which requires a charset to convert it to the real delimiter (an array of bytes). The hex format proposed in my first PR solves the problem.
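To illustrate the conversion issue, here is a small sketch (not code from the PR) of how encoding the same string differs per charset; Java's UTF-16 encoder prepends a BOM:

import java.nio.charset.StandardCharsets

// 0x0D 0x00 0x0A 0x00: matches the delimiter bytes in the UTF-16LE file
"\r\n".getBytes(StandardCharsets.UTF_16LE)

// 0xFE 0xFF 0x00 0x0D 0x00 0x0A: BOM followed by big-endian \r\n, which never matches the file
"\r\n".getBytes(StandardCharsets.UTF_16)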

@HyukjinKwon
Member

@MaxGekk, so to make it clear: it splits lines correctly regardless of the BOM if we set lineSep + encoding, but it fails to parse each line as JSON via Jackson since we explicitly set UTF-16LE or UTF-16 for JSON parsing?

HyukjinKwon commented Mar 31, 2018

From a quick look and a wild guess, the UTF-16 case alone would be problematic because we are going to make the delimiter with the BOM bits: 0xFF 0xFE 0x0D 0x00 0x0A 0x00.

@HyukjinKwon
Member

Let's make the point clear. There are two things: 1. line-by-line parsing and 2. JSON parsing via Jackson.

The test you pointed out still looks a bit weird because Jackson is going to detect the encoding for each line, not for the whole file.

MaxGekk commented Mar 31, 2018

@HyukjinKwon I did an experiment on MaxGekk#2 and modified the test.
If UTF-16LE is set explicitly:

val jsonDF = spark.read.schema(schema)
      .option("lineSep", "x0d 00 0a 00")
      .option("encoding", "UTF-16LE")
      .json(testFile(fileName))

only the second line is returned correctly:

+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+

In the case of UTF-16, only the first row is returned from the file:

+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     null|    null|
+---------+--------+

And you are right: in the case where encoding is UTF-16, the BOM is added to the delimiter:

val jsonDF = spark.read.schema(schema)
      .option("lineSep", "\r\n")
      .option("encoding", "UTF-16")
      .json(testFile(fileName))

The lineSeparator parameter of HadoopFileLinesReader is 0xFE 0xFF 0x00 0x0D 0x00 0x0A, i.e. BOM + UTF-16BE (while the file has BOM + UTF-16LE). Even if we cut the BOM from lineSep, it will still not be correct.

So, there are 2 (or 3) problems actually.

Just in case:

   val jsonDF = spark.read.schema(schema)
      .option("lineSep", "\r\n")
      .option("encoding", "UTF-16LE")
      .json(testFile(fileName))
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+

@HyukjinKwon
Member

Thanks for thoroughly testing this out, but I believe we can still go with #20937 if we whitelist supported encodings for now?
If that's right and I understood correctly, let's move the discussion to #20937 (comment).

@HyukjinKwon
Member

The last case seems to work because Jackson detects the encoding per line (UTF-16 for the first line and UTF-16LE for the second) when we don't set encoding, but Jackson parses both lines as UTF-16LE when we do set encoding. Did I understand correctly?

To be clear, I am not against the flexible format yet. I just want to solve the problem bit by bit.
