
Conversation

@MaxGekk MaxGekk commented Mar 17, 2018

What changes were proposed in this pull request?

I propose a new option for the JSON datasource which allows specifying the charset of input and output files. Here is an example of using the option:

spark.read.schema(schema)
  .option("multiline", "true")
  .option("charset", "UTF-16LE")
  .json(fileName)

If the option is not specified, the charset auto-detection mechanism is used by default.

The option can also be used for saving datasets to JSON files. Currently Spark is able to save datasets into JSON files in the UTF-8 charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output JSON via the charset option, for example .option("charset", "UTF-16"). By default the output charset is still UTF-8 to keep backward compatibility.
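A minimal write-side sketch based on the description above (df and outputPath are placeholders):

df.write
  .option("charset", "UTF-16BE")
  .json(outputPath)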

How was this patch tested?

I added the following tests:

  • reads a JSON file in the UTF-16 charset with BOM
  • reads a JSON file using charset auto-detection (UTF-32BE with BOM)
  • reads a JSON file using a user-specified charset (UTF-16LE)
  • saves in UTF-32BE and reads the result back with the standard library (not with Spark)
  • checks that the default charset is UTF-8
  • handles a wrong (unsupported) charset

MaxGekk added 30 commits March 17, 2018 12:39
* per file</li>
* <li>`charset` (by default it is not set): allows to forcibly set one of standard basic
* or extended charsets for input jsons. For example UTF-8, UTF-16BE, UTF-32. If the charset
* is not specified (by default), the charset is detected automatically.</li>
Member

Should we document this on the write side too?

test("json in UTF-16 with BOM") {
val fileName = "json-tests/utf16WithBOM.json"
val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
val jsonDF = spark.read.schema(schema)
Member

Does schema inference work when multiLine is disabled?

Member Author

No, because of the many empty strings produced by Hadoop's LineRecordReader. It will be fixed in separate PRs for SPARK-23725 and/or SPARK-23724. For now you have to specify a schema or use multiline mode as a temporary workaround.
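As a sketch of the workaround (path and schema are placeholders):

// either provide the schema explicitly in per-line mode
spark.read.schema(schema)
  .option("charset", "UTF-16LE")
  .json(path)

// or enable multiline mode, where schema inference works
spark.read
  .option("multiline", "true")
  .option("charset", "UTF-16LE")
  .json(path)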

Member

I think you should have explained this in the PR description.

SparkQA commented Mar 17, 2018

Test build #88341 has finished for PR 20849 at commit 961b482.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Rec(f1: String, f2: Int)

@HyukjinKwon
Member

Shall we add non-ASCII characters to the test resource files?

@HyukjinKwon
Member

Does charset work with newlines?

@gatorsmile
Member

@MaxGekk @HyukjinKwon What is the status of this PR?

HyukjinKwon commented Mar 26, 2018

I am against this, mainly because of MaxGekk#1 (comment), if there isn't a better way than rewriting it.
Also, I think we should support a charset option for the text datasource first, since the current option is incomplete (JSON's schema inference path depends on the text datasource).

allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
multiLine=None, allowUnquotedControlChars=None):
multiLine=None, allowUnquotedControlChars=None, charset=None):
Member

Shall we use encoding to be consistent with CSV? charset had encoding as an alias to look after Pandas and R.

MaxGekk commented Mar 27, 2018

@HyukjinKwon I am working on a PR which includes the changes of this PR, recordDelimiter (flexible format), plus forcing a user to set the recordDelimiter option if charset is specified, as @cloud-fan suggested. Does that work for you?

HyukjinKwon commented Mar 27, 2018

I think the flexible format needs more feedback and review. How about we go this way with separate PRs?

  1. [SPARK-23765][SQL] Supports custom line separator for json datasource #20877 to support line separator in json datasource
  2. json datasource with encoding option (forcing lineSep)
  3. flexible format PR with another review

MaxGekk commented Mar 28, 2018

@HyukjinKwon

How about we go this way with separate PRs?

I agree with that only to unblock #20849, because it solves a real problem: reading a folder with many JSON files in UTF-16BE (without BOM) in multiline mode. In this case, recordDelimiter (lineSep) is not required.

#20877 to support line separator in json datasource

The PR doesn't solve any practical use cases because it doesn't address JSON Streaming and #20877 (comment). Also it is useless for reading JSON in a charset different from UTF-8 in per-line mode without this PR: #20849. I don't know what practical problem it actually solves. In your tests you check these delimiters: https://github.com/apache/spark/pull/20877/files#diff-fde14032b0e6ef8086461edf79a27c5dR2112 . Are those delimiters from real JSON files?

json datasource with encoding option (forcing lineSep)

encoding? Only as an alias for charset. We are already using charset in our public release: https://docs.azuredatabricks.net/spark/latest/data-sources/read-json.html#charset-auto-detection . I will insist on the charset name for the option.

flexible format PR with another review

OK, it could come as a separate PR. The flexible format just leaves room for future extensions, nothing more. I would definitely discuss how you are going to extend lineSep in your PR #20877 in the future, to support JSON Streaming for example. If you don't have such a vision, I would prefer to block your PR.

/cc @gatorsmile @cloud-fan @hvanhovell @rxin

HyukjinKwon commented Mar 28, 2018

The PR doesn't solve any practical use cases

It does. It allows many workarounds. For example, we can intentionally add a custom delimiter so that it can support multiple-line-ish JSONs as they are, without extra parsing to make them inlined:

{
  "a": 1
}
|^|
{
  "b": 2
}
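With the lineSep option from #20877, such a file could then be read with something like this (a sketch; the exact separator value, including the surrounding newlines, is an assumption):

spark.read
  .option("lineSep", "\n|^|\n")
  .json(path)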

Go and google CSV's case too.

encoding? Only as an alias for charset.

Yes, encoding. This has higher priority over charset. See CSVOptions. Also, that's what we use in PySpark's CSV, isn't it?

def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None,
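For reference, the lookup in CSVOptions is roughly the following (paraphrased from memory, not a verbatim quote; parameters is the option map passed to the datasource):

import java.nio.charset.StandardCharsets

// encoding wins over charset; UTF-8 is the fallback
val charset = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))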

Shall we expose encoding and add charset as an alias?

I would definitely discuss how you are going to extend lineSep in your PR #20877 in the future, to support JSON Streaming for example. If you don't have such a vision, I would prefer to block your PR.

Why are you dragging an orthogonal thing into #20877? I don't think we will fail to make a decision on the flexible option; I guess we have enough time until 2.4.0.

Even if we fail to make a decision on the flexible option, we can expose another option that supports that flexibility and forces unsetting lineSep, can't we?

Is this flexible option also a part of your public release?

MaxGekk commented Mar 28, 2018

Shall we expose encoding and add charset as an alias?

It works for me too.

Is this flexible option also a part of your public release?

No, it is not. Only charset was exposed.

To summarize, let's merge your PR #20877. I will prepare a PR on top of your changes: remove the flexible format of lineSep + force users to set the line separator if charset is specified + encoding with charset as an alias + tests for non-UTF-8 lineSep. The flexible format of lineSep for text, CSV and JSON will come as a separate PR. @HyukjinKwon does that work for you?

MaxGekk commented Mar 28, 2018

While trying to remove the flexible format for lineSep (recordDelimiter), I ran into a problem. I cannot fix the test: https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2071-L2081

There is no combination of charset and lineSep that allows me to read the file. Here is the structure of the file:

BOM json_record1 delimiter json_record2 delimiter

The delimiter in hex is x0d 00 0a 00; basically it is \r\n in UTF-16LE. If I set:

.option("charset", "UTF-16LE").option("lineSep", "\r\n")

The first record is ignored because it contains a BOM, which UTF-16LE must not contain. As a result, I get only the second record. If I set UTF-16, I get the first record because it contains a BOM (as a UTF-16 string should), but the second is rejected.

How does it work in the case of .option("recordDelimiter", "x0d 00 0a 00") when charset is not specified? The answer is the charset auto-detection of jackson-json. Hadoop's LineRecordReader just splits the JSON by the delimiter and we have:

Seq("BOM json_record1", "json_record2")

The first string is detected according to its BOM, and the BOM is removed from the result by Jackson. The second string is detected according to its chars as UTF-16LE, and we get the correct result.

So, if we don't support a lineSep format in which the sequence of bytes is expressed explicitly, we cannot read Unicode JSON with a BOM in per-line mode.

MaxGekk commented Mar 28, 2018

Ironically, this file came from a customer: https://issues.apache.org/jira/browse/SPARK-23410 . And that's why we reverted Jackson's charset auto-detection: 129fd45. After all the changes (without lineSep in hex) we are not able to read it properly.

HyukjinKwon commented Mar 29, 2018

Please give me a few days to check your comments. I happen to be super busy for a while for a personal reason.

MaxGekk commented Mar 29, 2018

Please take a look at #20937.

@MaxGekk MaxGekk closed this Mar 29, 2018
@cloud-fan
Contributor

@MaxGekk are you talking about a malformed json file which has multiple encodings inside it?

MaxGekk commented Mar 30, 2018

@cloud-fan It is a regular file in UTF-16 with BOM = 0xFF 0xFE, which indicates little-endian byte order. When we slice the file into lines, the first line is still UTF-16 with a BOM, while the rest of the lines become UTF-16LE. To read the lines with the same Jackson settings, I used the charset auto-detection mechanism of the Jackson library. To do so I didn't specify any charset for the input stream, but after removing the hexadecimal representation of lineSep I must set a charset for the lineSep (\r\n or \u000d\u000a), otherwise it would not be possible to convert it to the byte array needed by Hadoop's LineReader.

This way, if I set UTF-16, I am able to read only the first line, but if I set UTF-16LE, the first line cannot be read because it contains a BOM (a UTF-16LE string must not contain any BOM).

So, the problem is that the lineSep option doesn't define the actual delimiter required to split the input text into lines. It just defines a string which requires a charset to convert it to the real delimiter (an array of bytes). The hex format proposed in my first PR solves the problem.
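To illustrate the conversion issue, here is a small sketch (not code from the PR) of how encoding the same string differs per charset; Java's UTF-16 encoder prepends a BOM:

import java.nio.charset.StandardCharsets

// 0x0D 0x00 0x0A 0x00: matches the delimiter bytes in the UTF-16LE file
"\r\n".getBytes(StandardCharsets.UTF_16LE)

// 0xFE 0xFF 0x00 0x0D 0x00 0x0A: BOM followed by big-endian \r\n, which never matches the file
"\r\n".getBytes(StandardCharsets.UTF_16)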

@HyukjinKwon
Member

@MaxGekk, so to make it clear: it splits lines correctly regardless of the BOM if we set lineSep + encoding, but it fails to parse each line as JSON via Jackson since we explicitly set UTF-16LE or UTF-16 for JSON parsing?

HyukjinKwon commented Mar 31, 2018

From a quick look and a wild guess, the UTF-16 case alone would be problematic because we are going to make the delimiter with the BOM bits: 0xFF 0xFE 0x0D 0x00 0x0A 0x00.

@HyukjinKwon
Member

Let's make the point clear. There are two things: 1. line-by-line parsing and 2. JSON parsing via Jackson.

The test you pointed out still looks a bit weird because Jackson is going to detect the encoding for each line, not for the whole file.

MaxGekk commented Mar 31, 2018

@HyukjinKwon I did an experiment on MaxGekk#2 and modified the test.
If UTF-16LE is set explicitly:

val jsonDF = spark.read.schema(schema)
      .option("lineSep", "x0d 00 0a 00")
      .option("encoding", "UTF-16LE")
      .json(testFile(fileName))

only the second line is returned correctly:

+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+

In the case of UTF-16, only the first row is returned from the file:

+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     null|    null|
+---------+--------+

And you are right: in the case where encoding is UTF-16, the BOM is added to the delimiter:

val jsonDF = spark.read.schema(schema)
      .option("lineSep", "\r\n")
      .option("encoding", "UTF-16")
      .json(testFile(fileName))

The lineSeparator parameter of HadoopFileLinesReader is 0xFE 0xFF 0x00 0x0D 0x00 0x0A, i.e. BOM + UTF-16BE (while the file has BOM + UTF-16LE). Even if we cut the BOM from lineSep, it will still not be correct.

So, there are 2 (or 3) problems actually.

Just in case:

   val jsonDF = spark.read.schema(schema)
      .option("lineSep", "\r\n")
      .option("encoding", "UTF-16LE")
      .json(testFile(fileName))
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+

@HyukjinKwon
Member

Thanks for thoroughly testing this out, but I believe we can still go with #20937 if we whitelist supported encodings for now?
If that's right and I understood correctly, let's move the discussion to #20937 (comment).

@HyukjinKwon
Member

The last case seems to work because Jackson detects the encoding per line (UTF-16 for the first line and UTF-16LE for the second) when we don't set encoding, but Jackson parses both lines as UTF-16LE when we do set encoding. Did I understand correctly?

To be clear, I am not against the flexible format yet. I just want to solve the problem bit by bit.
