[SPARK-23723] New charset option for json datasource #20849
Conversation
… only in the test
 * per file</li>
 * <li>`charset` (by default it is not set): allows to forcibly set one of standard basic
 * or extended charsets for input jsons. For example UTF-8, UTF-16BE, UTF-32. If the charset
 * is not specified (by default), the charset is detected automatically.</li>
Should we document it on the write side too?
test("json in UTF-16 with BOM") {
  val fileName = "json-tests/utf16WithBOM.json"
  val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
  val jsonDF = spark.read.schema(schema)
Does schema inference work when multiLine is disabled?
No, because of the many empty strings produced by Hadoop's LineRecordReader. It will be fixed in separate PRs for SPARK-23725 and/or SPARK-23724. For now you have to specify a schema or use multiline mode as a temporary workaround.
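A minimal sketch of the two workarounds, assuming an existing SparkSession `spark` and the `charset` option proposed in this PR; the schema, encoding, and paths are illustrative:

```scala
import org.apache.spark.sql.types.{StringType, StructType}

// Workaround 1: supply an explicit schema so no inference over the raw lines is needed.
val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
val perLine = spark.read
  .schema(schema)
  .option("charset", "UTF-16BE")   // option proposed in this PR
  .json("path/to/utf16be-jsons")   // illustrative path

// Workaround 2: use multiline mode, where schema inference works.
val inferred = spark.read
  .option("multiLine", true)
  .option("charset", "UTF-16BE")
  .json("path/to/utf16be-jsons")
```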
I think you should have explained this in the PR description.
Test build #88341 has finished for PR 20849 at commit
Shall we add non-ASCII characters to the test resource files?
Does `charset` work with newlines?
@MaxGekk @HyukjinKwon What is the status of this PR?
I am against this, mainly per MaxGekk#1 (comment), unless there is a better way than rewriting it.
allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
multiLine=None, allowUnquotedControlChars=None):
multiLine=None, allowUnquotedControlChars=None, charset=None):
Shall we use `encoding` to be consistent with CSV? In CSV, `charset` has an alias `encoding`, added to follow pandas and R.
@HyukjinKwon I am working on a PR which includes the changes of this PR, plus recordDelimiter (in the flexible format), and forces a user to set the recordDelimiter option if charset is specified, as @cloud-fan suggested. Does that work for you?
I think the flexible format needs more feedback and review. How about we go this way with separate PRs?
I agree with that only to unblock #20849, because it solves a real problem: reading a folder with many JSON files in UTF-16BE (without BOM) in multiline mode. In that case, recordDelimiter (lineSep) is not required.
The PR doesn't solve any practical use cases because it doesn't address JSON Streaming and #20877 (comment). Also, it is useless for reading JSONs in a charset different from UTF-8 in per-line mode without this PR: #20849. I don't know what practical problem it actually solves. In your tests you check these delimiters: https://github.com/apache/spark/pull/20877/files#diff-fde14032b0e6ef8086461edf79a27c5dR2112 . Are those delimiters from real JSONs?
OK, it could come as a separate PR. The flexible format just leaves room for future extensions, nothing more. I would definitely like to discuss how you are going to extend lineSep in your PR #20877 in the future, to support JSON Streaming for example. If you don't have such a vision, I would prefer to block your PR.
It does. It allows many workarounds; for example, we can intentionally add a custom delimiter so that it can support multi-line-ish JSONs as they are, without extra parsing to make them inlined. Go and google CSV's case too.
Yes, see spark/python/pyspark/sql/readwriter.py line 333 at a9350d7.
Shall we expose …?
Why are you dragging an orthogonal thing into #20877? I don't think we would fail to make a decision on the flexible option; I guess we have plenty of time until 2.4.0. Even if we fail to make a decision on the flexible option, we can expose another option that supports the flexibility and forces unsetting … Is this flexible option also a part of your public release?
It works for me too.
No, it is not. Only … As a summary, let's merge your PR #20877. I will prepare a PR on top of your changes: remove the flexible format of lineSep + force users to set the line separator if charset is specified + …
When I was trying to remove the flexible format for lineSep (recordDelimiter), I ran into a problem: I cannot fix the test https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2071-L2081 . There is no combination of charset and lineSep that allows me to read the file. Here is the structure of the file: … The delimiter in hex is 0d 00 0a 00; basically it is … The first record is ignored because it contains a BOM, which … How does it work in the case of …? The first string is detected according to the BOM, and the BOM is removed from the result by Jackson. The second string is detected according to its chars as … So, if we don't support a lineSep format in which the sequence of bytes is expressed explicitly, we cannot read Unicode JSONs with a BOM in per-line mode.
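A small illustration (not from the PR) of the underlying point: a string-valued line separator only becomes a concrete byte delimiter once an encoding is chosen, so the same logical separator maps to different byte sequences:

```scala
import java.nio.charset.StandardCharsets

def hex(bytes: Array[Byte]): String = bytes.map(b => f"$b%02x").mkString(" ")

// The same logical separator produces different byte sequences per encoding.
println(hex("\r\n".getBytes(StandardCharsets.UTF_8)))     // 0d 0a
println(hex("\r\n".getBytes(StandardCharsets.UTF_16LE)))  // 0d 00 0a 00
println(hex("\r\n".getBytes(StandardCharsets.UTF_16BE)))  // 00 0d 00 0a
```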
Ironically, this file came from a customer: https://issues.apache.org/jira/browse/SPARK-23410 . And that's why we reverted Jackson's charset auto-detection: 129fd45. After all the changes (without lineSep in hex), we are not able to read it properly.
Please give me a few days to check your comments. I happen to be super busy for personal reasons for a while.
Please look at #20937.
@MaxGekk are you talking about a malformed JSON file which has multiple encodings inside it?
@cloud-fan It is a regular file in UTF-16 with BOM = … In such a way, if I set … So, the problem is that the lineSep option doesn't define the actual delimiter required to split the input text into lines. It just defines a string which requires a charset to convert it to the real delimiter (an array of bytes). The hex format proposed in my first PR solves the problem.
@MaxGekk, so to make it clear, it parses line by line correctly regardless of the BOM if we set …?
From a quick look and a wild guess, …
Let's make the point clear. There are two things: 1. line-by-line parsing, and 2. JSON parsing via Jackson. The test you pointed out still looks a bit weird, because Jackson is going to detect the encoding for each line, not for the whole file.
@HyukjinKwon I did an experiment on MaxGekk#2 and modified the test: only the second line is returned correctly: … In the case of … And you are right in the case where the encoding is … The … So, there are 2 (or 3) problems actually. Just in case: …
Thanks for thoughtfully testing it out, but I believe we can still go with #20937 if we whitelist supported encodings for now?
The last case seems to work by depending on Jackson (UTF-16 for the first line and UTF-16LE for the second) if we don't set … To be clear, I am not against the flexible format yet. I just want to solve the problem bit by bit.
What changes were proposed in this pull request?
I propose a new option for the JSON datasource which allows specifying the charset of input and output files. Here is an example of using the option:
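A minimal sketch of the read path, assuming an existing SparkSession `spark`; the file path and encoding are illustrative:

```scala
// Read JSON files whose bytes are encoded in UTF-16LE; without the option
// the charset is auto-detected (see below).
val people = spark.read
  .option("multiLine", true)
  .option("charset", "UTF-16LE")
  .json("path/to/utf16le-jsons")
```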
If the option is not specified, the charset auto-detection mechanism is used by default.
The option can also be used for saving datasets to JSON. Currently, Spark is able to save datasets into JSON files in the UTF-8 charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output JSONs via the charset option, like .option("charset", "UTF-16"). By default the output charset is still UTF-8 to keep backward compatibility.
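A corresponding sketch of the write path, assuming `people` is an existing DataFrame; the output path is illustrative:

```scala
// Write the data as JSON encoded in UTF-16 instead of the default UTF-8.
people.write
  .option("charset", "UTF-16")
  .json("path/to/output-utf16-jsons")
```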
How was this patch tested?
I added the following tests: