[SPARK-23723] New charset option for json datasource #20849
python/pyspark/sql/readwriter.py:

```diff
@@ -176,7 +176,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
              allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None,
              allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
              mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
-             multiLine=None, allowUnquotedControlChars=None):
+             multiLine=None, allowUnquotedControlChars=None, charset=None):
         """
         Loads JSON files and returns the results as a :class:`DataFrame`.
```
```diff
@@ -237,6 +237,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         :param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
                                           characters (ASCII characters with value less than 32,
                                           including tab and line feed characters) or not.
+        :param charset: standard charset name, for example UTF-8, UTF-16 and UTF-32. If None is
+                        set, the charset of input json will be detected automatically.

         >>> df1 = spark.read.json('python/test_support/sql/people.json')
         >>> df1.dtypes
```

Member: Can we have another test case with an encoding Jackson doesn't automatically detect too?
```diff
@@ -254,7 +256,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             allowBackslashEscapingAnyCharacter=allowBackslashEscapingAnyCharacter,
             mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat,
             timestampFormat=timestampFormat, multiLine=multiLine,
-            allowUnquotedControlChars=allowUnquotedControlChars)
+            allowUnquotedControlChars=allowUnquotedControlChars, charset=charset)
         if isinstance(path, basestring):
             path = [path]
         if type(path) == list:
```
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala:

```diff
@@ -85,6 +85,12 @@ private[sql] class JSONOptions(

   val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

+  /**
+   * Standard charset name. For example UTF-8, UTF-16 and UTF-32.
+   * If charset is not specified (None), it will be detected automatically.
+   */
+  val charset: Option[String] = parameters.get("charset")

   /** Sets config options on a Jackson [[JsonFactory]]. */
   def setJacksonOptions(factory: JsonFactory): Unit = {
     factory.configure(JsonParser.Feature.ALLOW_COMMENTS, allowComments)
```
Member: Does this detect the encoding for newlines too?

Member (author): Do you mean the encoding of the record/line delimiter? It depends on the mode. In multiline mode, Jackson is able to do that. In per-line mode, Hadoop's LineRecordReader can accept delimiters in any charset, but by default it splits the input by `\n`, `\r` and `\r\n`.

Member: Shall we fix that first in the Text datasource, since schema inference in JSON depends on the Text datasource? You are exposing an incomplete option now.

Member (author): Could you clarify this, please? It is not completely clear to me what you mean.

Member: See 8fb2a02. It uses the Text datasource to load lines when we infer the schema. If we want to fix encodings for newlines, the Text datasource needs to be fixed first, I believe.

Member: JSON's schema inference uses the Text datasource to separate the lines before we go through the Jackson parser, where the charset for newlines should be respected. Shouldn't we rather fix the Text datasource and Hadoop's line reader first?

Member (author): A fix in the Hadoop line reader and this PR solve two different problems. A fix in the Hadoop line reader will not fix the problem of wrong encoding detection, so I don't understand why this PR must depend on such a fix. I would say a custom record separator would solve the newline problem too (https://issues.apache.org/jira/browse/SPARK-23724). Could you tell me how this PR blocks solving the problem in Hadoop's LineReader?

Member (author): Because the exposed option is incomplete without it. Why don't we just fix that problem first if you plan to fix both eventually anyway?

Member (author): I would like to introduce changes step-by-step to avoid creating a "patch bomb". Regarding SPARK-23724, I am going to propose this PR: MaxGekk#1 /cc @hvanhovell

Member: Yea, step-by-step. I am not suggesting fixing it all here. I meant SPARK-23724 should be fixed first, in the Text datasource.
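To make the newline concern in this thread concrete: in UTF-16 a line feed is a two-byte code unit, so a line reader that scans for the bare byte `0x0a` splits the stream mid code unit and strands a `0x00` byte on a neighbouring record. A quick illustration in plain Python (no Spark required):

```python
line = u'{"a": 1}\n'

# UTF-8: '\n' is the single byte 0x0a, which a byte-oriented reader finds cleanly.
print(line.encode("utf-8").hex())
# 7b2261223a20317d0a

# UTF-16BE: every character is two bytes; '\n' is 00 0a, so splitting on the
# bare 0x0a leaves a stray 0x00 attached to the line content.
print(line.encode("utf-16-be").hex())
# 007b002200610022003a00200031007d000a
```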
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:

```diff
@@ -366,6 +366,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
    * <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
    * per file</li>
+   * <li>`charset` (by default it is not set): allows to forcibly set one of standard basic
+   * or extended charsets for input jsons. For example UTF-8, UTF-16BE, UTF-32. If the charset
+   * is not specified (by default), the charset is detected automatically.</li>
    * </ul>
    *
    * @since 2.0.0
```

Member: Should we document it on the write side too?
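The option documented here is also reachable through the generic `option` API rather than the `charset` keyword shown earlier. A minimal usage sketch in Python (the input path and encoding are assumptions, and `spark` is an existing SparkSession; the Scala API takes the same options):

```python
df = (spark.read
      .option("charset", "UTF-16BE")  # skip auto-detection and force the charset
      .option("multiLine", True)      # multiline mode, where Jackson also handles newlines
      .json("/data/events_utf16be.json"))
```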
Member: Shall we use `encoding` to be consistent with CSV? `charset` had an alias `encoding` to look after Pandas and R.