[SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document #20666
Conversation
cc @cloud-fan @HyukjinKwon To keep the CSV reader behavior for corrupted records, we don't bother refactoring. But we should update the document and explicitly disable partial results for corrupted records.
Test build #87641 has finished for PR 20666 at commit

retest this please.
    * in an user-defined schema. If a schema does not have the field, it drops corrupt records
    * during parsing. When a length of parsed CSV tokens is shorter than an expected length
    * of a schema, it sets `null` for extra fields.</li>
    * during parsing. It supports partial result for the records just with less or more tokens
I think there are the same instances to update in DataStreamReader, readwriter.py and streaming.py too.
Yes. Will update accordingly.
Test build #87642 has finished for PR 20666 at commit

Test build #87644 has finished for PR 20666 at commit
HyukjinKwon left a comment:
LGTM
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    It supports partial result for the records just with less or more tokens \
    than the schema. When it meets a malformed record whose parsed tokens is \
How about changing "a malformed record whose parsed tokens is" -> "a malformed record having the length of parsed tokens shorter than the length of a schema"?
Ok.
    fields to ``null``. To keep corrupt records, an user can set a string type \
    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
I think we should say it implicitly adds ... if a corrupted record is found while we are here? I think it only adds `columnNameOfCorruptRecord` when it meets a corrupted record during schema inference.
When users set a string type field named columnNameOfCorruptRecord in a user-defined schema, even with no corrupted records, I think the field is still added. Or did I misread this sentence?
Ah, I thought this:

    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` field in an output schema.

describes schema inference, because it adds the columnNameOfCorruptRecord column if a malformed record was found during schema inference. I mean:

    scala> spark.read.json(Seq("""{"a": 1}""", """{"a":""").toDS).printSchema()
    root
     |-- _corrupt_record: string (nullable = true)
     |-- a: long (nullable = true)

    scala> spark.read.json(Seq("""{"a": 1}""").toDS).printSchema()
    root
     |-- a: long (nullable = true)

but yes, I think I misread it. Here we describe things mainly about malformed records already.
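A rough plain-Python analogue of the inference behavior shown in the Scala snippet above: the corrupt-record column appears in the inferred schema only when at least one malformed record is found during inference. This is a hypothetical sketch (the `infer_schema` helper is invented for illustration), not Spark code.

```python
import json

def infer_schema(lines, corrupt_col="_corrupt_record"):
    """Collect field names from JSON lines; prepend the corrupt-record
    column only if some line fails to parse, mimicking the behavior
    discussed above."""
    fields, saw_malformed = set(), False
    for line in lines:
        try:
            fields.update(json.loads(line).keys())
        except json.JSONDecodeError:
            saw_malformed = True
    schema = sorted(fields)
    return ([corrupt_col] + schema) if saw_malformed else schema

print(infer_schema(['{"a": 1}', '{"a":']))  # ['_corrupt_record', 'a']
print(infer_schema(['{"a": 1}']))           # ['a']
```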
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    field in an output schema. It doesn't support partial results. Even just one \
It's trivial, but how about we avoid an abbreviation like "doesn't"? It's usually what I do for docs, although I am not sure if it actually matters.
Ok.
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    field in an output schema. It does not support partial results. Even just one \
I think we can drop the last sentence. The doc is pretty clear, saying "and sets other fields to null".
    an expected length of a schema, it sets `null` for extra fields.
    * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
      into a field configured by ``columnNameOfCorruptRecord``, and sets other \
      fields to ``null``. To keep corrupt records, an user can set a string type \
We can't say "and sets other fields to null", as it's not the case for CSV.
I think this is talking about a corrupted record, not a record with less/more tokens. If CSV parser fails to parse a record, all other fields are set to null.
Ah, I think we need to explain that: for CSV, a record with less/more tokens is not a malformed record.
Ok. Added.
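The distinction the reviewers settle on above can be sketched in plain Python. This is an illustrative model, not Spark's actual CSV parser: in PERMISSIVE mode, a record with fewer or more tokens than the schema is not malformed and yields a partial result, while a record the parser fails on sets every field to null. The helper name `permissive_csv_row` is hypothetical.

```python
def permissive_csv_row(tokens, schema_len, parse_ok=True):
    """Return a row of length schema_len following the PERMISSIVE-mode
    rules for CSV described in this thread."""
    if not parse_ok:
        # Truly malformed record: all fields become null.
        return [None] * schema_len
    if len(tokens) < schema_len:
        # Fewer tokens than the schema: pad missing fields with null.
        return tokens + [None] * (schema_len - len(tokens))
    # More tokens than the schema: drop the extras.
    return tokens[:schema_len]

print(permissive_csv_row(["a", "1"], 3))            # ['a', '1', None]
print(permissive_csv_row(["a", "1", "x", "y"], 3))  # ['a', '1', 'x']
print(permissive_csv_row(["garbage"], 3, parse_ok=False))  # [None, None, None]
```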
Test build #87663 has finished for PR 20666 at commit

Test build #87689 has finished for PR 20666 at commit
retest this please.

> On Tue, Feb 27, 2018, 1:43 PM UCB AMPLab wrote:
> Test FAILed.
> Refer to this link for build results (access rights to CI server needed):
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87689/
retest this please

Test build #87699 has finished for PR 20666 at commit
python/pyspark/sql/readwriter.py (outdated):

    fields to ``null``. To keep corrupt records, an user can set a string type \
    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    A record with less/more tokens than schema is not a corrupted record. \
.. not a corrupted record to CSV.
python/pyspark/sql/readwriter.py (outdated):

    It supports partial result for such records. When it meets a record having \
    the length of parsed tokens shorter than the length of a schema, it sets \
    ``null`` for extra fields. When a length of tokens is longer than a schema, \
    it drops extra tokens.
When it meets a record having fewer tokens than the length of the schema, set ``null`` to extra fields.
When the record has more tokens than the length of the schema, it drops extra tokens.
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    A record with less/more tokens than schema is not a corrupted record. \
    It supports partial result for such records. When it meets a record having \
"It supports partial result for such records." — this doesn't look very useful; I think the following sentences explain this case well.
python/pyspark/sql/readwriter.py (outdated):

    ``columnNameOfCorruptRecord`` field in an output schema.
    * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
      into a field configured by ``columnNameOfCorruptRecord``, and sets other \
      fields to ``null``. It does not support partial results. To keep corrupt \
"It does not support partial results." — I think we don't need to mention this for JSON.
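For contrast with CSV, the JSON PERMISSIVE behavior being documented here can be modeled in plain Python: no partial results — a malformed record puts its raw text into the corrupt-record column and nulls every data field. This is an illustrative sketch with a hypothetical helper `permissive_json_row`, not Spark's implementation.

```python
import json

def permissive_json_row(line, fields, corrupt_col="_corrupt_record"):
    """Parse one JSON line under PERMISSIVE-mode rules as described in
    this thread: on failure, null all data fields and keep the raw text."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        row = {f: None for f in fields}
        row[corrupt_col] = line
        return row
    row = {f: obj.get(f) for f in fields}
    row[corrupt_col] = None
    return row

print(permissive_json_row('{"a": 1, "b": 2}', ["a", "b"]))
print(permissive_json_row('{"a": 1', ["a", "b"]))  # malformed: data fields null
```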
Test build #87711 has finished for PR 20666 at commit

Test build #87720 has finished for PR 20666 at commit
Force-pushed from fe260c9 to daa326d.
LGTM

Test build #87721 has finished for PR 20666 at commit

retest this please.

Test build #87735 has finished for PR 20666 at commit
## What changes were proposed in this pull request?

Clarify JSON and CSV reader behavior in document.

JSON doesn't support partial results for corrupted records.
CSV only supports partial results for the records with more or less tokens.

## How was this patch tested?

Pass existing tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20666 from viirya/SPARK-23448-2.

(cherry picked from commit b14993e)
Signed-off-by: hyukjinkwon <[email protected]>
Merged to master and branch-2.3.

Thanks @HyukjinKwon @cloud-fan!
Hi guys, I am a Spark user. Could I get that data back in Spark 2+? @viirya

To be specific, the JSON file (Sanity4.json) is this code:

Then in Spark 1.6 the result is root, but in Spark 2.2 the result is root
That's not related to this change. The issue itself seems to be a behaviour change between 1.6 and 2.x in treating an empty string as null or not for double and float, which is rather a corner case and which does look like an issue. Let me try to fix it while I'm here.
Hi, thanks for the help. So can I get some other suggestions? @HyukjinKwon Thanks a lot.