
Conversation

@viirya
Member

@viirya viirya commented Feb 24, 2018

What changes were proposed in this pull request?

Clarify JSON and CSV reader behavior in the documentation.

JSON does not support partial results for corrupted records.
CSV only supports partial results for records with fewer or more tokens than the schema.
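
For reference, a minimal sketch of the behavior being clarified, runnable in spark-shell (the sample data and column names are made up for illustration, not taken from the patch):

import org.apache.spark.sql.types._
import spark.implicits._  // assumes an existing SparkSession named `spark`

// JSON: the second record is malformed, so no partial result is kept.
// Both data fields come back null and the raw text lands in `_corrupt_record`.
val jsonData = Seq("""{"a": 1, "b": 2}""", """{"a": 3, "b":""").toDS
spark.read.json(jsonData).show(false)

// CSV: the second record merely has fewer tokens than the schema.
// It is not treated as corrupt; the missing column is simply set to null.
val csvSchema = new StructType().add("a", IntegerType).add("b", IntegerType)
val csvData = Seq("1,2", "3").toDS
spark.read.schema(csvSchema).csv(csvData).show()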

How was this patch tested?

Pass existing tests.

@viirya
Member Author

viirya commented Feb 24, 2018

cc @cloud-fan @HyukjinKwon To keep the CSV reader behavior for corrupted records, we don't bother refactoring it. But we should update the document and explicitly disable partial results for corrupted records.

@SparkQA

SparkQA commented Feb 24, 2018

Test build #87641 has finished for PR 20666 at commit 4ad330b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Feb 24, 2018

retest this please.

* in an user-defined schema. If a schema does not have the field, it drops corrupt records
* during parsing. When a length of parsed CSV tokens is shorter than an expected length
* of a schema, it sets `null` for extra fields.</li>
* during parsing. It supports partial result for the records just with less or more tokens
Member

I think there are the same instances to update in DataStreamReader, readwriter.py and streaming.py too.

Member Author

Yes. Will update accordingly.

@SparkQA

SparkQA commented Feb 24, 2018

Test build #87642 has finished for PR 20666 at commit 4ad330b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 24, 2018

Test build #87644 has finished for PR 20666 at commit 4400cf2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM

field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
It supports partial result for the records just with less or more tokens \
than the schema. When it meets a malformed record whose parsed tokens is \
Member

How about changing "a malformed record whose parsed tokens is" to "a malformed record having the length of parsed tokens shorter than the length of a schema"?

Member Author

Ok.

fields to ``null``. To keep corrupt records, an user can set a string type \
field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
Member

I think we should say "it implicitly adds ... if a corrupted record is found" while we are here. I think it only adds `columnNameOfCorruptRecord` when it meets a corrupted record during schema inference.

Member Author

When users set a string type field named columnNameOfCorruptRecord in a user-defined schema, even with no corrupted record, I think the field is still added. Or did I misread this sentence?

Member

Ah I thought this:

When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` field in an output schema.

describes schema inference, because it adds the columnNameOfCorruptRecord column if a malformed record was found during schema inference. I mean:

scala> spark.read.json(Seq("""{"a": 1}""", """{"a":""").toDS).printSchema()
root
 |-- _corrupt_record: string (nullable = true)
 |-- a: long (nullable = true)


scala> spark.read.json(Seq("""{"a": 1}""").toDS).printSchema()
root
 |-- a: long (nullable = true)

But yes, I think I misread it. Here we are already describing things mainly about malformed records.

field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
field in an output schema. It doesn't support partial results. Even just one \
Member

It's trivial, but how about we avoid an abbreviation like doesn't? It's usually what I do for docs, although I am not sure if it actually matters.

Member Author

Ok.

field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
field in an output schema. It does not support partial results. Even just one \
Contributor

I think we can drop the last sentence. The doc is already pretty clear, saying "and sets other fields to null".

an expected length of a schema, it sets `null` for extra fields.
* ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
into a field configured by ``columnNameOfCorruptRecord``, and sets other \
fields to ``null``. To keep corrupt records, an user can set a string type \
Contributor

We can't say "and sets other fields to null", as it's not the case for CSV.

Member Author

I think this is talking about a corrupted record, not a record with less/more tokens. If the CSV parser fails to parse a record, all the other fields are set to null.
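
For illustration, a rough sketch of the distinction being discussed (runnable in spark-shell; the schema and data are made up):

import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("a", IntegerType).add("b", IntegerType)

// "1,x": the token count matches the schema but "x" cannot be parsed as INT,
// so the record is corrupted and all fields are set to null.
// "2": fewer tokens than the schema, which CSV does not treat as corrupted;
// the parsed token is kept and the missing field is null (a = 2, b = null).
val rows = Seq("1,x", "2").toDS
spark.read.schema(schema).csv(rows).show()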

Contributor

Ah, I think we need to explain that, for CSV, a record with less/more tokens is not a malformed record.

Member Author

Ok. Added.

@SparkQA

SparkQA commented Feb 26, 2018

Test build #87663 has finished for PR 20666 at commit 1d03d3b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87689 has finished for PR 20666 at commit 4f9b148.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Feb 27, 2018 via email

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87699 has finished for PR 20666 at commit 4f9b148.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fields to ``null``. To keep corrupt records, an user can set a string type \
field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
A record with less/more tokens than schema is not a corrupted record. \
Contributor

.. not a corrupted record to CSV.

It supports partial result for such records. When it meets a record having \
the length of parsed tokens shorter than the length of a schema, it sets \
``null`` for extra fields. When a length of tokens is longer than a schema, \
it drops extra tokens.
Contributor

@cloud-fan cloud-fan Feb 27, 2018

When it meets a record having fewer tokens than the length of the schema, set ``null`` to extra fields.
When the record has more tokens than the length of the schema, it drops extra tokens.

field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
A record with less/more tokens than schema is not a corrupted record. \
It supports partial result for such records. When it meets a record having \
Contributor

Regarding "It supports partial result for such records.": this doesn't look very useful; I think the following sentences explain this case well.

``columnNameOfCorruptRecord`` field in an output schema.
* ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
into a field configured by ``columnNameOfCorruptRecord``, and sets other \
fields to ``null``. It does not support partial results. To keep corrupt \
Contributor

Regarding "It does not support partial results.": I think we don't need to mention this for JSON.

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87711 has finished for PR 20666 at commit 654a59b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87720 has finished for PR 20666 at commit fe260c9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87721 has finished for PR 20666 at commit daa326d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Feb 27, 2018

retest this please.

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87735 has finished for PR 20666 at commit daa326d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Feb 28, 2018
## What changes were proposed in this pull request?

Clarify JSON and CSV reader behavior in document.

JSON doesn't support partial results for corrupted records.
CSV only supports partial results for the records with more or less tokens.

## How was this patch tested?

Pass existing tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20666 from viirya/SPARK-23448-2.

(cherry picked from commit b14993e)
Signed-off-by: hyukjinkwon <[email protected]>
@HyukjinKwon
Member

Merged to master and branch-2.3.

@asfgit asfgit closed this in b14993e Feb 28, 2018
@viirya
Member Author

viirya commented Feb 28, 2018

Thanks @HyukjinKwon @cloud-fan!

@yogyang

yogyang commented Aug 7, 2018

Hi guys, I am a Spark user.
I have a question about the "JSON doesn't support partial results for corrupted records." behavior.
In Spark 1.6, partial results were given, but after upgrading to 2.2, I lose some meaningful data in my JSON file.

Could I get that data back in Spark 2+? @viirya
Thanks for the help.

@yogyang

yogyang commented Aug 7, 2018

To be specific, the JSON file (Sanity4.json) is:

{"a":"a1","int":1,"other":4.4}
{"a":"a2","int":"","other":""}

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val config = new SparkConf().setMaster("local[5]").setAppName("test")
val sc = SparkContext.getOrCreate(config)
val sql = new SQLContext(sc)

val file_path = this.getClass.getClassLoader.getResource("Sanity4.json").getFile
val df = sql.read.schema(null).json(file_path)  // no user-specified schema, so it is inferred
df.show(30)
df.printSchema()

Then in Spark 1.6, the result is:

+---+----+-----+
|  a| int|other|
+---+----+-----+
| a1|   1|  4.4|
| a2|null| null|
+---+----+-----+

root
 |-- a: string (nullable = true)
 |-- int: long (nullable = true)
 |-- other: double (nullable = true)

But in Spark 2.2, the result is:

+----+----+-----+
|   a| int|other|
+----+----+-----+
|  a1|   1|  4.4|
|null|null| null|
+----+----+-----+

root
 |-- a: string (nullable = true)
 |-- int: long (nullable = true)
 |-- other: double (nullable = true)

@HyukjinKwon
Member

That's not related to this change. The issue itself seems to be a behaviour change between 1.6 and 2.x in whether an empty string is treated as null for double and float, which is rather a corner case but does look like an issue. Let me try to fix it while I'm here.
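
For what it's worth, a small sketch isolating the record in question, using the inferred schema shown above as an explicit schema (illustrative only):

import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType()
  .add("a", StringType)
  .add("int", LongType)
  .add("other", DoubleType)

// The numeric fields hold empty strings. Spark 1.6 kept "a2" and set the
// numeric fields to null; 2.x treats the whole record as malformed, so every
// field comes back null, matching the 2.2 output above.
val record = Seq("""{"a":"a2","int":"","other":""}""").toDS
spark.read.schema(schema).json(record).show()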

@yogyang

yogyang commented Aug 8, 2018

Hi, thanks for the help.
And do we have any follow-up stories on the data loss in Spark 2.2?
I have tried to use sql.read.format("my.spark.sql.execution.datasources.json").load(file_path) by shading spark-sql_2.10:1.6.jar into my.spark.sql-shade.jar, but found that the json.DefaultSource from Spark 1.6 cannot be used in Spark 2.2, failing with:

Exception in thread "main" org.apache.spark.sql.AnalysisException: my.spark.sql.execution.datasources.json is not a valid Spark SQL Data Source.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
at com.test.shade.SparkTest$.main(SparkTest.scala:32)
at com.test.shade.SparkTest.main(SparkTest.scala)

So could I get some other suggestions? @HyukjinKwon

Thanks a lot.

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

Clarify JSON and CSV reader behavior in document.

JSON doesn't support partial results for corrupted records.
CSV only supports partial results for the records with more or less tokens.

## How was this patch tested?

Pass existing tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#20666 from viirya/SPARK-23448-2.

(cherry picked from commit b14993e)
Signed-off-by: hyukjinkwon <[email protected]>
@viirya viirya deleted the SPARK-23448-2 branch December 27, 2023 18:21