-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-25935][SQL] Prevent null rows from JSON parser #22938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
31cb534
0589d91
0aa4f60
d1bad7c
c4d6a80
a7e016a
35b3013
6a8cac3
1ef2d5b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -240,16 +240,6 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { | |
| Seq(Row("1"), Row("2"))) | ||
| } | ||
|
|
||
| test("SPARK-11226 Skip empty line in json file") { | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I removed the test because it is not relevant to the default mode
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Where is it moved to then? Does that mean we don't have a regression test for SPARK-11226 anymore? |
||
| spark.read | ||
| .json(Seq("{\"a\": \"1\"}}", "{\"a\": \"2\"}}", "{\"a\": \"3\"}}", "").toDS()) | ||
| .createOrReplaceTempView("d") | ||
|
|
||
| checkAnswer( | ||
| sql("select count(1) from d"), | ||
| Seq(Row(3))) | ||
| } | ||
|
|
||
| test("SPARK-8828 sum should return null if all input values are null") { | ||
| checkAnswer( | ||
| sql("select sum(a), avg(a) from allNulls"), | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -1115,6 +1115,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| Row(null, null, null), | ||||||
| Row(null, null, null), | ||||||
| Row(null, null, null), | ||||||
| Row(null, null, null), | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so for json data source, previous behavior is, we would skip the row even it's in PERMISSIVE mode. Shall we clearly mention it in the migration guide?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, we skipped such rows if
Sure. |
||||||
| Row("str_a_4", "str_b_4", "str_c_4"), | ||||||
| Row(null, null, null)) | ||||||
| ) | ||||||
|
|
@@ -1136,6 +1137,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| checkAnswer( | ||||||
| jsonDF.select($"a", $"b", $"c", $"_unparsed"), | ||||||
| Row(null, null, null, "{") :: | ||||||
| Row(null, null, null, "") :: | ||||||
| Row(null, null, null, """{"a":1, b:2}""") :: | ||||||
| Row(null, null, null, """{"a":{, b:3}""") :: | ||||||
| Row("str_a_4", "str_b_4", "str_c_4", null) :: | ||||||
|
|
@@ -1150,6 +1152,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| checkAnswer( | ||||||
| jsonDF.filter($"_unparsed".isNotNull).select($"_unparsed"), | ||||||
| Row("{") :: | ||||||
| Row("") :: | ||||||
| Row("""{"a":1, b:2}""") :: | ||||||
| Row("""{"a":{, b:3}""") :: | ||||||
| Row("]") :: Nil | ||||||
|
|
@@ -1171,6 +1174,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| checkAnswer( | ||||||
| jsonDF.selectExpr("a", "b", "c", "_malformed"), | ||||||
| Row(null, null, null, "{") :: | ||||||
| Row(null, null, null, "") :: | ||||||
| Row(null, null, null, """{"a":1, b:2}""") :: | ||||||
| Row(null, null, null, """{"a":{, b:3}""") :: | ||||||
| Row("str_a_4", "str_b_4", "str_c_4", null) :: | ||||||
|
|
@@ -1813,6 +1817,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| val path = dir.getCanonicalPath | ||||||
| primitiveFieldAndType | ||||||
| .toDF("value") | ||||||
| .repartition(1) | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is the
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As far as I remember I added the spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala Line 151 in 46110a5
and closes the empty file in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala Line 162 in 46110a5
From the read side, when |
||||||
| .write | ||||||
| .option("compression", "GzIp") | ||||||
| .text(path) | ||||||
|
|
@@ -1838,6 +1843,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| val path = dir.getCanonicalPath | ||||||
| primitiveFieldAndType | ||||||
| .toDF("value") | ||||||
| .repartition(1) | ||||||
| .write | ||||||
| .text(path) | ||||||
|
|
||||||
|
|
@@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| .text(path) | ||||||
|
|
||||||
| val jsonDF = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path) | ||||||
| assert(jsonDF.count() === corruptRecordCount) | ||||||
| assert(jsonDF.count() === corruptRecordCount + 1) // null row for empty file | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wait, does this mean that it reads an empty record from empty file after this change?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If that's true, we should not do this. Empty files can be generated in many cases for now and the behaviour is not currently well defined. If we rely on this behaviour, it will cause some weird behaviours or bugs hard to fix.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shall we skip empty files for all the file-based data sources? |
||||||
| assert(jsonDF.schema === new StructType() | ||||||
| .add("_corrupt_record", StringType) | ||||||
| .add("dummy", StringType)) | ||||||
|
|
@@ -1905,7 +1911,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| F.count($"dummy").as("valid"), | ||||||
| F.count($"_corrupt_record").as("corrupt"), | ||||||
| F.count("*").as("count")) | ||||||
| checkAnswer(counts, Row(1, 4, 6)) | ||||||
| checkAnswer(counts, Row(1, 5, 7)) // null row for empty file | ||||||
| } | ||||||
| } | ||||||
|
|
||||||
|
|
@@ -2513,7 +2519,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |||||
| } | ||||||
|
|
||||||
| checkCount(2) | ||||||
| countForMalformedJSON(0, Seq("")) | ||||||
| countForMalformedJSON(1, Seq("")) | ||||||
| } | ||||||
|
|
||||||
| test("SPARK-25040: empty strings should be disallowed") { | ||||||
|
|
||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can only happen when we have a bug, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it must not happen.