[SPARK-23148][SQL] Allow pathnames with special characters for CSV / JSON / text #20355
Conversation
Test build #86502 has finished for PR 20355 at commit
Test build #86503 has finished for PR 20355 at commit
LGTM otherwise. Can we change the PR title to [SPARK-23148][SQL] ... too?
Could we test multiLine here too? I think that's the actual test case for the JIRA.
Shall we remove this test case and only leave the CSV / JSON tests with multiLine enabled? I think the other tests are basically duplicates of
spark/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala, line 71 in a0aedb0:

test(s"SPARK-22146 read files containing special characters using $format") {
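For context, a hedged sketch of what that referenced SPARK-22146 test does (only the test name above is quoted from the suite; the body is reconstructed from fragments quoted later in this thread):

allFileBasedDataSources.foreach { format =>
  test(s"SPARK-22146 read files containing special characters using $format") {
    val nameWithSpecialChars = "sp&cial%chars"
    withTempDir { dir =>
      // Write a small dataset to a path containing '&' and '%', then
      // read it back with the same format and check the contents.
      val tmpFile = s"$dir/$nameWithSpecialChars"
      spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
      val fileContent = spark.read.format(format).load(tmpFile)
      checkAnswer(fileContent, Seq(Row("a"), Row("b")))
    }
  }
}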
@HyukjinKwon good idea, thanks for pointing out that test. How about I just add a space to the special-characters string, and have the existing test also exercise multiLine on/off for text, CSV and JSON?
Let's just add multiLine-specific tests to both CSVSuite and JsonSuite, and add a space in "SPARK-22146 read files containing special characters using ..." if you prefer this way.
Alternatively, adding a test like the one below in FileBasedDataSourceSuite.scala, instead of in CSVSuite and JsonSuite:

// Only CSV/JSON supports the multiLine option and the code paths are different blabla ..
Seq("csv", "json").foreach { format =>
  test(s"SPARK-23148 read files containing special characters using $format - multiLine enabled") {
    val nameWithSpecialChars = "sp&ci al%chars"
    withTempDir { dir =>
      ... spark.read.option("multiLine", true).format(format)...
    }
  }
}

could also be fine, if possible. Either way is fine with me.
In the end, to reduce code duplication, I made it so that orc and parquet run with multiLine as well (I tried to find a neat way to only run multiLine if the format was csv, text or json without having a separate test case, but it just complicated things). Let me know if you'd rather have two separate test cases to avoid running the two redundant cases with orc / parquet.
Test build #86505 has finished for PR 20355 at commit
…JSON / text

## What changes were proposed in this pull request?

Fix for JSON and CSV data sources when file names include characters that would be changed by URL encoding.

## How was this patch tested?

New unit tests for JSON, CSV and text suites
checkAnswer(fileContent, Seq(Row("a"), Row("b")))

test(s"SPARK-22146 / SPARK-23148 read files containing special characters using $format") {
  val nameWithSpecialChars = s"sp&cial%c hars"
  Seq(true, false).foreach { multiline =>
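Pieced together from the diff fragments above and the earlier discussion, the combined test likely looked roughly like this (the exact body is an assumption):

allFileBasedDataSources.foreach { format =>
  test(s"SPARK-22146 / SPARK-23148 read files containing special characters using $format") {
    val nameWithSpecialChars = s"sp&cial%c hars"
    // Run each format with multiLine both on and off, even though only
    // CSV / JSON actually consult the option.
    Seq(true, false).foreach { multiline =>
      withTempDir { dir =>
        val tmpFile = s"$dir/$nameWithSpecialChars"
        spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
        val fileContent =
          spark.read.format(format).option("multiLine", multiline).load(tmpFile)
        checkAnswer(fileContent, Seq(Row("a"), Row("b")))
      }
    }
  }
}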
Less duplication is fine, but this version is slightly confusing in that it suggests orc and parquet support multiLine, and it runs duplicated tests, as you pointed out, if I should nitpick. I think I prefer a separate test.
Sounds good to me.
}

// Separate test case for text-based formats that support multiLine as an option.
Seq("json", "csv", "text").foreach { format =>
Actually, "text" doesn't support multiLine but wholetext, which runs another code path. Let's take this out.
How about leaving the tests for SPARK-22146 as they were, and just adding a test dedicated to multiLine here, like:

Seq("json", "csv").foreach { format =>
  test("SPARK-23148 read files containing special characters " +
    s"using $format multiline enabled") {
    withTempDir { dir =>
      val tmpFile = s"$dir/$nameWithSpecialChars"
      spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
      val reader = spark.read.format(format).option("multiLine", true)
      val fileContent = reader.load(tmpFile)
      checkAnswer(fileContent, Seq(Row("a"), Row("b")))
    }
  }
}

This PR really only fixes the code paths for CSV / JSON when multiLine is enabled .. I am not sure why you don't like this suggestion .. one test is less invasive and more targeted at one JIRA ...
That sounds good - my main concern is to make sure that both multiLine=true and multiLine=false have coverage with a space in the name, since they are such different paths. I'll keep the change that adds a space to nameWithSpecialChars, but otherwise have the tests as you suggest - let me know what you think of the next patch!
import testImplicits._

private val allFileBasedDataSources = Seq("orc", "parquet", "csv", "json", "text")
private val nameWithSpecialChars = s"sp&cial%c hars"
nit: seems the s prefix can be removed.
Let's not forget to fix the PR title to [SPARK-23148][SQL] ...
Test build #86552 has finished for PR 20355 at commit
Test build #86554 has finished for PR 20355 at commit
Please fix the PR title.
Test build #86564 has finished for PR 20355 at commit
retest this please
Test build #86577 has finished for PR 20355 at commit
…JSON / text

## What changes were proposed in this pull request?

Fix for JSON and CSV data sources when file names include characters that would be changed by URL encoding.

## How was this patch tested?

New unit tests for JSON, CSV and text suites

Author: Henry Robinson <[email protected]>

Closes #20355 from henryr/spark-23148.

(cherry picked from commit de36f65)
Signed-off-by: hyukjinkwon <[email protected]>
Merged to master and branch-2.3.
This was fixed in the CSV data sources but not in InMemoryFileIndex. This means that if people extend the FileFormat class, they will still get this error.
but we can extend from it... |
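To illustrate the underlying URL-encoding issue this exchange refers to, here is a minimal sketch (an assumed example, not code from the PR): java.net.URI parses its string argument strictly and rejects raw spaces and stray '%' escapes, while Hadoop's Path quotes such characters internally.

import java.net.URI
import org.apache.hadoop.fs.Path

val name = "sp&cial%c hars"
// Hadoop Path accepts the raw string and escapes it internally:
val p = new Path("/tmp", name)
println(p.toUri)  // prints the escaped URI form (space and '%' percent-encoded)
// Constructing a URI from the same unescaped string throws
// java.net.URISyntaxException, so any code path that does
// `new URI(pathString)` on such a file name fails:
// new URI(s"/tmp/$name")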
…JSON / text
What changes were proposed in this pull request?
Fix for JSON and CSV data sources when file names include characters
that would be changed by URL encoding.
How was this patch tested?
New unit tests for JSON, CSV and text suites
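As a concrete illustration of the scenario described above, a minimal sketch (the path and session setup are assumptions, not from the PR):

import spark.implicits._  // assumes an active SparkSession named `spark`

// Write a small dataset to a directory whose name contains a space and
// a '%', then read it back with multiLine enabled - before this fix,
// the multiLine code path for CSV / JSON failed on such paths.
val dir = "/tmp/sp&cial%c hars"
spark.createDataset(Seq("a", "b")).write.json(dir)
val df = spark.read.option("multiLine", true).json(dir)
df.show()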