Conversation

@henryr
Contributor

@henryr henryr commented Jan 23, 2018

…JSON / text

What changes were proposed in this pull request?

Fix for JSON and CSV data sources when file names include characters
that would be changed by URL encoding.

How was this patch tested?

New unit tests for JSON, CSV and text suites
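For context on why such names break, here is a minimal JVM sketch (plain `java.net` classes, not Spark's actual code path) showing that a name containing `%` and a space only round-trips through a URI if it is percent-encoded first:

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class SpecialCharPathDemo {
    public static void main(String[] args) throws Exception {
        // File name with '&', '%' and a space, as used in the new tests.
        String name = "sp&cial%c hars";

        // The multi-argument URI constructor percent-encodes characters that
        // are illegal in a URI path: '%' becomes %25, the space becomes %20.
        String encoded = new URI(null, null, "/tmp/" + name, null).toASCIIString();
        System.out.println(encoded); // /tmp/sp&cial%25c%20hars

        // Decoding the encoded form recovers the original name. Code that
        // instead feeds the raw string into URI parsing treats "%c " as a
        // broken escape sequence, which is the class of bug fixed here.
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        System.out.println(decoded); // /tmp/sp&cial%c hars
    }
}
```

The `/tmp` prefix and class name are illustrative only; the point is the encode-before-parse / decode-after-parse discipline that the data sources need for such file names.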

@SparkQA

SparkQA commented Jan 23, 2018

Test build #86502 has finished for PR 20355 at commit 98da4cd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 23, 2018

Test build #86503 has finished for PR 20355 at commit 535234c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM otherwise. Can we change the PR title to [SPARK-23148][SQL] ... too?

Member

@HyukjinKwon HyukjinKwon Jan 23, 2018

Could we test multiLine here too? I think that's the actual test case for the JIRA.

Member

Shall we remove this test case and only leave the tests for CSV / JSON with multiLine enabled? I think the other tests are basically duplicates of

test(s"SPARK-22146 read files containing special characters using $format") {

Contributor Author

@HyukjinKwon good idea, thanks for pointing out that test - how about I just add a space to the special characters string, and have the existing test also cover multiLine on/off for text, CSV and JSON?

Member

Let's just add multiLine-specific tests to both CSVSuite and JsonSuite, and add a space in "SPARK-22146 read files containing special characters using ..." if you prefer it that way.

Member

Alternatively, adding a test like the one below in FileBasedDataSourceSuite.scala, instead of in CSVSuite and JsonSuite:

// Only CSV/JSON supports the multiLine option and the code paths are different ...
Seq("csv", "json").foreach { format =>
  test(s"SPARK-23148 read files containing special characters using $format - multiLine enabled") {
    val nameWithSpecialChars = "sp&ci al%chars"
    withTempDir { dir =>
      ... spark.read.option("multiLine", true).format(format)...
    }
  }
}

could also be fine, if possible. Either way is fine with me.

Contributor Author

In the end, to reduce code duplication, I made it so that orc and parquet run the multiLine variants as well (I tried to find a neat way to run multiLine only when the format was csv, text or json without having a separate test case, but it just complicated things). Let me know if you'd rather have two separate test cases to avoid running the two redundant cases with orc / parquet.

@SparkQA

SparkQA commented Jan 23, 2018

Test build #86505 has finished for PR 20355 at commit 9c56736.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(fileContent, Seq(Row("a"), Row("b")))
test(s"SPARK-22146 / SPARK-23148 read files containing special characters using $format") {
val nameWithSpecialChars = s"sp&cial%c hars"
Seq(true, false).foreach { multiline =>
Member

Less duplication is fine, but this case is slightly confusing: it makes it look like orc and parquet support multiLine, and it runs duplicated tests, as you pointed out, if I should nitpick. I think I prefer a separate test.

Contributor Author

Sounds good to me.

}

// Separate test case for text-based formats that support multiLine as an option.
Seq("json", "csv", "text").foreach { format =>
Member

Actually, "text" doesn't support multiLine but wholetext, which runs another code path. Let's take this out.

How about leaving the tests for SPARK-22146 as they were, and just adding a test dedicated to multiLine here, like:

    Seq("json", "csv").foreach { format =>
      test("SPARK-23148 read files containing special characters " +
        s"using $format multiline enabled") {
        withTempDir { dir =>
          val tmpFile = s"$dir/$nameWithSpecialChars"
          spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
          val reader = spark.read.format(format).option("multiLine", true)
          val fileContent = reader.load(tmpFile)
          checkAnswer(fileContent, Seq(Row("a"), Row("b")))
        }
      }
    }

These PR changes really fix the code paths for both CSV and JSON only when multiLine is enabled. I am not sure why you don't like this suggestion: one test is less invasive and more targeted at one JIRA.

Contributor Author

That sounds good - my main concern is to make sure that both multiLine=true and multiLine=false have coverage with a space in the name, since they are such different paths. I'll keep the change that adds a space to nameWithSpecialChars, but otherwise have the tests as you suggest - let me know what you think of the next patch!

import testImplicits._

private val allFileBasedDataSources = Seq("orc", "parquet", "csv", "json", "text")
private val nameWithSpecialChars = s"sp&cial%c hars"
Member

nit: seems the s prefix can be removed.

@HyukjinKwon
Member

Let's not forget to fix the PR title to [SPARK-23148][SQL] ... . This style is documented and encouraged in this project's contributing guide.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86552 has finished for PR 20355 at commit 740def4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86554 has finished for PR 20355 at commit 1b9420a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@henryr henryr changed the title SPARK-23148: [SQL] Allow pathnames with special characters for CSV / … [SPARK-23148][SQL] Allow pathnames with special characters for CSV / … Jan 24, 2018
@gatorsmile
Member

Please fix the PR title.

@henryr henryr changed the title [SPARK-23148][SQL] Allow pathnames with special characters for CSV / … [SPARK-23148][SQL] Allow pathnames with special characters for CSV / JSON / text Jan 24, 2018
@SparkQA

SparkQA commented Jan 24, 2018

Test build #86564 has finished for PR 20355 at commit 51b0db5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86577 has finished for PR 20355 at commit 51b0db5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 24, 2018

Author: Henry Robinson <[email protected]>

Closes #20355 from henryr/spark-23148.

(cherry picked from commit de36f65)
Signed-off-by: hyukjinkwon <[email protected]>
@HyukjinKwon
Member

Merged to master and branch-2.3.

@asfgit asfgit closed this in de36f65 Jan 24, 2018
@jomach
Contributor

jomach commented Jan 26, 2020

This was fixed on the CSV data sources but not on the InMemoryFileIndex. This means that if people extend the FileFormat class, they will get this error.
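To illustrate the failure mode such extensions can hit, here is a minimal sketch using plain `java.net.URI` (not the InMemoryFileIndex code itself): parsing a raw, never-encoded path as a URI rejects the space and treats "%c " as a broken escape sequence:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class RawPathParseDemo {
    public static void main(String[] args) {
        // Raw path containing '%' and a space, never percent-encoded.
        String raw = "/tmp/sp&cial%c hars";
        try {
            // Single-argument parsing assumes the string is already a valid
            // URI, so the illegal characters make it throw.
            new URI(raw);
            System.out.println("parsed ok");
        } catch (URISyntaxException e) {
            System.out.println("parse failed: " + e.getReason());
        }
    }
}
```

The path and class name here are illustrative; any code path that still hands raw path strings to URI parsing would hit the same exception.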

@HyukjinKwon
Member

FileFormat isn't an API.

@jomach
Contributor

jomach commented Feb 10, 2020

but we can extend from it...
