[SPARK-23148][SQL] Allow pathnames with special characters for CSV / JSON / text #20355
Conversation
Test build #86502 has finished for PR 20355 at commit
Test build #86503 has finished for PR 20355 at commit
LGTM otherwise. Can we change the PR title to [SPARK-23148][SQL] ... too?
Could we test multiLine here too? I think that's the actual test case for the JIRA.
Shall we remove this test case and only leave the CSV / JSON tests with multiLine enabled? I think the other tests are basically duplicates of
spark/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala, line 71 in a0aedb0:

test(s"SPARK-22146 read files containing special characters using $format") {
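For context, a hedged sketch of what that referenced SPARK-22146 test does (only the test name above is quoted from the suite; the body is reconstructed from fragments quoted later in this thread):

allFileBasedDataSources.foreach { format =>
  test(s"SPARK-22146 read files containing special characters using $format") {
    val nameWithSpecialChars = "sp&cial%chars"
    withTempDir { dir =>
      // Write a small dataset to a path containing '&' and '%', then
      // read it back with the same format and check the contents.
      val tmpFile = s"$dir/$nameWithSpecialChars"
      spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
      val fileContent = spark.read.format(format).load(tmpFile)
      checkAnswer(fileContent, Seq(Row("a"), Row("b")))
    }
  }
}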
@HyukjinKwon good idea, thanks for pointing out that test. How about I just add a space to the special-characters string, and have the existing test also exercise multiLine on/off for text, CSV and JSON?
Let's just add multiLine-specific tests to both CSVSuite and JsonSuite, and add a space in "SPARK-22146 read files containing special characters using ..." if you prefer this way.
Alternatively, adding a test like the one below in FileBasedDataSourceSuite.scala, instead of in CSVSuite and JsonSuite:

// Only CSV/JSON supports the multiLine option and the code paths are different blabla ..
Seq("csv", "json").foreach { format =>
  test(s"SPARK-23148 read files containing special characters using $format - multiLine enabled") {
    val nameWithSpecialChars = "sp&ci al%chars"
    withTempDir { dir =>
      ... spark.read.option("multiLine", true).format(format)...
    }
  }
}

could also be fine, if possible. Either way is fine with me.
In the end, to reduce code duplication, I made it so that orc and parquet run with multiLine as well (I tried to find a neat way to only run multiLine if the format was csv, text or json without having a separate test case, but it just complicated things). Let me know if you'd rather have two separate test cases to avoid running the two redundant cases with orc / parquet.
Test build #86505 has finished for PR 20355 at commit
…JSON / text

## What changes were proposed in this pull request?

Fix for JSON and CSV data sources when file names include characters that would be changed by URL encoding.

## How was this patch tested?

New unit tests for JSON, CSV and text suites
checkAnswer(fileContent, Seq(Row("a"), Row("b")))

test(s"SPARK-22146 / SPARK-23148 read files containing special characters using $format") {
  val nameWithSpecialChars = s"sp&cial%c hars"
  Seq(true, false).foreach { multiline =>
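Pieced together from the diff fragments above and the earlier discussion, the combined test likely looked roughly like this (the exact body is an assumption):

allFileBasedDataSources.foreach { format =>
  test(s"SPARK-22146 / SPARK-23148 read files containing special characters using $format") {
    val nameWithSpecialChars = s"sp&cial%c hars"
    // Run each format with multiLine both on and off, even though only
    // CSV / JSON actually consult the option.
    Seq(true, false).foreach { multiline =>
      withTempDir { dir =>
        val tmpFile = s"$dir/$nameWithSpecialChars"
        spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
        val fileContent =
          spark.read.format(format).option("multiLine", multiline).load(tmpFile)
        checkAnswer(fileContent, Seq(Row("a"), Row("b")))
      }
    }
  }
}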
Less duplication is fine, but this version is slightly confusing in that it suggests orc and parquet support multiLine, and it runs duplicated tests, as you pointed out, if I should nitpick. I think I prefer a separate test.
Sounds good to me.
}

// Separate test case for text-based formats that support multiLine as an option.
Seq("json", "csv", "text").foreach { format =>
Actually, "text" doesn't support multiLine but wholetext, which runs another code path. Let's take this out.
How about leaving the tests for SPARK-22146 as they were, and just adding a test dedicated to multiLine here, like:

Seq("json", "csv").foreach { format =>
  test("SPARK-23148 read files containing special characters " +
    s"using $format multiline enabled") {
    withTempDir { dir =>
      val tmpFile = s"$dir/$nameWithSpecialChars"
      spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
      val reader = spark.read.format(format).option("multiLine", true)
      val fileContent = reader.load(tmpFile)
      checkAnswer(fileContent, Seq(Row("a"), Row("b")))
    }
  }
}

This PR really only fixes the code paths for CSV / JSON when multiLine is enabled .. I am not sure why you don't like this suggestion .. one test is less invasive and more targeted at one JIRA ...
That sounds good - my main concern is to make sure that both multiLine=true and multiLine=false have coverage with a space in the name, since they are such different paths. I'll keep the change that adds a space to nameWithSpecialChars, but otherwise have the tests as you suggest - let me know what you think of the next patch!
import testImplicits._

private val allFileBasedDataSources = Seq("orc", "parquet", "csv", "json", "text")
private val nameWithSpecialChars = s"sp&cial%c hars"
nit: seems the s prefix can be removed.
Let's not forget to fix the PR title to [SPARK-23148][SQL] ...
Test build #86552 has finished for PR 20355 at commit
Test build #86554 has finished for PR 20355 at commit
Please fix the PR title.
Test build #86564 has finished for PR 20355 at commit
retest this please
Test build #86577 has finished for PR 20355 at commit
…JSON / text

## What changes were proposed in this pull request?

Fix for JSON and CSV data sources when file names include characters that would be changed by URL encoding.

## How was this patch tested?

New unit tests for JSON, CSV and text suites

Author: Henry Robinson <[email protected]>

Closes #20355 from henryr/spark-23148.

(cherry picked from commit de36f65)
Signed-off-by: hyukjinkwon <[email protected]>
Merged to master and branch-2.3.
This was fixed in the CSV data sources but not in InMemoryFileIndex. This means that if people extend the FileFormat class, they will still get this error.
but we can extend from it... |
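To illustrate the underlying URL-encoding issue this exchange refers to, here is a minimal sketch (an assumed example, not code from the PR): java.net.URI parses its string argument strictly and rejects raw spaces and stray '%' escapes, while Hadoop's Path quotes such characters internally.

import java.net.URI
import org.apache.hadoop.fs.Path

val name = "sp&cial%c hars"
// Hadoop Path accepts the raw string and escapes it internally:
val p = new Path("/tmp", name)
println(p.toUri)  // prints the escaped URI form (space and '%' percent-encoded)
// Constructing a URI from the same unescaped string throws
// java.net.URISyntaxException, so any code path that does
// `new URI(pathString)` on such a file name fails:
// new URI(s"/tmp/$name")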
…JSON / text
What changes were proposed in this pull request?
Fix for JSON and CSV data sources when file names include characters
that would be changed by URL encoding.
How was this patch tested?
New unit tests for JSON, CSV and text suites
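As a concrete illustration of the scenario described above, a minimal sketch (the path and session setup are assumptions, not from the PR):

import spark.implicits._  // assumes an active SparkSession named `spark`

// Write a small dataset to a directory whose name contains a space and
// a '%', then read it back with multiLine enabled - before this fix,
// the multiLine code path for CSV / JSON failed on such paths.
val dir = "/tmp/sp&cial%c hars"
spark.createDataset(Seq("a", "b")).write.json(dir)
val df = spark.read.option("multiLine", true).json(dir)
df.show()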