-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17796][SQL] Support wildcard character in filename for LOAD DATA LOCAL INPATH #15376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #66437 has finished for PR 15376 at commit
|
|
The only test failure is irrelevant. Locally, it passed. |
|
Retest this please. |
|
Test build #66452 has finished for PR 15376 at commit
|
|
Test build #66618 has finished for PR 15376 at commit
|
|
Test build #66842 has finished for PR 15376 at commit
|
|
Hm, so this did used to work? I always thought inpath was a single file or directory. The Hive docs suggest this is the case, and we've tried to follow Hive I guess. |
|
Oh. Indeed. I thought it's a supported way since it works on Hive 1.2. hive> load data local inpath '/data/t/*.txt' INTO TABLE x;
Loading data to table default.x
Table default.x stats: [numFiles=12, totalSize=613224000]
OK
Time taken: 3.712 secondsAccording to your advice and the URL, it seems not a normal or recommended way. |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, looks like you're right, empirically. OK, I could see an argument for this. Let me leave some comments here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually write .init to mean "all but the last element"
Hm, if you're using the local file system here, can you use the JDK java.nio.file.Path API to do the parsing of elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, it uses java.nio.file.Path APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, you might get one reference to FileSystems.getDefault and reuse it rather than keep retrieving it.
|
Thank you for review, @srowen . I'll update the PR. |
|
|
|
Retest this please. |
|
Test build #67052 has finished for PR 15376 at commit
|
|
Test build #67064 has finished for PR 15376 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looking pretty good. It's another interesting question about how much to follow Hive's behavior, but given it's a simple change and aligns more with Hive I think it's good.
| val fileSystem = FileSystems.getDefault | ||
| val pathPattern = fileSystem.getPath(filePath) | ||
| val dir = pathPattern.getParent.toString | ||
| val filePattern = pathPattern.getName(pathPattern.getNameCount - 1).toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think getFileName returns the last element in the path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'll use that.
| if (files == null) { | ||
| false | ||
| } else { | ||
| val matcher = fileSystem.getPathMatcher("glob:" + filePattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking up how this works, and found http://stackoverflow.com/a/14164134/64174 which suggests that this might not work unless the glob starts with "**". However, I wonder if you can just pass this method "glob:" + pathPattern in this case anyway to have it match the whole absolute path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It matches the whole absolute path.
scala> val fs = java.nio.file.FileSystems.getDefault
fs: java.nio.file.FileSystem = sun.nio.fs.MacOSXFileSystem@782dc5
scala> fs.getPathMatcher("glob:/x/1.dat").matches(fs.getPath("/x/1.dat"))
res0: Boolean = true
scala> fs.getPathMatcher("glob:/x/*.dat").matches(fs.getPath("/x/1.dat"))
res1: Boolean = trueThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think I missed your point. I will update the code to use absolute path here, too.
| test("SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH") { | ||
| withTempDir { dir => | ||
| for (i <- 1 to 3) { | ||
| val writer = new PrintWriter(new File(s"$dir/part-r-0000$i")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS I think you can use Files.write from Guava to do this a little more easily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll use Guava one here, too.
|
Test build #67085 has finished for PR 15376 at commit
|
|
Retest this please. |
|
Test build #67083 has finished for PR 15376 at commit
|
|
Test build #67089 has finished for PR 15376 at commit
|
|
Thank you for your review and approval, @srowen ! |
|
Merged to master |
|
Thank you, @srowen ! |
| val fileSystem = FileSystems.getDefault | ||
| val pathPattern = fileSystem.getPath(filePath) | ||
| val dir = pathPattern.getParent.toString | ||
| if (dir.contains("*")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the * appears in the grandparent dir? e.g. dir*/subdir/fileName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will result in an AnalysisException and that seems like the intended behavior. Only a * in the file itself is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @cloud-fan and @srowen . Yes. It's the intended behavior of this PR.
…TA LOCAL INPATH
## What changes were proposed in this pull request?
Currently, Spark 2.0 raises an `input path does not exist` AnalysisException if the file name contains '*'. It is misleading since it occurs when there exist some matched files. Also, it was a supported feature in Spark 1.6.2. This PR aims to support wildcard characters in filename for `LOAD DATA LOCAL INPATH` SQL command like Spark 1.6.2.
**Reported Error Scenario**
```scala
scala> sql("CREATE TABLE t(a string)")
res0: org.apache.spark.sql.DataFrame = []
scala> sql("LOAD DATA LOCAL INPATH '/tmp/x*' INTO TABLE t")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /tmp/x*;
```
## How was this patch tested?
Pass the Jenkins test with a new test case.
Author: Dongjoon Hyun <[email protected]>
Closes apache#15376 from dongjoon-hyun/SPARK-17796.
…TA LOCAL INPATH
## What changes were proposed in this pull request?
Currently, Spark 2.0 raises an `input path does not exist` AnalysisException if the file name contains '*'. It is misleading since it occurs when there exist some matched files. Also, it was a supported feature in Spark 1.6.2. This PR aims to support wildcard characters in filename for `LOAD DATA LOCAL INPATH` SQL command like Spark 1.6.2.
**Reported Error Scenario**
```scala
scala> sql("CREATE TABLE t(a string)")
res0: org.apache.spark.sql.DataFrame = []
scala> sql("LOAD DATA LOCAL INPATH '/tmp/x*' INTO TABLE t")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /tmp/x*;
```
## How was this patch tested?
Pass the Jenkins test with a new test case.
Author: Dongjoon Hyun <[email protected]>
Closes apache#15376 from dongjoon-hyun/SPARK-17796.
What changes were proposed in this pull request?
Currently, Spark 2.0 raises an
input path does not existAnalysisException if the file name contains '*'. It is misleading since it occurs when there exist some matched files. Also, it was a supported feature in Spark 1.6.2. This PR aims to support wildcard characters in filename forLOAD DATA LOCAL INPATHSQL command like Spark 1.6.2.Reported Error Scenario
How was this patch tested?
Pass the Jenkins test with a new test case.