[SPARK-17796][SQL] Support wildcard character in filename for LOAD DATA LOCAL INPATH #15376

dongjoon-hyun · 2016-10-06T07:42:02Z

What changes were proposed in this pull request?

Currently, Spark 2.0 raises an "input path does not exist" AnalysisException if the file name contains '*'. This is misleading because the error occurs even when matching files exist. Wildcards in the file name were also supported in Spark 1.6.2. This PR restores support for wildcard characters in the file name of the LOAD DATA LOCAL INPATH SQL command, as in Spark 1.6.2.

Reported Error Scenario

scala> sql("CREATE TABLE t(a string)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("LOAD DATA LOCAL INPATH '/tmp/x*' INTO TABLE t")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /tmp/x*;

How was this patch tested?

Pass the Jenkins test with a new test case.

SparkQA · 2016-10-06T09:20:44Z

Test build #66437 has finished for PR 15376 at commit dfdfc46.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-10-06T16:52:23Z

The only test failure is unrelated to this PR. The test passes locally.

[info] - MAP append/extract *** FAILED *** (2 milliseconds)
[info]   java.lang.IllegalArgumentException:

dongjoon-hyun · 2016-10-06T16:52:31Z

Retest this please.

SparkQA · 2016-10-06T19:10:14Z

Test build #66452 has finished for PR 15376 at commit dfdfc46.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-10T04:33:34Z

Test build #66618 has finished for PR 15376 at commit f328f3a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-12T23:21:23Z

Test build #66842 has finished for PR 15376 at commit 982bf3e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-10-13T09:28:25Z

Hm, so this did use to work? I always thought inpath was a single file or directory. The Hive docs suggest this is the case, and we've tried to follow Hive I guess.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

dongjoon-hyun · 2016-10-14T04:35:49Z

Oh. Indeed. I thought it was supported since it works on Hive 1.2.

hive> load data local inpath '/data/t/*.txt' INTO TABLE x;
Loading data to table default.x
Table default.x stats: [numFiles=12, totalSize=613224000]
OK
Time taken: 3.712 seconds

According to your advice and the URL, it does not seem to be a normal or recommended usage.
Thank you, @srowen. Would you prefer that I close this PR and the issue?

srowen

Hm, looks like you're right, empirically. OK, I could see an argument for this. Let me leave some comments here.

srowen · 2016-10-14T09:01:22Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

I usually write .init to mean "all but the last element"
Hm, if you're using the local file system here, can you use the JDK java.nio.file.Path API to do the parsing of elements?

It now uses the java.nio.file.Path APIs.

srowen · 2016-10-14T09:02:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

Nit, you might get one reference to FileSystems.getDefault and reuse it rather than keep retrieving it.

dongjoon-hyun · 2016-10-14T19:40:46Z

Thank you for the review, @srowen. I'll update the PR.
I'll also investigate further whether there is a reason not to recommend this usage.


dongjoon-hyun · 2016-10-17T08:53:12Z

The KafkaSourceSuite failure seems to be unrelated to this PR.

[info] KafkaSourceSuite:
[info] - cannot stop Kafka stream (1 minute, 1 second)
[info] - subscribing topic by name from latest offsets *** FAILED *** (10 seconds, 511 milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 669 times over 10.012014778000001 seconds. Last failure message: assertion failed: Partition [topic-2, 0] metadata not propagated after timeout. (KafkaTestUtils.scala:312)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:

dongjoon-hyun · 2016-10-17T08:53:23Z

Retest this please.

SparkQA · 2016-10-17T10:07:32Z

Test build #67052 has finished for PR 15376 at commit 401c4ee.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-17T11:02:40Z

Test build #67064 has finished for PR 15376 at commit 401c4ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Yeah, looking pretty good. It's another interesting question about how much to follow Hive's behavior, but given it's a simple change and aligns more with Hive I think it's good.

srowen · 2016-10-17T11:04:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+          val fileSystem = FileSystems.getDefault
+          val pathPattern = fileSystem.getPath(filePath)
+          val dir = pathPattern.getParent.toString
+          val filePattern = pathPattern.getName(pathPattern.getNameCount - 1).toString


I think getFileName returns the last element in the path?

Thanks. I'll use that.
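A minimal sketch of the path parsing discussed in this thread, assuming a Unix-like default file system (on Windows, '*' is not a legal path character, so getPath would reject it). The object name PathPatternParts and the method split are hypothetical and are not part of the patch. It shows that getFileName yields the same last element as getName(getNameCount - 1), and that the default FileSystem only needs to be retrieved once.

```scala
import java.nio.file.{FileSystems, Path}

// Hypothetical illustration, not the code merged in this PR.
object PathPatternParts {
  // Splits a path such as "/tmp/x*" into its parent directory and its
  // last element (the file-name pattern).
  def split(filePath: String): (String, String) = {
    // Retrieve the default file system once and reuse the reference.
    val fileSystem = FileSystems.getDefault
    val pathPattern: Path = fileSystem.getPath(filePath)
    // getParent returns null for a path with no parent (for example "x*"),
    // so this sketch requires a path with at least one directory element.
    val parent = pathPattern.getParent
    require(parent != null, s"path has no parent directory: $filePath")
    // getFileName returns the last name element, the same value as
    // pathPattern.getName(pathPattern.getNameCount - 1).
    val filePattern = pathPattern.getFileName.toString
    (parent.toString, filePattern)
  }

  def main(args: Array[String]): Unit = {
    val (dir, filePattern) = split("/tmp/x*")
    println(s"dir=$dir, filePattern=$filePattern")
  }
}
```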

srowen · 2016-10-17T11:11:41Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+          if (files == null) {
+            false
+          } else {
+            val matcher = fileSystem.getPathMatcher("glob:" + filePattern)


I was looking up how this works, and found http://stackoverflow.com/a/14164134/64174 which suggests that this might not work unless the glob starts with "**". However, I wonder if you can just pass this method "glob:" + pathPattern in this case anyway to have it match the whole absolute path?

Yes. It matches the whole absolute path.

scala> val fs = java.nio.file.FileSystems.getDefault
fs: java.nio.file.FileSystem = sun.nio.fs.MacOSXFileSystem@782dc5

scala> fs.getPathMatcher("glob:/x/1.dat").matches(fs.getPath("/x/1.dat"))
res0: Boolean = true

scala> fs.getPathMatcher("glob:/x/*.dat").matches(fs.getPath("/x/1.dat"))
res1: Boolean = true

Ah, I think I missed your point. I will update the code to use the absolute path here, too.
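The difference between matching on the file name alone and matching on the whole absolute path can be sketched as follows. This is only an illustration under the assumption of a Unix-like default file system; GlobMatchDemo and matchesGlob are hypothetical names. In the JDK glob syntax, '*' does not cross directory boundaries and the pattern must match the entire path, so a bare file-name glob is expected to match a bare file name but not an absolute path, while a glob built from the absolute path pattern matches absolute paths.

```scala
import java.nio.file.FileSystems

// Hypothetical illustration, not the code merged in this PR.
object GlobMatchDemo {
  private val fs = FileSystems.getDefault

  // True when `path` matches the glob `pattern` in full.
  def matchesGlob(pattern: String, path: String): Boolean =
    fs.getPathMatcher("glob:" + pattern).matches(fs.getPath(path))

  def main(args: Array[String]): Unit = {
    // Absolute pattern against an absolute path (the case shown in the REPL session).
    println(matchesGlob("/x/*.dat", "/x/1.dat"))
    // File-name-only pattern against an absolute path.
    println(matchesGlob("*.dat", "/x/1.dat"))
    // File-name-only pattern against a file name.
    println(matchesGlob("*.dat", "1.dat"))
  }
}
```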

srowen · 2016-10-17T11:12:37Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

+  test("SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH") {
+    withTempDir { dir =>
+      for (i <- 1 to 3) {
+        val writer = new PrintWriter(new File(s"$dir/part-r-0000$i"))


PS I think you can use Files.write from Guava to do this a little more easily

Sure, I'll use the Guava one here, too.

SparkQA · 2016-10-17T19:44:59Z

Test build #67085 has finished for PR 15376 at commit 933ad85.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-10-17T21:25:57Z

Retest this please.

SparkQA · 2016-10-17T21:26:05Z

Test build #67083 has finished for PR 15376 at commit c74191b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-17T23:35:17Z

Test build #67089 has finished for PR 15376 at commit 933ad85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-10-19T09:50:37Z

Thank you for your review and approval, @srowen !

srowen · 2016-10-20T08:53:25Z

Merged to master

dongjoon-hyun · 2016-10-20T10:59:10Z

Thank you, @srowen !

cloud-fan · 2016-10-20T12:32:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+          val fileSystem = FileSystems.getDefault
+          val pathPattern = fileSystem.getPath(filePath)
+          val dir = pathPattern.getParent.toString
+          if (dir.contains("*")) {


what if the * appears in the grandparent dir? e.g. dir*/subdir/fileName

It will result in an AnalysisException and that seems like the intended behavior. Only a * in the file itself is supported.

Thank you, @cloud-fan and @srowen . Yes. It's the intended behavior of this PR.
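The behavior described in this thread (a wildcard is accepted only in the last path element, and matching is done on absolute paths) can be sketched roughly as below. This is not the Spark implementation: LocalWildcardSketch and matchingFiles are hypothetical names, IllegalArgumentException stands in for Spark's AnalysisException, and a Unix-like file system is assumed. Other glob metacharacters in the directory part (such as '[' or '{') are not handled here.

```scala
import java.io.File
import java.nio.file.FileSystems

// Hypothetical illustration, not the code merged in this PR.
object LocalWildcardSketch {
  // Returns the files in the parent directory of `filePath` whose absolute
  // path matches the glob formed from `filePath`.
  def matchingFiles(filePath: String): Seq[File] = {
    val fileSystem = FileSystems.getDefault
    val pathPattern = fileSystem.getPath(filePath)
    val parent = pathPattern.getParent
    require(parent != null, s"path has no parent directory: $filePath")
    val dir = parent.toString
    // Only the file name may contain '*'. A wildcard in any ancestor
    // directory (for example dir*/subdir/fileName) is rejected.
    if (dir.contains("*")) {
      throw new IllegalArgumentException(
        s"wildcard is only supported in the file name: $filePath")
    }
    val files = new File(dir).listFiles()
    if (files == null) {
      // listFiles returns null when dir does not exist or is not a directory.
      Seq.empty
    } else {
      // Match on the whole absolute path rather than on the file name alone.
      val matcher = fileSystem.getPathMatcher("glob:" + pathPattern.toAbsolutePath)
      files.filter(f => matcher.matches(fileSystem.getPath(f.getAbsolutePath))).toSeq
    }
  }
}
```

For example, with files part-r-00001 through part-r-00003 in a directory, a pattern ending in part-r-0000* would select those three files, while a pattern such as /tmp/dir*/subdir/fileName would raise the exception.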


srowen reviewed Oct 14, 2016


dongjoon-hyun added 2 commits October 16, 2016 21:43

[SPARK-17796][SQL] Support wildcard character in filename for LOAD DA…

523ced2

…TA LOCAL INPATH

Address comments.

401c4ee

srowen requested changes Oct 17, 2016


dongjoon-hyun added 2 commits October 17, 2016 12:01

Use Guava Files.write and getFileName.

c74191b

Use absolute path pattern match.

933ad85

srowen approved these changes Oct 19, 2016


asfgit closed this in 986a3b8 Oct 20, 2016

cloud-fan reviewed Oct 20, 2016


dongjoon-hyun deleted the SPARK-17796 branch November 7, 2016 00:50
