@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Oct 6, 2016

What changes were proposed in this pull request?

Currently, Spark 2.0 raises an "input path does not exist" AnalysisException if the file name contains '*'. This is misleading because the error occurs even when matching files exist. Wildcards in the file name were supported in Spark 1.6.2, so this PR restores that support for the LOAD DATA LOCAL INPATH SQL command.

Reported Error Scenario

scala> sql("CREATE TABLE t(a string)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("LOAD DATA LOCAL INPATH '/tmp/x*' INTO TABLE t")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /tmp/x*;

How was this patch tested?

Pass the Jenkins test with a new test case.

SparkQA commented Oct 6, 2016

Test build #66437 has finished for PR 15376 at commit dfdfc46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The only test failure is unrelated to this PR; the test passes locally.

[info] - MAP append/extract *** FAILED *** (2 milliseconds)
[info]   java.lang.IllegalArgumentException:

@dongjoon-hyun
Member Author

Retest this please.

SparkQA commented Oct 6, 2016

Test build #66452 has finished for PR 15376 at commit dfdfc46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 10, 2016

Test build #66618 has finished for PR 15376 at commit f328f3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 12, 2016

Test build #66842 has finished for PR 15376 at commit 982bf3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 13, 2016

Hm, so this used to work? I always thought inpath had to be a single file or directory. The Hive docs suggest this is the case, and I guess we've tried to follow Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

@dongjoon-hyun
Member Author

Oh, indeed. I thought it was supported since it works on Hive 1.2.

hive> load data local inpath '/data/t/*.txt' INTO TABLE x;
Loading data to table default.x
Table default.x stats: [numFiles=12, totalSize=613224000]
OK
Time taken: 3.712 seconds

Based on your comment and the linked page, this does not seem to be a standard or recommended usage.
Thank you, @srowen. Would you prefer that I close this PR and the issue?

Member

@srowen srowen left a comment

Hm, looks like you're right, empirically. OK, I could see an argument for this. Let me leave some comments here.

Member

I usually write .init to mean "all but the last element"
Hm, if you're using the local file system here, can you use the JDK java.nio.file.Path API to do the parsing of elements?

Member Author

Now, it uses java.nio.file.Path APIs.

Member

Nit, you might get one reference to FileSystems.getDefault and reuse it rather than keep retrieving it.

@dongjoon-hyun
Member Author

Thank you for the review, @srowen. I'll update the PR.
I'll also look into whether there is a reason this usage is not recommended.

@dongjoon-hyun
Member Author

The KafkaSourceSuite failure seems unrelated to this PR.

[info] KafkaSourceSuite:
[info] - cannot stop Kafka stream (1 minute, 1 second)
[info] - subscribing topic by name from latest offsets *** FAILED *** (10 seconds, 511 milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 669 times over 10.012014778000001 seconds. Last failure message: assertion failed: Partition [topic-2, 0] metadata not propagated after timeout. (KafkaTestUtils.scala:312)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:

@dongjoon-hyun
Member Author

Retest this please.

SparkQA commented Oct 17, 2016

Test build #67052 has finished for PR 15376 at commit 401c4ee.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 17, 2016

Test build #67064 has finished for PR 15376 at commit 401c4ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Yeah, looking pretty good. It's another interesting question about how much to follow Hive's behavior, but given it's a simple change and aligns more with Hive I think it's good.

val fileSystem = FileSystems.getDefault
val pathPattern = fileSystem.getPath(filePath)
val dir = pathPattern.getParent.toString
val filePattern = pathPattern.getName(pathPattern.getNameCount - 1).toString

Member

I think getFileName returns the last element in the path?

Member Author

Thanks. I'll use that.
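
For readers following this exchange, here is a minimal sketch of the equivalence being discussed, assuming a Unix-like default file system where '*' is a legal character in a path. The object name `PathParts`, its two helper methods, and the example path are hypothetical and not part of the patch.

```scala
import java.nio.file.{FileSystems, Path}

object PathParts {
  private val fs = FileSystems.getDefault

  // Last element via an explicit index, as in the snippet under review.
  def lastByIndex(filePath: String): Path = {
    val p = fs.getPath(filePath)
    p.getName(p.getNameCount - 1)
  }

  // Last element via getFileName, as suggested in the review.
  def lastByFileName(filePath: String): Path =
    fs.getPath(filePath).getFileName

  def main(args: Array[String]): Unit = {
    val example = "/tmp/data/part-*" // hypothetical path, for illustration only
    println(lastByIndex(example))
    println(lastByFileName(example))
  }
}
```

The two forms differ only for a path with no name elements, such as the root: `getFileName` returns null there, while `getName` with a negative index throws an IllegalArgumentException.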

if (files == null) {
  false
} else {
  val matcher = fileSystem.getPathMatcher("glob:" + filePattern)

Member

I was looking up how this works, and found http://stackoverflow.com/a/14164134/64174 which suggests that this might not work unless the glob starts with "**". However, I wonder if you can just pass this method "glob:" + pathPattern in this case anyway to have it match the whole absolute path?

Member Author

Yes. It matches the whole absolute path.

scala> val fs = java.nio.file.FileSystems.getDefault
fs: java.nio.file.FileSystem = sun.nio.fs.MacOSXFileSystem@782dc5

scala> fs.getPathMatcher("glob:/x/1.dat").matches(fs.getPath("/x/1.dat"))
res0: Boolean = true

scala> fs.getPathMatcher("glob:/x/*.dat").matches(fs.getPath("/x/1.dat"))
res1: Boolean = true

Member Author

Ah, I think I missed your point. I will update the code to use absolute path here, too.
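
To illustrate why the pattern and the candidate path should be of the same kind (both absolute, or both bare file names), here is a small sketch of how a JDK glob matcher treats directory boundaries. It assumes a Unix-like default file system; the object name `GlobScope` and its `matches` helper are hypothetical, and the paths are examples only.

```scala
import java.nio.file.FileSystems

object GlobScope {
  private val fs = FileSystems.getDefault

  // True when `path` matches the glob `pattern` under the default file system.
  def matches(pattern: String, path: String): Boolean =
    fs.getPathMatcher("glob:" + pattern).matches(fs.getPath(path))

  def main(args: Array[String]): Unit = {
    println(matches("*.dat", "/x/1.dat"))    // name-only pattern vs. absolute path
    println(matches("*.dat", "1.dat"))       // name-only pattern vs. bare file name
    println(matches("/x/*.dat", "/x/1.dat")) // absolute pattern vs. absolute path
    println(matches("**.dat", "/x/1.dat"))   // ** may cross directory boundaries
  }
}
```

Because a single `*` does not cross a directory separator, the first call is expected to return false while the other three return true. This is the mismatch the Stack Overflow answer warns about, and matching the absolute pattern against absolute paths avoids it.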

test("SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH") {
  withTempDir { dir =>
    for (i <- 1 to 3) {
      val writer = new PrintWriter(new File(s"$dir/part-r-0000$i"))

Member

PS I think you can use Files.write from Guava to do this a little more easily

Member Author

Sure, I'll use the Guava one here, too.

SparkQA commented Oct 17, 2016

Test build #67085 has finished for PR 15376 at commit 933ad85.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

SparkQA commented Oct 17, 2016

Test build #67083 has finished for PR 15376 at commit c74191b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 17, 2016

Test build #67089 has finished for PR 15376 at commit 933ad85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you for your review and approval, @srowen !

@srowen
Member

srowen commented Oct 20, 2016

Merged to master

@asfgit asfgit closed this in 986a3b8 Oct 20, 2016
@dongjoon-hyun
Member Author

Thank you, @srowen !

val fileSystem = FileSystems.getDefault
val pathPattern = fileSystem.getPath(filePath)
val dir = pathPattern.getParent.toString
if (dir.contains("*")) {

Contributor

what if the * appears in the grandparent dir? e.g. dir*/subdir/fileName

Member

It will result in an AnalysisException and that seems like the intended behavior. Only a * in the file itself is supported.

Member Author

Thank you, @cloud-fan and @srowen . Yes. It's the intended behavior of this PR.
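
The behavior described in this thread can be sketched roughly as follows. This is not the merged Spark code: it is a standalone approximation pieced together from the snippets quoted in the review, assuming a Unix-like local file system. The object name `WildcardCheck` and the method `hasMatch` are hypothetical, and IllegalArgumentException stands in for Spark's AnalysisException.

```scala
import java.io.File
import java.nio.file.FileSystems

object WildcardCheck {
  // Returns true when at least one entry in the parent directory matches the
  // pattern. A '*' is accepted only in the last path element (the file name).
  // IllegalArgumentException is a stand-in for Spark's AnalysisException.
  def hasMatch(filePath: String): Boolean = {
    val fs = FileSystems.getDefault
    val pathPattern = fs.getPath(filePath).toAbsolutePath
    val dir = Option(pathPattern.getParent).map(_.toString).getOrElse(
      throw new IllegalArgumentException(s"No parent directory: $filePath"))
    if (dir.contains("*")) {
      // Covers a wildcard in the parent, grandparent, or any higher directory.
      throw new IllegalArgumentException(
        s"Wildcard is only supported in the file name: $filePath")
    }
    val files = new File(dir).listFiles()
    if (files == null) {
      false // the directory does not exist or cannot be listed
    } else {
      // Match the absolute pattern against each entry's absolute path.
      val matcher = fs.getPathMatcher("glob:" + pathPattern)
      files.exists(f => matcher.matches(fs.getPath(f.getAbsolutePath)))
    }
  }
}
```

Under these assumptions, a path such as dir*/subdir/fileName is rejected up front because the directory portion contains '*', which corresponds to the AnalysisException mentioned above.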

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
…TA LOCAL INPATH

## What changes were proposed in this pull request?

Currently, Spark 2.0 raises an `input path does not exist` AnalysisException if the file name contains '*'. It is misleading since it occurs when there exist some matched files. Also, it was a supported feature in Spark 1.6.2. This PR aims to support wildcard characters in filename for `LOAD DATA LOCAL INPATH` SQL command like Spark 1.6.2.

**Reported Error Scenario**
```scala
scala> sql("CREATE TABLE t(a string)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("LOAD DATA LOCAL INPATH '/tmp/x*' INTO TABLE t")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /tmp/x*;
```

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun

Closes apache#15376 from dongjoon-hyun/SPARK-17796.
@dongjoon-hyun dongjoon-hyun deleted the SPARK-17796 branch November 7, 2016 00:50
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017