
Conversation

@patrickotoole

Added recursive directory search to fileInputStream. This lets Spark find files in the subdirectories rather than just the parent directory.
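The idea can be sketched against the local filesystem with `java.io.File`; the actual patch targets Hadoop's `FileSystem`/`Path` API, and `listFilesRecursively` is an illustrative name, not the patch's helper:

```scala
import java.io.File

// Illustrative sketch of recursive file discovery: collect every plain file
// under `dir`, descending into subdirectories. Names here are hypothetical;
// the real patch walks Hadoop FileStatus objects, not java.io.File.
def listFilesRecursively(dir: File): Seq[File] = {
  val entries = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
  val (subDirs, files) = entries.partition(_.isDirectory)
  files ++ subDirs.flatMap(listFilesRecursively)
}
```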

@tdas
Contributor

tdas commented Apr 24, 2014

Can you please add a JIRA for this and add the JIRA number in the title, like other PRs.

@tdas
Contributor

tdas commented Apr 24, 2014

Also, please add a unit test for this use case in the InputStreamsSuite.

Contributor

This looks like an API change - please add a default value to `recursive`.

Author

I have included a default value on the FileInputDStream but not on the API itself.

Wondering if we want to introduce default values to the more granular version of the API. Currently, it looks like the exposed API essentially has two versions for these methods -- one that assumes default values and one that exposes all the parameters of the DStream constructor.

Thoughts?
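For illustration, a defaulted parameter keeps existing call sites compiling unchanged while still exposing the new flag. The signature below is a hypothetical sketch, not the actual StreamingContext API:

```scala
// Hypothetical sketch: a defaulted `recursive` flag means existing callers
// (who pass only `directory`) keep working, while new callers can opt in.
def fileStream(directory: String, recursive: Boolean = false): (String, Boolean) =
  (directory, recursive)

// Existing call site, unchanged:
val old = fileStream("/data")

// New call site opting into recursion:
val nested = fileStream("/data", recursive = true)
```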


In which version of Spark can we get the API with support for nested directory streaming?

@patrickotoole patrickotoole changed the title Add recursive directory file search to fileInputStream SPARK-1795 - Add recursive directory file search to fileInputStream May 11, 2014


If the input directory is already the lowest-level directory, it will not consider any files in it.
Example: consider the following directory:
/a/file1.txt
/a/file2.txt and so on.
If the input directory is given as "/a", there will be no output.


We can call it like this:

    val filePaths: Array[Path] =
      if (recursive) recursiveListDirs(List(fs.getFileStatus(new Path(directoryPath)))).toArray

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@tdas
Contributor

tdas commented Dec 24, 2014

@patrickotoole Sorry for this patch sitting around here for so long without any attention. Mind updating this patch to the latest code?

@srowen
Member

srowen commented Jan 23, 2015

I suggest we close this in favor of #2765 since it implements recursion with max depth, merges, and was active more recently.

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Apr 27, 2015

Mind closing this PR?

@asfgit asfgit closed this in 8dee274 Apr 29, 2015
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
One-line code change which is the initial patch for [HADOOP-16248](https://issues.apache.org/jira/browse/HADOOP-16248). See internal ticket number 87611 for more context.
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
* [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data (apache#531)
* [SPARK-27514][SQL] Skip collapsing windows with empty window expressions (apache#538)
* Bump hadoop to 2.9.2-palantir.5 (apache#537)
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…ileStreamSink.hasMetadata (apache#537)

### What changes were proposed in this pull request?

This pull request moves path initialization into the try-catch block in FileStreamSink.hasMetadata. Exceptions from invalid paths can then be handled consistently with the other path-related exceptions in the existing try-catch block, so errors fall into the correct code branches to be handled.
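The pattern described above can be sketched as follows; `hasMetadataSketch` and its body are illustrative stand-ins, not the actual FileStreamSink code:

```scala
// Illustrative sketch of the fix: construct the path *inside* the try block,
// so an exception from a malformed path falls into the same catch branch
// that already handles other path-related failures.
def hasMetadataSketch(rawPath: String): Boolean =
  try {
    val uri = java.net.URI.create(rawPath) // may throw on a malformed path
    uri.getScheme != null                  // stand-in for the real metadata check
  } catch {
    case _: IllegalArgumentException => false // malformed paths handled here too
  }
```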

### Why are the changes needed?

Bug fix for improperly handled exceptions in FileStreamSink.hasMetadata.

### Does this PR introduce _any_ user-facing change?

No. An invalid path is still invalid, but it now fails in the correct places.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47471 from yaooqinn/SPARK-48991.

Authored-by: Kent Yao <[email protected]>

(cherry picked from commit d68cde8)

Signed-off-by: Kent Yao <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
