
Conversation

@patrickotoole

Added recursive directory search to fileInputStream. This lets Spark find files in the subdirectories rather than just the parent directory.
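The idea can be sketched against the local filesystem with `java.io.File`; the actual patch targets Hadoop's `FileSystem`/`Path` API, and `listFilesRecursively` is an illustrative name, not the patch's helper:

```scala
import java.io.File

// Illustrative sketch of recursive file discovery: collect every plain file
// under `dir`, descending into subdirectories. Names here are hypothetical;
// the real patch walks Hadoop FileStatus objects, not java.io.File.
def listFilesRecursively(dir: File): Seq[File] = {
  val entries = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
  val (subDirs, files) = entries.partition(_.isDirectory)
  files ++ subDirs.flatMap(listFilesRecursively)
}
```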

@tdas
Contributor

tdas commented Apr 24, 2014

Can you please add a JIRA for this and add the JIRA number in the title, like other PRs.

@tdas
Contributor

tdas commented Apr 24, 2014

Also, please add a unit test for this use case in the InputStreamsSuite.

Contributor

This looks like an API change - please add a default value to `recursive`.

Author

I have included a default value on the FileInputDStream but not on the API itself.

Wondering if we want to introduce default values to the more granular version of the API. Currently, it looks like the exposed API essentially has two versions for these methods -- one that assumes default values and one that exposes all the parameters of the DStream constructor.

Thoughts?
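For illustration, a defaulted parameter keeps existing call sites compiling unchanged while still exposing the new flag. The signature below is a hypothetical sketch, not the actual StreamingContext API:

```scala
// Hypothetical sketch: a defaulted `recursive` flag means existing callers
// (who pass only `directory`) keep working, while new callers can opt in.
def fileStream(directory: String, recursive: Boolean = false): (String, Boolean) =
  (directory, recursive)

// Existing call site, unchanged:
val old = fileStream("/data")

// New call site opting into recursion:
val nested = fileStream("/data", recursive = true)
```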


In which version of Spark can we get the API with support for nested directory streaming?

@patrickotoole patrickotoole changed the title Add recursive directory file search to fileInputStream SPARK-1795 - Add recursive directory file search to fileInputStream May 11, 2014


If the input directory is already the lowest-level directory, it will not consider any files in it.
Example: consider the following directory:
/a/file1.txt
/a/file2.txt and so on.
If the input directory is given as "/a", there will be no output.


We can call it like this:

    val filePaths: Array[Path] =
      if (recursive) recursiveListDirs(List(fs.getFileStatus(new Path(directoryPath)))).toArray

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@tdas
Contributor

tdas commented Dec 24, 2014

@patrickotoole Sorry for this patch sitting around here for so long without any attention. Mind updating this patch to the latest code?

@srowen
Member

srowen commented Jan 23, 2015

I suggest we close this in favor of #2765 since it implements recursion with max depth, merges, and was active more recently.

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Apr 27, 2015

Mind closing this PR?

@asfgit asfgit closed this in 8dee274 Apr 29, 2015
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
One-line code change which is the initial patch for [HADOOP-16248](https://issues.apache.org/jira/browse/HADOOP-16248). See internal ticket number 87611 for more context.
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
* [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data (apache#531)
* [SPARK-27514][SQL] Skip collapsing windows with empty window expressions (apache#538)
* Bump hadoop to 2.9.2-palantir.5 (apache#537)
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…ileStreamSink.hasMetadata (apache#537)

### What changes were proposed in this pull request?

This pull request moves path initialization into the try-catch block in FileStreamSink.hasMetadata. Exceptions from invalid paths can then be handled consistently with the other path-related exceptions in the existing try-catch block, so errors fall into the correct code branches to be handled.
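The pattern described above can be sketched as follows; `hasMetadataSketch` and its body are illustrative stand-ins, not the actual FileStreamSink code:

```scala
// Illustrative sketch of the fix: construct the path *inside* the try block,
// so an exception from a malformed path falls into the same catch branch
// that already handles other path-related failures.
def hasMetadataSketch(rawPath: String): Boolean =
  try {
    val uri = java.net.URI.create(rawPath) // may throw on a malformed path
    uri.getScheme != null                  // stand-in for the real metadata check
  } catch {
    case _: IllegalArgumentException => false // malformed paths handled here too
  }
```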

### Why are the changes needed?

Bug fix for improperly handled exceptions in FileStreamSink.hasMetadata.

### Does this PR introduce _any_ user-facing change?

No. An invalid path is still invalid, but it now fails in the correct places.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47471 from yaooqinn/SPARK-48991.

Authored-by: Kent Yao <[email protected]>

(cherry picked from commit d68cde8)

Signed-off-by: Kent Yao <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
