Conversation

@wangxiaojing
Contributor

For text files, there is the method streamingContext.textFileStream(dataDirectory).
This improvement extends streaming to support subdirectories: Spark Streaming can monitor the subdirectories of dataDirectory and process any files created in them.
E.g. streamingContext.textFileStream("/test").
Given the directory contents:
/test/file1
/test/file2
/test/dr/file1
If the directory "/test/dr/" gets a new file "file2", Spark Streaming can process that file.
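
A minimal sketch of the intended usage (the application name and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SubdirStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SubdirStreamExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // With this improvement, files created under /test and its
    // subdirectories (e.g. /test/dr/file2) are picked up as well.
    val lines = ssc.textFileStream("/test")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}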

@AmplabJenkins

Can one of the admins verify this patch?


Member

Remove this System.out

@jerryshao
Contributor

Hi @wangxiaojing, a small suggestion: why not make this improvement more flexible by adding a parameter to control the search depth of directories? That would be more general than the current 1-depth implementation. Like:

class FileInputDStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K,V] : ClassTag](
    @transient ssc_ : StreamingContext,
    directory: String,
    filter: Path => Boolean = FileInputDStream.defaultFilter,
    depth: Int = 1,
    newFilesOnly: Boolean = true)

People can use this parameter to control the search depth; the default of 1 keeps the same semantics as the current code.
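
A hypothetical call against this proposed signature might look as follows (the depth parameter does not exist in the current API, and FileInputDStream is internal to Spark, so a user-facing version would more likely be exposed through StreamingContext.fileStream):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical: monitor the directory and its subdirectories up to
// two levels deep; the filter keeps its default value.
val stream = new FileInputDStream[LongWritable, Text, TextInputFormat](
  ssc, "/test", depth = 2)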

Besides, some whitespace-related code style issues should be changed to align with the Scala style guide.

@wangxiaojing
Contributor Author

Hi @jerryshao, I changed the code to use this parameter to control the search depth, but if the depth is greater than 1, the ignore time is not reasonable: when a second-level subdirectory gets a new file, the modification time of the first-level subdirectory does not change. For example:
The streaming job monitors the directory /tmp/.
The directory structure is:
2014-10-16 19:17 /tmp/spark1
2014-10-16 19:17 /tmp/spark1/spark2

After a file is created in /tmp/spark1/spark2:

2014-10-16 19:17 /tmp/spark1
2014-10-16 19:18 /tmp/spark1/spark2
2014-10-16 19:18 /tmp/spark1/spark2/file

If you use the ignore time for filtering, the first-level subdirectory is always ignored. Can you give me some advice?

@jerryshao
Contributor

Can we just check the modification time of the files, not the directories, to filter out unqualified files? I'm not sure about this.

cc @tdas , mind taking a look at this?
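
A minimal sketch of that idea, assuming the Hadoop FileSystem API (the helper name and structure are illustrative, not the PR's actual code):

import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively list files up to `depth` levels below `dir`, keeping only
// files whose own modification time is at least `ignoreTime`. Directory
// timestamps are never consulted, so a stale parent directory cannot
// hide a fresh file in a nested subdirectory.
def listRecentFiles(fs: FileSystem, dir: Path, depth: Int,
    ignoreTime: Long): Seq[Path] = {
  fs.listStatus(dir).toSeq.flatMap { status =>
    if (status.isDirectory && depth > 1) {
      listRecentFiles(fs, status.getPath, depth - 1, ignoreTime)
    } else if (!status.isDirectory && status.getModificationTime >= ignoreTime) {
      Seq(status.getPath)
    } else {
      Seq.empty[Path]
    }
  }
}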

@wangxiaojing
Contributor Author

@jerryshao @tdas First, walk all the directories according to the depth, then filter the directories by comparing their modification time against the ignore time. Is this method optimal? Thanks.

@wangxiaojing
Contributor Author

@liancheng

Contributor

  • Add space after ,
  • Remove space before :
  • Add space after :
  • Add space after =

@wangxiaojing force-pushed the spark-3586 branch 2 times, most recently from c6f1c75 to d1c3399 on May 20, 2015 08:41
@erfangc

erfangc commented May 30, 2015

This feature would definitely be helpful. Thanks to @wangxiaojing and whoever continues to work on this PR!

@zsxwing
Member

zsxwing commented Jun 1, 2015

@wangxiaojing could you update this PR? It conflicts with master.

Member

Could you change System.currentTimeMillis to clock.getTimeMillis()?
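
For reference, a minimal stand-in for the kind of clock abstraction meant here (Spark's own Clock trait is private to Spark; the names below are illustrative):

// Injecting a clock instead of calling System.currentTimeMillis directly
// lets unit tests substitute a manually advanced clock.
trait Clock { def getTimeMillis(): Long }

class SystemClock extends Clock {
  def getTimeMillis(): Long = System.currentTimeMillis()
}

class Scanner(clock: Clock = new SystemClock) {
  // Time lookups go through the injected clock.
  def timestampScan(): Long = clock.getTimeMillis()
}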

@andrewor14
Contributor

Hi @wangxiaojing, it seems that #6588 is an updated version of this PR. Would you mind closing this patch, since it no longer merges cleanly with master?

@wangxiaojing
Contributor Author

@andrewor14 ok.
