[SPARK-3586][streaming]Support nested directories in Spark Streaming #2765
Conversation
Can one of the admins verify this patch?
Remove this System.out
Hi @wangxiaojing, a small suggestion: why not make this improvement more flexible by adding a parameter that controls the search depth of directories? That would be more general than the current 1-depth implementation. Something like:

```scala
class FileInputDStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K, V]: ClassTag](
    @transient ssc_ : StreamingContext,
    directory: String,
    filter: Path => Boolean = FileInputDStream.defaultFilter,
    depth: Int = 1,
    newFilesOnly: Boolean = true)
```

People can use this parameter to control the search depth; the default of 1 keeps the same semantics as the current code. Besides, some whitespace-related code style issues should be fixed to align with Scala style.
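To make the intent of such a depth parameter concrete, here is a rough, hypothetical sketch (not code from this PR, and the helper name is made up) of a depth-limited directory walk over the Hadoop FileSystem API; with depth = 1 it reduces to the current one-level listing:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// List the plain files under `dir`, recursing into subdirectories only while
// `depth` is greater than 1, so depth = 1 reproduces today's flat behaviour.
def listFilesToDepth(fs: FileSystem, dir: Path, depth: Int): Seq[FileStatus] = {
  val statuses = fs.listStatus(dir).toSeq
  val files = statuses.filterNot(_.isDirectory)
  val nested =
    if (depth > 1) {
      statuses.filter(_.isDirectory)
        .flatMap(d => listFilesToDepth(fs, d.getPath, depth - 1))
    } else {
      Seq.empty[FileStatus]
    }
  files ++ nested
}
```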
Hi @jerryshao, I have changed the code to use this parameter to control the search depth, but when the depth is greater than 1 the ignore-time check is no longer reasonable: if a second-level subdirectory gets a new file, the modification time of the first-level subdirectory does not change. For example, for a file created in /tmp/spark1/spark2, the modification time of /tmp/spark1 stays at 2014-10-16 19:17. If the ignore time is used for filtering, the first-level subdirectory is always ignored. Can you give me some advice?
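The directory-mtime behaviour described above is easy to reproduce locally. A tiny illustration (local filesystem semantics, stand-in paths, not code from this PR):

```scala
import java.nio.file.Files

object DirMtimeDemo extends App {
  // Stand-ins for /tmp/spark1 and /tmp/spark1/spark2 from the comment above.
  val root = Files.createTempDirectory("spark1")
  val child = Files.createDirectory(root.resolve("spark2"))

  val before = Files.getLastModifiedTime(root)
  Files.createFile(child.resolve("newFile"))   // new file two levels below root
  val after = Files.getLastModifiedTime(root)

  // On typical filesystems the two timestamps are equal: creating the file
  // updates spark2's mtime, not spark1's, so an ignore-time check on the
  // first-level directory never notices the new file.
  println(s"root mtime unchanged: ${before == after}")
}
```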
Can we just check the modification time of the file, not the directory, to filter out unqualified files? I'm not sure about this. cc @tdas, mind taking a look at this?
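A minimal sketch of that idea (hypothetical helper name, not PR code): given the FileStatus entries produced by a directory walk, compare each file's own modification time against the ignore threshold and never consult directory timestamps.

```scala
import org.apache.hadoop.fs.FileStatus

// Keep only plain files whose own modification time is at or after the
// ignore threshold; directory mtimes are never used.
def selectNewFiles(statuses: Seq[FileStatus], ignoreTime: Long): Seq[FileStatus] =
  statuses.filter(s => !s.isDirectory && s.getModificationTime >= ignoreTime)
```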
@jerryshao @tdas First, walk all the directories according to the depth, then filter the directories by whether their modification time is newer than the ignore time. Is this method optimal? Thanks.
- Add space after ","
- Remove space before ":"
- Add space after ":"
- Add space after "=" (a small before/after example follows below)
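For illustration only (a made-up signature, not code from this PR), the four rules applied together look like this:

```scala
// Before (flagged style):  def open(directory:String ,depth :Int=1): Unit = ()
// After, following the Scala style guide:
def open(directory: String, depth: Int = 1): Unit = ()
```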
The branch was force-pushed from c6f1c75 to d1c3399.
This feature would definitely be helpful. Thanks to @wangxiaojing and whoever continues to work on this PR!
@wangxiaojing could you update this PR? It conflicts with master.
Could you change System.currentTimeMillis to clock.getTimeMillis()?
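A minimal sketch of the suggested change, assuming Spark's org.apache.spark.util.Clock trait (which provides getTimeMillis()); the names below are illustrative, not the PR's actual fields:

```scala
import org.apache.spark.util.Clock

// Compute the modification-time threshold from an injected Clock instead of the
// wall clock, so tests can drive time deterministically with a manual clock.
def ignoreThreshold(clock: Clock, rememberWindowMs: Long): Long =
  clock.getTimeMillis() - rememberWindowMs   // was: System.currentTimeMillis() - rememberWindowMs
```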
Hi @wangxiaojing, it seems that #6588 is an updated version of this PR. Would you mind closing this patch, since it no longer merges cleanly with master?
@andrewor14 ok. |
For text files, the method is streamingContext.textFileStream(dataDirectory).
This improvement makes Spark Streaming support subdirectories: it can monitor the subdirectories of dataDirectory and process any files created in them.
For example, with streamingContext.textFileStream("/test") and the following directory contents:
/test/file1
/test/file2
/test/dr/file1
if a new file "file2" is created in the directory "/test/dr/", Spark Streaming can process it.
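As a usage sketch of the behaviour described above (assuming the nested-directory support this PR proposes, which stock Spark does not have at the time of this discussion):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NestedDirStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NestedDirStreamExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Monitor /test; with nested-directory support, a file later created under
    // /test/dr/ would be picked up just like one created directly under /test/.
    val lines = ssc.textFileStream("/test")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```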