[SPARK-17585][PySpark][Core] PySpark SparkContext.addFile supports adding files recursively #15140
Conversation
Test build #65579 has finished for PR 15140 at commit
    self.assertEqual("Hello World!\n", test_file.readline())
with open(download_path + "/sub_hello/sub_hello.txt") as test_file:
    self.assertEqual("Sub Hello World!\n", test_file.readline())
minor: maybe the above block should be in a separate test like def test_add_file_locally_recursive?
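Along the lines of that suggestion, here is a rough sketch of what a standalone test could look like. This is an assumption-laden illustration, not the code merged in the PR: the class name, local-mode SparkContext setup, and temp-directory handling are hypothetical, while the file layout and assertions mirror the snippet above (Spark's own suite would use the ReusedPySparkTestCase fixture instead).

```python
import os
import shutil
import tempfile
import unittest

from pyspark import SparkContext, SparkFiles


class AddFileRecursiveTest(unittest.TestCase):
    def setUp(self):
        # Hypothetical standalone fixture; the real suite reuses a shared context.
        self.sc = SparkContext("local[2]", "add-file-recursive-test")
        self.temp_dir = tempfile.mkdtemp()

    def tearDown(self):
        self.sc.stop()
        shutil.rmtree(self.temp_dir)

    def test_add_file_locally_recursive(self):
        # Build hello/hello.txt and hello/sub_hello/sub_hello.txt.
        path = os.path.join(self.temp_dir, "hello")
        os.makedirs(os.path.join(path, "sub_hello"))
        with open(os.path.join(path, "hello.txt"), "w") as f:
            f.write("Hello World!\n")
        with open(os.path.join(path, "sub_hello", "sub_hello.txt"), "w") as f:
            f.write("Sub Hello World!\n")

        # Add the whole directory, then verify both files were distributed.
        self.sc.addFile(path, recursive=True)
        download_path = SparkFiles.get("hello")
        with open(os.path.join(download_path, "hello.txt")) as test_file:
            self.assertEqual("Hello World!\n", test_file.readline())
        with open(os.path.join(download_path, "sub_hello", "sub_hello.txt")) as test_file:
            self.assertEqual("Sub Hello World!\n", test_file.readline())


if __name__ == "__main__":
    unittest.main()
```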
Just one minor suggestion, otherwise LGTM
@BryanCutler Updated, thanks for your comments.
Test build #65652 has finished for PR 15140 at commit
@srowen Would you mind having a look at this when you're available? Thanks.
 *
 * A directory can be given if the recursive option is set to true. Currently directories are only
 * supported for Hadoop-supported filesystems.
 */
(Why do we need this?)
Since JavaSparkContext is the Java stub that will be called by PySpark.
You're calling the method on SparkContext:
self._jsc.sc().addFile(path, recursive)
I don't think this needed to be exposed?
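For context, a minimal sketch of the wrapper being discussed; this is not the exact merged source, and the docstring wording is assumed. `self._jsc` is the Py4J handle to JavaSparkContext, and `.sc()` returns the underlying Scala SparkContext, which already provides the two-argument `addFile` overload.

```python
def addFile(self, path, recursive=False):
    """Add a file to be downloaded with this Spark job on every node.

    With recursive=True, `path` may be a directory (currently only for
    Hadoop-supported filesystems).
    """
    # Delegate to the Scala SparkContext reached through the Py4J gateway,
    # so nothing new strictly has to be exposed on JavaSparkContext.
    self._jsc.sc().addFile(path, recursive)
```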
Oh, I see. I found that _jsc.sc() and _jsc are used interchangeably in context.py. I will do some cleanup and unify them in a follow-up. Thanks for your comments.
Sounds good. There may be a reason the Java context is needed for some calls. I suppose that where the SparkContext could be used, that's simpler, but it doesn't really save anything because we wouldn't be able to take methods out of JavaSparkContext. That's why I was hoping to avoid adding a method to it.
@srowen After investigating the code, I found it's not very straightforward to clean up the interfaces in JavaSparkContext, since they are called by both Python and R. On the Python side we can use _jsc.sc() in some cases, but it gets messy if we use both JavaSparkContext and JavaSparkContext.sc on the R side. So I think we should leave it as it is, unless you have another suggestion? Thanks.
Yes, but can we undo this change? It doesn't seem like we need to duplicate this method in the Java API.
I think it's not required to undo this, since I will soon send a PR to support recursively adding files under a directory for SparkR, and it will leverage this API. Thanks.
OK, if that requires the Java context, I get it.
return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)
-    def addFile(self, path):
+    def addFile(self, path, recursive=False):
This basically doesn't change the API, right? You can still call it as before with the same behavior.
It seems reasonable to me overall because it adds parity between the APIs, isn't complex and doesn't change behavior.
Yes, it does not change the existing API.
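A quick illustration of that point, with made-up paths; the one-argument call keeps its old behavior because `recursive` defaults to `False`:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "compat-check")
sc.addFile("data/lookup.txt")                 # existing call form, behavior unchanged
sc.addFile("data/resources", recursive=True)  # new keyword argument adds a whole directory
sc.stop()
```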
Merged into master. Thanks for all your reviews. @BryanCutler @srowen
… directory recursively

What changes were proposed in this pull request?
#15140 exposed `JavaSparkContext.addFile(path: String, recursive: Boolean)` to Python/R, so we can update SparkR `spark.addFile` to support adding a directory recursively.

How was this patch tested?
Added unit test.

Author: Yanbo Liang <[email protected]>
Closes #15216 from yanboliang/spark-17577-2.
What changes were proposed in this pull request?
Users would like to add a directory as a dependency in some cases. In Scala they can use `SparkContext.addFile` with the argument `recursive=true` to recursively add all files under the directory, but Python users can only add a single file, not a directory, so we should support this as well.

How was this patch tested?
Unit test.
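As a minimal end-to-end sketch of the feature described above, assuming a local directory `conf_dir` that contains a `settings.txt` (both names are hypothetical): the driver adds the directory once, and tasks resolve their local copy through `SparkFiles.get`.

```python
import os

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local[2]", "add-dir-example")

# Distribute the whole directory; before this change, PySpark could only add single files.
sc.addFile("conf_dir", recursive=True)


def read_setting(_):
    # Executors locate their local copy of the added directory by name.
    root = SparkFiles.get("conf_dir")
    with open(os.path.join(root, "settings.txt")) as f:
        return f.readline().strip()


print(sc.parallelize([0]).map(read_setting).first())
sc.stop()
```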