[SPARK-17585][PySpark][Core] PySpark SparkContext.addFile supports adding files recursively #15140
Conversation
Test build #65579 has finished for PR 15140 at commit
    self.assertEqual("Hello World!\n", test_file.readline())
with open(download_path + "/sub_hello/sub_hello.txt") as test_file:
    self.assertEqual("Sub Hello World!\n", test_file.readline())
minor: maybe the above block should be in a separate test like def test_add_file_locally_recursive?
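Along the lines of that suggestion, here is a rough sketch of what a standalone test could look like. This is an assumption-laden illustration, not the code merged in the PR: the class name, local-mode SparkContext setup, and temp-directory handling are hypothetical, while the file layout and assertions mirror the snippet above (Spark's own suite would use the ReusedPySparkTestCase fixture instead).

```python
import os
import shutil
import tempfile
import unittest

from pyspark import SparkContext, SparkFiles


class AddFileRecursiveTest(unittest.TestCase):
    def setUp(self):
        # Hypothetical standalone fixture; the real suite reuses a shared context.
        self.sc = SparkContext("local[2]", "add-file-recursive-test")
        self.temp_dir = tempfile.mkdtemp()

    def tearDown(self):
        self.sc.stop()
        shutil.rmtree(self.temp_dir)

    def test_add_file_locally_recursive(self):
        # Build hello/hello.txt and hello/sub_hello/sub_hello.txt.
        path = os.path.join(self.temp_dir, "hello")
        os.makedirs(os.path.join(path, "sub_hello"))
        with open(os.path.join(path, "hello.txt"), "w") as f:
            f.write("Hello World!\n")
        with open(os.path.join(path, "sub_hello", "sub_hello.txt"), "w") as f:
            f.write("Sub Hello World!\n")

        # Add the whole directory, then verify both files were distributed.
        self.sc.addFile(path, recursive=True)
        download_path = SparkFiles.get("hello")
        with open(os.path.join(download_path, "hello.txt")) as test_file:
            self.assertEqual("Hello World!\n", test_file.readline())
        with open(os.path.join(download_path, "sub_hello", "sub_hello.txt")) as test_file:
            self.assertEqual("Sub Hello World!\n", test_file.readline())


if __name__ == "__main__":
    unittest.main()
```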
Just one minor suggestion, otherwise LGTM
@BryanCutler Updated, thanks for your comments.
Test build #65652 has finished for PR 15140 at commit
@srowen Would you mind having a look at this when you're available? Thanks.
 *
 * A directory can be given if the recursive option is set to true. Currently directories are only
 * supported for Hadoop-supported filesystems.
 */
(Why do we need this?)
Since JavaSparkContext is the Java stub that will be called by PySpark.
You're calling the method on SparkContext:
self._jsc.sc().addFile(path, recursive)
I don't think this needed to be exposed?
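For context, a minimal sketch of the wrapper being discussed; this is not the exact merged source, and the docstring wording is assumed. `self._jsc` is the Py4J handle to JavaSparkContext, and `.sc()` returns the underlying Scala SparkContext, which already provides the two-argument `addFile` overload.

```python
def addFile(self, path, recursive=False):
    """Add a file to be downloaded with this Spark job on every node.

    With recursive=True, `path` may be a directory (currently only for
    Hadoop-supported filesystems).
    """
    # Delegate to the Scala SparkContext reached through the Py4J gateway,
    # so nothing new strictly has to be exposed on JavaSparkContext.
    self._jsc.sc().addFile(path, recursive)
```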
Oh, I see. I found that _jsc.sc() and _jsc are used interchangeably in context.py. I will do some cleanup and unify them in a follow-up. Thanks for your comments.
Sounds good. There may be a reason the Java context is needed for some calls. I suppose that where the SparkContext could be used, that's simpler, but it doesn't really save anything because we wouldn't be able to take methods out of JavaSparkContext. That's why I was hoping to avoid adding a method to it.
@srowen After investigating the code, I found it's not very straightforward to clean up the interfaces in JavaSparkContext, since they are called by both Python and R. On the Python side we can use _jsc.sc() in some cases, but it gets messy if we use both JavaSparkContext and JavaSparkContext.sc on the R side. So I think we should leave it as it is, unless you have another suggestion? Thanks.
Yes, but can we undo this change? It doesn't seem like we need to duplicate this method in the Java API.
I think it's not required to undo this, since I will soon send a PR to support recursively adding files under a directory for SparkR, and it will leverage this API. Thanks.
OK, if that requires the Java context, I get it.
return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)
-    def addFile(self, path):
+    def addFile(self, path, recursive=False):
This basically doesn't change the API, right? You can still call it as before with the same behavior.
It seems reasonable to me overall because it adds parity between the APIs, isn't complex and doesn't change behavior.
Yes, it does not change the existing API.
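A quick illustration of that point, with made-up paths; the one-argument call keeps its old behavior because `recursive` defaults to `False`:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "compat-check")
sc.addFile("data/lookup.txt")                 # existing call form, behavior unchanged
sc.addFile("data/resources", recursive=True)  # new keyword argument adds a whole directory
sc.stop()
```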
Merged into master. Thanks for all your reviews. @BryanCutler @srowen
… directory recursively

What changes were proposed in this pull request?
#15140 exposed `JavaSparkContext.addFile(path: String, recursive: Boolean)` to Python/R, so we can update SparkR `spark.addFile` to support adding a directory recursively.

How was this patch tested?
Added unit test.

Author: Yanbo Liang <[email protected]>
Closes #15216 from yanboliang/spark-17577-2.
What changes were proposed in this pull request?
Users would like to add a directory as a dependency in some cases. In Scala they can use `SparkContext.addFile` with the argument `recursive=true` to recursively add all files under the directory, but Python users can only add a single file, not a directory, so we should support this as well.

How was this patch tested?
Unit test.
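As a minimal end-to-end sketch of the feature described above, assuming a local directory `conf_dir` that contains a `settings.txt` (both names are hypothetical): the driver adds the directory once, and tasks resolve their local copy through `SparkFiles.get`.

```python
import os

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local[2]", "add-dir-example")

# Distribute the whole directory; before this change, PySpark could only add single files.
sc.addFile("conf_dir", recursive=True)


def read_setting(_):
    # Executors locate their local copy of the added directory by name.
    root = SparkFiles.get("conf_dir")
    with open(os.path.join(root, "settings.txt")) as f:
        return f.readline().strip()


print(sc.parallelize([0]).map(read_setting).first())
sc.stop()
```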