
Conversation

@yanboliang (Contributor) commented Sep 18, 2016

What changes were proposed in this pull request?

In some cases users would like to add a directory as a dependency. In Scala they can call SparkContext.addFile with recursive=true to recursively add all files under a directory, but Python users can only add a single file, not a directory. This patch adds the same support to PySpark.
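For illustration, a minimal sketch of the intended PySpark usage; the directory /tmp/mydeps and the file names are hypothetical:

```python
import os
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="add-dir-example")

# /tmp/mydeps is a hypothetical local directory containing hello.txt
# (and possibly nested subdirectories); recursive=True is the new flag.
sc.addFile("/tmp/mydeps", recursive=True)

def read_dep(_):
    # On each executor, resolve the distributed directory and read a file from it.
    root = SparkFiles.get("mydeps")
    with open(os.path.join(root, "hello.txt")) as f:
        return f.readline()

print(sc.parallelize([0]).map(read_dep).collect())
sc.stop()
```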

How was this patch tested?

Unit test.

@SparkQA commented Sep 18, 2016

Test build #65579 has finished for PR 15140 at commit c277a2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor Author)

cc @rxin @davies @srowen

self.assertEqual("Hello World!\n", test_file.readline())
with open(download_path + "/sub_hello/sub_hello.txt") as test_file:
self.assertEqual("Sub Hello World!\n", test_file.readline())

Member:

minor: maybe the above block should be in a separate test like def test_add_file_locally_recursive?
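For reference, a self-contained sketch of what such a separate test could look like, reusing the assertions quoted above (the test class, app name, and fixture layout are assumptions, not code from this PR):

```python
import os
import shutil
import tempfile
import unittest

from pyspark import SparkContext, SparkFiles


class AddFileTests(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext(appName="test_add_file")
        # Build a small fixture: hello/hello.txt and hello/sub_hello/sub_hello.txt.
        self.tmp_dir = tempfile.mkdtemp()
        self.path = os.path.join(self.tmp_dir, "hello")
        os.makedirs(os.path.join(self.path, "sub_hello"))
        with open(os.path.join(self.path, "hello.txt"), "w") as f:
            f.write("Hello World!\n")
        with open(os.path.join(self.path, "sub_hello", "sub_hello.txt"), "w") as f:
            f.write("Sub Hello World!\n")

    def tearDown(self):
        self.sc.stop()
        shutil.rmtree(self.tmp_dir)

    def test_add_file_locally_recursive(self):
        self.sc.addFile(self.path, recursive=True)
        download_path = SparkFiles.get("hello")
        self.assertNotEqual(self.path, download_path)
        with open(os.path.join(download_path, "hello.txt")) as test_file:
            self.assertEqual("Hello World!\n", test_file.readline())
        with open(os.path.join(download_path, "sub_hello", "sub_hello.txt")) as test_file:
            self.assertEqual("Sub Hello World!\n", test_file.readline())
```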

@BryanCutler (Member)

Just one minor suggestion, otherwise LGTM

@yanboliang (Contributor Author) commented Sep 20, 2016

@BryanCutler Updated, thanks for your comments.

@SparkQA commented Sep 20, 2016

Test build #65652 has finished for PR 15140 at commit 6174f0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor Author)

@srowen Would you mind having a look at this when you're available? Thanks.

*
* A directory can be given if the recursive option is set to true. Currently directories are only
* supported for Hadoop-supported filesystems.
*/
Member:

(Why do we need this?)

Contributor Author:

Because JavaSparkContext is the Java stub that is called by PySpark.

Member:

You're calling the method on SparkContext:

self._jsc.sc().addFile(path, recursive)

I don't think this needs to be exposed?
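(For context, a rough sketch of the Python-side wrapper this thread is about, with the docstring paraphrased from the Scala doc quoted earlier; an illustration, not the exact patch:)

```python
def addFile(self, path, recursive=False):
    """
    Add a file to be downloaded with this Spark job on every node.
    A directory can be given if recursive is set to True; directories
    are currently only supported for Hadoop-supported filesystems.
    """
    # Forward to the Scala SparkContext behind the Java wrapper.
    self._jsc.sc().addFile(path, recursive)
```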

Contributor Author:

Oh, I see. I found _jsc.sc() and _jsc are used interchangeably in context.py. I will do some cleanup and unify them in a follow-up. Thanks for your comments.

Member:

Sounds good. There may be a reason the Java context is needed for some calls. I suppose that where the SparkContext could be used directly, that's simpler, but it doesn't really save anything because we wouldn't be able to take methods out of JavaSparkContext. That's why I was hoping to avoid adding a method to it.

Contributor Author:

@srowen After investigating the code, I found it's not very straightforward to clean up the interfaces in JavaSparkContext, since they are called from both Python and R. On the Python side we can use _jsc.sc() in some cases, but on the R side it gets messy if we use both JavaSparkContext and JavaSparkContext.sc. So I think we should leave it as it is, unless you have another suggestion? Thanks.

Member:

Yes, but can we undo this change? It doesn't seem like we need to duplicate this method in the Java API.

Contributor Author:

I don't think we need to undo this, since I will soon send a PR to support recursively adding files under a directory in SparkR, and it will leverage this API. Thanks.

Member:

OK, if that requires the Java context, I get it.

return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)

-    def addFile(self, path):
+    def addFile(self, path, recursive=False):
Member:

This basically doesn't change the API, right? You can still call it as before with the same behavior.

It seems reasonable to me overall because it adds parity between the APIs, isn't complex and doesn't change behavior.

Contributor Author:

Yes, it does not change the existing API.
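A minimal illustration of that point (the file and directory paths are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="addfile-compat")

# Existing call sites keep working unchanged; recursive defaults to False.
sc.addFile("/tmp/deps/config.txt")

# Adding a whole directory is new, opt-in behavior.
sc.addFile("/tmp/deps/resources", recursive=True)

sc.stop()
```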

@yanboliang (Contributor Author)

Merged into master. Thanks for the reviews, @BryanCutler @srowen.

@asfgit closed this in d3b8869 on Sep 21, 2016
@yanboliang deleted the spark-17585 branch on September 21, 2016 at 08:41
asfgit pushed a commit that referenced this pull request Sep 26, 2016
… directory recursively

## What changes were proposed in this pull request?
#15140 exposed ```JavaSparkContext.addFile(path: String, recursive: Boolean)``` to Python/R, so we can update SparkR ```spark.addFile``` to support adding a directory recursively.

## How was this patch tested?
Added unit test.

Author: Yanbo Liang <[email protected]>

Closes #15216 from yanboliang/spark-17577-2.