Add SparkFiles.get() API to access files added through addFile(). #394
Conversation
This should avoid exceptions caused by existing files with different contents. I also removed some unused code.
I just realized that this introduces a bug in SparkContext.addPyFile: we need to append to the PYTHONPATH on the worker machine because we can no longer assume that these files will be in the current working directory. Working on a fix now.
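For illustration only, here is a minimal sketch of the kind of worker-side fix described above; the helper name and the way the download directory is passed in are assumptions, not the actual patch:

```python
# Hypothetical sketch: the directory that addFile()/addPyFile() downloads
# were placed in on this worker is prepended to sys.path, so that modules
# shipped with addPyFile() can still be imported even though they are no
# longer in the worker's current working directory.
import sys

def _prepend_files_dir_to_path(files_root_dir):
    # files_root_dir is assumed to be the local download directory on this worker.
    if files_root_dir not in sys.path:
        sys.path.insert(0, files_root_dir)
```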
I fixed the bug and added a test, so this should be ready for review.
Missing a closing backtick.
Made a few small comments.
I just realized that the PySpark version of SparkFiles may not work correctly if it is used inside the user's Python driver code (i.e. outside of closures that run in the Python worker process). It shouldn't be too hard to support this, though, since I can add code that detects whether we're running on the worker or the driver, allowing PySpark's SparkFiles to directly call its Java counterpart when running in the driver.
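As a rough sketch of the driver/worker detection idea (the attribute names and the Py4J call are assumptions for illustration, not the actual implementation):

```python
import os

class SparkFiles(object):
    # These are assumed to be set elsewhere in this sketch:
    _is_running_on_worker = False   # flipped to True by the Python worker process
    _root_directory = None          # local download directory, set on the worker
    _sc = None                      # SparkContext, set in the driver

    @classmethod
    def get(cls, filename):
        """Return the absolute local path of a file added through addFile()."""
        if cls._is_running_on_worker:
            # On the worker, files have already been downloaded locally.
            root = cls._root_directory
        else:
            # In the driver, delegate to the JVM SparkFiles implementation
            # through the Py4J gateway (exact gateway path is an assumption).
            root = cls._sc._jvm.spark.SparkFiles.getRootDirectory()
        return os.path.join(root, filename)
```

The point is simply that get() can return an absolute path regardless of whether it is called from driver code or from inside a task closure.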
Fix minor documentation formatting issues.
I added support for using SparkFiles in the Python driver and addressed the documentation typos.
Great, thanks.
This pull request adds a SparkFiles object that is used to access files added with addFile(). This allows us to break the assumption that these files are downloaded to the current working directory, which caused problems when dependencies were fetched on the master / driver machine: https://groups.google.com/d/topic/spark-developers/BG8a9guHgG4/discussion
This change was originally proposed in #345. It is backwards-incompatible, but upgrading user code should be pretty simple.
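As a hedged usage sketch of the new API (the file path, app name, and import location are illustrative and assume a current PySpark layout, not necessarily the code in this PR):

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "sparkfiles-example")
sc.addFile("/tmp/lookup.txt")  # could also be an HDFS or HTTP URL

def read_lookup(_):
    # The local path of the downloaded copy is resolved through SparkFiles,
    # instead of assuming the file sits in the task's current working directory.
    with open(SparkFiles.get("lookup.txt")) as f:
        return f.read()

print(sc.parallelize([1]).map(read_lookup).collect())
```

Upgrading existing user code mostly means replacing bare `open("lookup.txt")`-style accesses, which relied on the working-directory assumption, with `SparkFiles.get("lookup.txt")`.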
It also fixes SPARK-662 by adding synchronization to Executor.updateDependencies().
I plan to port #345 and the synchronization fix back to branch-0.6, but I'll do that in a separate pull request.
I also deleted some unused code.