
Conversation

@JoshRosen

This pull request adds a SparkFiles object that is used to access files added with addFile(). This allows us to break the assumption that these files are downloaded to the current working directory, which caused problems when dependencies were fetched on the master / driver machine: https://groups.google.com/d/topic/spark-developers/BG8a9guHgG4/discussion

This change was originally proposed in #345. It is backwards-incompatible, but upgrading user code should be pretty simple.
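
For illustration, a minimal sketch of the upgraded usage (the file path and name are made up, and the package name shown is the later org.apache.spark one; at the time of this PR the class lived in the top-level spark package):

```scala
import org.apache.spark.{SparkContext, SparkFiles}

object SparkFilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "SparkFilesExample")
    sc.addFile("/tmp/lookup.txt")  // hypothetical file shipped to every executor

    val lengths = sc.parallelize(1 to 4).map { _ =>
      // Resolve the file's absolute path on the executor via SparkFiles.get()
      // instead of assuming it was downloaded to the current working directory.
      val path = SparkFiles.get("lookup.txt")
      scala.io.Source.fromFile(path).getLines().size
    }.collect()

    println(lengths.mkString(", "))
    sc.stop()
  }
}
```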

It also fixes SPARK-662 by adding synchronization to Executor.updateDependencies().
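
The sketch below is not the actual Executor.updateDependencies() code; it is only a minimal illustration, with made-up names, of the kind of race that synchronization prevents when several task threads report new dependencies at once:

```scala
import scala.collection.mutable

// Hypothetical tracker: the whole check-then-fetch sequence is guarded by one
// lock so that two task threads never download (or clobber) the same file twice.
class DependencyTracker {
  private val currentFiles = mutable.HashMap[String, Long]()  // path -> timestamp

  def updateDependencies(newFiles: Map[String, Long])(fetch: String => Unit): Unit =
    synchronized {
      for ((path, timestamp) <- newFiles if currentFiles.getOrElse(path, -1L) < timestamp) {
        fetch(path)                      // e.g. download into the SparkFiles root directory
        currentFiles(path) = timestamp   // later tasks with the same timestamp skip the fetch
      }
    }
}
```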

I plan to port #345 and the synchronization fix back to branch-0.6, but I'll do that in a separate pull request.

I also deleted some unused code.

This should avoid exceptions caused by existing files with different contents.

I also removed some unused code.
@JoshRosen

I just realized that this introduces a bug in SparkContext.addPyFile: we need to append to the PYTHONPATH on the worker machine because we can no longer assume that these files will be in the current working directory. Working on a fix now.

@JoshRosen

I fixed the bug and added a test, so this should be ready for review.

Review comment: Missing a closing tick

@mateiz commented Jan 23, 2013

Made a few small comments.

@JoshRosen

I just realized that the PySpark version of SparkFiles may not work correctly if it is used inside the user's Python driver code (i.e. outside of the closures that run in the Python worker process). It shouldn't be too hard to support this, though: I can add code that detects whether we're running on the worker or the driver, allowing PySpark's SparkFiles to call its Java counterpart directly when running in the driver.

Fix minor documentation formatting issues.
@JoshRosen

I added support for using SparkFiles in the Python driver and addressed the documentation typos.

mateiz added a commit that referenced this pull request Jan 23, 2013
Add SparkFiles.get() API to access files added through addFile().
@mateiz merged commit 4d77d55 into mesos:master Jan 23, 2013
@mateiz commented Jan 23, 2013

Great, thanks.

harveyfeng pushed a commit to harveyfeng/spark-mesos that referenced this pull request Jan 14, 2014
Better error handling in Spark Streaming and more API cleanup

Previously, errors in jobs generated by Spark Streaming (or in the generation of those jobs) could not be caught from the main driver thread (i.e. the thread that called StreamingContext.start()) because they were thrown on different threads. With this change, after `ssc.start()`, one can call `ssc.awaitTermination()`, which will block until the StreamingContext is stopped or an exception is thrown. This makes it easier to debug.

This change also adds ssc.stop(<stop-spark-context>), with which you can stop the StreamingContext without stopping the underlying SparkContext.

Also fixes the bug that came up with PRs mesos#393 and mesos#381. The MetadataCleaner default value has been changed from 3500 to -1 for a normal SparkContext, and to 3600 when creating a StreamingContext. Also, updated StreamingListenerBus with changes similar to SparkListenerBus in mesos#392.

And changed a lot of protected[streaming] to private[streaming].
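
For reference, a minimal sketch of the `ssc.awaitTermination()` / `ssc.stop(...)` pattern described in that commit message (the socket source, host, and port are placeholders, and the `stopSparkContext` parameter name is assumed):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingShutdownExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "StreamingShutdownExample")
    val ssc = new StreamingContext(sc, Seconds(1))

    // Placeholder stream: count lines arriving on a local socket each second.
    ssc.socketTextStream("localhost", 9999).count().print()

    ssc.start()
    // Blocks until the StreamingContext is stopped, and rethrows any error
    // raised on Spark Streaming's internal threads in this driver thread.
    ssc.awaitTermination()

    // Elsewhere (e.g. a shutdown hook), stop streaming but keep the SparkContext:
    // ssc.stop(stopSparkContext = false)
  }
}
```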