Add SparkFiles.get() API to access files added through addFile(). #394
Conversation
This should avoid exceptions caused by existing files with different contents. I also removed some unused code.
I just realized that this introduces a bug in SparkContext.addPyFile: we need to append to the PYTHONPATH on the worker machine because we can no longer assume that these files will be in the current working directory. Working on a fix now.
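For illustration only, here is a minimal sketch of the kind of worker-side fix described above; the helper name and the way the download directory is passed in are assumptions, not the actual patch:

```python
# Hypothetical sketch: the directory that addFile()/addPyFile() downloads
# were placed in on this worker is prepended to sys.path, so that modules
# shipped with addPyFile() can still be imported even though they are no
# longer in the worker's current working directory.
import sys

def _prepend_files_dir_to_path(files_root_dir):
    # files_root_dir is assumed to be the local download directory on this worker.
    if files_root_dir not in sys.path:
        sys.path.insert(0, files_root_dir)
```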
I fixed the bug and added a test, so this should be ready for review.
Missing a closing backtick.
Made a few small comments.
I just realized that the PySpark version of SparkFiles may not work correctly if it is used inside the user's Python driver code (i.e. outside of closures that run in the Python worker process). It shouldn't be too hard to support this, though, since I can add code that detects whether we're running on the worker or the driver, allowing PySpark's SparkFiles to directly call its Java counterpart when running in the driver.
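As a rough sketch of the driver/worker detection idea (the attribute names and the Py4J call are assumptions for illustration, not the actual implementation):

```python
import os

class SparkFiles(object):
    # These are assumed to be set elsewhere in this sketch:
    _is_running_on_worker = False   # flipped to True by the Python worker process
    _root_directory = None          # local download directory, set on the worker
    _sc = None                      # SparkContext, set in the driver

    @classmethod
    def get(cls, filename):
        """Return the absolute local path of a file added through addFile()."""
        if cls._is_running_on_worker:
            # On the worker, files have already been downloaded locally.
            root = cls._root_directory
        else:
            # In the driver, delegate to the JVM SparkFiles implementation
            # through the Py4J gateway (exact gateway path is an assumption).
            root = cls._sc._jvm.spark.SparkFiles.getRootDirectory()
        return os.path.join(root, filename)
```

The point is simply that get() can return an absolute path regardless of whether it is called from driver code or from inside a task closure.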
Fix minor documentation formatting issues.
I added support for using SparkFiles in the Python driver and addressed the documentation typos.
Great, thanks.
This pull request adds a SparkFiles object that is used to access files added with addFile(). This allows us to break the assumption that these files are downloaded to the current working directory, which caused problems when dependencies were fetched on the master / driver machine: https://groups.google.com/d/topic/spark-developers/BG8a9guHgG4/discussion
This change was originally proposed in #345. It is backwards-incompatible, but upgrading user code should be pretty simple.
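As a hedged usage sketch of the new API (the file path, app name, and import location are illustrative and assume a current PySpark layout, not necessarily the code in this PR):

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "sparkfiles-example")
sc.addFile("/tmp/lookup.txt")  # could also be an HDFS or HTTP URL

def read_lookup(_):
    # The local path of the downloaded copy is resolved through SparkFiles,
    # instead of assuming the file sits in the task's current working directory.
    with open(SparkFiles.get("lookup.txt")) as f:
        return f.read()

print(sc.parallelize([1]).map(read_lookup).collect())
```

Upgrading existing user code mostly means replacing bare `open("lookup.txt")`-style accesses, which relied on the working-directory assumption, with `SparkFiles.get("lookup.txt")`.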
It also fixes SPARK-662 by adding synchronization to Executor.updateDependencies().
I plan to port #345 and the synchronization fix back to branch-0.6, but I'll do that in a separate pull request.
I also deleted some unused code.