SPARK-12729. PhantomReference to replace finalize in python broadcast… #11257
….Redesigned the code to keep track of the creation and closing of the phantom reference thread. Fixed null pointer exceptions.
|
Test build #2547 has finished for PR 11257 at commit
|
private[spark] class FilePhantomReference(@transient var f: File, var q: ReferenceQueue[File])
  extends PhantomReference(f, q){
  private def cleanup()
who calls this method?
|
@zsxwing Most documentation on phantom references suggests a separate daemon thread to do the cleanup. I can try removing the phantom reference from the queue in the same function, without a thread. Would that work?
The problem is it will create one thread for each PythonBroadcast. If there are hundreds of |
|
@zsxwing How about creating one thread, with a reference queue that contains all instances of the file object that need to be GC'ed? Start a global thread?
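For illustration, the single-thread idea above could be sketched roughly as follows. This is a minimal sketch, not the code from this PR: `FileCleanupRef`, `FileCleaner`, and `register` are hypothetical names, and the sketch assumes the reference tracks the owning object (for example a broadcast) rather than the `File` itself, so the path is still available once the owner is unreachable. It does not answer the question of when to stop the thread; it simply runs as a daemon.

```scala
import java.io.File
import java.lang.ref.{PhantomReference, ReferenceQueue}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical: a phantom reference to the owner that remembers the file path.
// `owner` and `q` are only passed to the superclass, so no strong reference
// to the owner is kept here.
final class FileCleanupRef(owner: AnyRef, val path: String, q: ReferenceQueue[AnyRef])
  extends PhantomReference[AnyRef](owner, q)

// Hypothetical: one shared queue and one daemon thread for all tracked files.
object FileCleaner {
  private val queue = new ReferenceQueue[AnyRef]

  // Keep the reference objects strongly reachable; otherwise they could be
  // collected before they are ever enqueued.
  private val pending = ConcurrentHashMap.newKeySet[FileCleanupRef]()

  private val worker = new Thread(new Runnable {
    override def run(): Unit = {
      while (true) {
        // Blocks until the GC (or an explicit enqueue()) queues a reference.
        queue.remove() match {
          case ref: FileCleanupRef =>
            pending.remove(ref)
            new File(ref.path).delete()
          case _ => ()
        }
      }
    }
  }, "file-cleaner")
  worker.setDaemon(true)
  worker.start()

  // Track `file` so it is deleted after `owner` becomes unreachable.
  def register(owner: AnyRef, file: File): FileCleanupRef = {
    val ref = new FileCleanupRef(owner, file.getAbsolutePath, queue)
    pending.add(ref)
    ref
  }
}
```

With this shape, each broadcast would call `FileCleaner.register(this, file)` once, and the thread count stays at one no matter how many broadcasts exist. The delete still happens off the GC threads, which is the property the phantom-reference approach is after.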
|
@zsxwing Should I modify the code to use one thread?
|
@GayathriMurali Actually, I'm not sure it will work. E.g., you need to create a global thread, but when do you stop it?
|
@zsxwing when the queue is empty? Add a listener to the queue to invoke or stop the thread accordingly? Would this approach still make Phantom reference more beneficial than finalize()? |
Right. I'm thinking about it. Actually, it looks much more complicated than we thought. Maybe let's just keep it unchanged until someone complains about bad performance.
|
Why not use a finalizer? This is looking pretty complex with new threads and reference schemes otherwise.
|
@srowen The JIRA was created by @davies and the intent was to replace finalize() with Phantom Reference. http://resources.ej-technologies.com/jprofiler/help/doc/index.html |
|
I think this is the wrong solution. |
|
The finalize() here cleans up the disk file for Python broadcasts; it is the regular path. @srowen This change was requested by @rxin, to avoid blocking operations in finalize(). Since Python broadcasts are rarely used, and exists() and delete() should be lightweight system calls, I'm fine with the current finalizer.
|
Another way to avoid blocking in To the extent they're rarely used, is this a problem? My concern is that you're simply reimplementing a separate 'finalizer' queue with all the associated complexity. |
|
The problem is blocking GC threads ... |
|
Yes, I certainly understand that. |
|
Yea we could do that too. Actually explicit management is probably better, with a fallback to do implicit management so it is robust against mem leaks. That is provided somebody has the cycles to do it. |
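A rough sketch of what "explicit management with an implicit fallback" might look like, purely for illustration. `ManagedTempFile` is a hypothetical class, not the actual PythonBroadcast, and the sketch assumes the fallback is the existing finalizer, which is one possible reading of the comment above.

```scala
import java.io.File
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical wrapper around a temporary file that should be deleted
// when its owner is done with it.
class ManagedTempFile(val file: File) extends AutoCloseable {
  private val closed = new AtomicBoolean(false)

  // Explicit path: the caller deletes the file deterministically,
  // on its own thread rather than on a GC or finalizer thread.
  override def close(): Unit = {
    if (closed.compareAndSet(false, true)) {
      file.delete()
    }
  }

  def isClosed: Boolean = closed.get()

  // Implicit fallback: only touches the file system if close() was
  // never called, so the common case does no I/O during finalization.
  override protected def finalize(): Unit = close()
}
```

Callers that clean up explicitly never pay for file I/O in the finalizer; the fallback only guards against a forgotten `close()`.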
|
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. We can also continue the discussion on the JIRA ticket. |
What changes were proposed in this pull request?
Replace finalize() method in PythonBroadcast with Phantom Reference. Redesigned some portions of the code to better handle the thread to avoid thread leak. Fixed existing null pointer exceptions. Introduced a new thread class and a phantom reference class.
How was this patch tested?
build/sbt "test-only org.apache.spark.api.python.*" - Passes all tests