-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-28112] [TEST] Fix Kryo exception perf. bottleneck in tests due to absence of ML/MLlib classes #24916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @JoshRosen |
|
|
||
| // classForName() is expensive in case the class is not found, so we filter the list of | ||
| // SQL / ML / MLlib classes once and then re-use that filtered list in newInstance() calls. | ||
| private lazy val loadableClasses: Seq[Class[_]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be moved into a private[serializer] field in a object KryoSerializer companion? Now that I look at this again, I'm worried that it'll be serialized as part of KryoSerializer itself, since I think the serializer itself is serialized as part of ShuffleDependency. I don't think that's a huge deal but we could probably shave off some additional work with that extra step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a commit with that change: JoshRosen@c8680f9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I just pushed directly to your branch using GitHub's new "allow edits from maintainers" feature. Hope you don't mind 😄!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Feel free to push it.
|
LGTM pending Jenkins. |
|
Test build #106690 has finished for PR 24916 at commit
|
|
Test build #106694 has finished for PR 24916 at commit
|
|
Merged to master. Thanks @gatorsmile! |
What changes were proposed in this pull request?
In a nutshell, it looks like the absence of ML / MLlib classes on the classpath causes code in KryoSerializer to throw and catch ClassNotFoundExceptions whenever instantiating a new serializer in newInstance(). This isn't a performance problem in production (since MLlib is on the classpath there) but it's a huge issue in tests and appears to account for an enormous amount of test time
We can address this problem by reducing the total number of ClassNotFoundExceptions by performing the class existence checks once and storing the results in KryoSerializer instances rather than repeating the checks on each newInstance() call.
How was this patch tested?
The existing tests.
Authored-by: Josh Rosen [email protected]