@gatorsmile
Member

What changes were proposed in this pull request?

In a nutshell, it looks like the absence of ML / MLlib classes on the classpath causes code in KryoSerializer to throw and catch ClassNotFoundExceptions every time a new serializer is instantiated in newInstance(). This isn't a performance problem in production (since MLlib is on the classpath there), but it's a huge issue in tests and appears to account for an enormous amount of test time.

We can address this problem by performing the class existence checks once and storing the results in KryoSerializer instances, rather than repeating the checks on each newInstance() call. This reduces the total number of ClassNotFoundExceptions thrown.
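
To illustrate the idea, here is a minimal sketch, not the actual patch. `CachingSerializer`, `candidateClassNames`, and the `lookups` counter are hypothetical names made up for this example, and plain `Class.forName` stands in for whatever class-loading helper KryoSerializer really uses.

```scala
import scala.util.Try

// Hypothetical stand-in for the serializer; this is not Spark's KryoSerializer.
class CachingSerializer(candidateClassNames: Seq[String]) {
  // Counts how often the existence checks run (for illustration only).
  var lookups: Int = 0

  // Evaluated on first use. Classes that cannot be loaded are dropped here,
  // so the ClassNotFoundException cost is paid once per instance rather
  // than once per newInstance() call.
  private lazy val loadableClasses: Seq[Class[_]] = {
    lookups += 1
    candidateClassNames.flatMap { name =>
      Try[Class[_]](Class.forName(name)).toOption
    }
  }

  // Stand-in for newInstance(): it only re-uses the filtered list.
  def newInstance(): Seq[Class[_]] = loadableClasses
}
```

Calling `newInstance()` repeatedly on one `CachingSerializer` runs the checks only on the first call; later calls read the cached list.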

How was this patch tested?

The existing tests.

Authored-by: Josh Rosen [email protected]

@gatorsmile
Member Author

cc @JoshRosen


// classForName() is expensive in case the class is not found, so we filter the list of
// SQL / ML / MLlib classes once and then re-use that filtered list in newInstance() calls.
private lazy val loadableClasses: Seq[Class[_]] = {
Contributor

@JoshRosen Jun 19, 2019

Could this be moved into a private[serializer] field in an object KryoSerializer companion? Now that I look at this again, I'm worried that it'll be serialized as part of KryoSerializer itself, since I think the serializer itself is serialized as part of ShuffleDependency. I don't think that's a huge deal, but we could probably shave off some additional work with that extra step.
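
A rough sketch of what that suggestion could look like, using hypothetical names (`SketchSerializer`, `candidateClassNames`, `serializedSize`) rather than the real Spark code. Plain `private` is used here instead of `private[serializer]` only so the snippet compiles without a package declaration.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

// Hypothetical stand-in; this is not Spark's KryoSerializer.
class SketchSerializer extends Serializable {
  // Reads the shared list; the instance itself has no field holding it.
  def newInstance(): Seq[Class[_]] = SketchSerializer.loadableClasses
}

object SketchSerializer {
  // Assumed example names; the real list covers SQL / ML / MLlib classes.
  private val candidateClassNames = Seq("java.lang.String", "no.such.Clazz")

  // Lives in the companion object, so it is computed at most once per JVM
  // and is never written out along with a SketchSerializer instance.
  private lazy val loadableClasses: Seq[Class[_]] =
    candidateClassNames.flatMap { name =>
      Try[Class[_]](Class.forName(name)).toOption
    }

  // Illustration helper: size in bytes of an object under Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }
}
```

Because the filtered list is state of the companion object and not of the instance, `serializedSize(new SketchSerializer)` should be the same before and after the first `newInstance()` call.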

Contributor

Here's a commit with that change: JoshRosen@c8680f9

Contributor

Actually, I just pushed directly to your branch using GitHub's new "allow edits from maintainers" feature. Hope you don't mind 😄!

Member Author

Sure. Feel free to push it.

@gatorsmile
Member Author

LGTM pending Jenkins.

@SparkQA

SparkQA commented Jun 20, 2019

Test build #106690 has finished for PR 24916 at commit ae1b642.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • // classForName() is expensive in case the class is not found, so we filter the list of

@SparkQA

SparkQA commented Jun 20, 2019

Test build #106694 has finished for PR 24916 at commit c8680f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • // classForName() is expensive in case the class is not found, so we filter the list of

@JoshRosen
Contributor

Merged to master. Thanks @gatorsmile!

@JoshRosen closed this in ec032ce Jun 20, 2019