From 737c1fc67ab94c8ec65dad4af053895ede9f2dbe Mon Sep 17 00:00:00 2001
From: Marcelo Vanzin
Date: Wed, 23 Sep 2015 15:03:18 -0700
Subject: [PATCH] RFC: Removing assemblies from Spark.

---
 docs/rfc-no-assemblies.md | 154 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 154 insertions(+)
 create mode 100644 docs/rfc-no-assemblies.md

diff --git a/docs/rfc-no-assemblies.md b/docs/rfc-no-assemblies.md
new file mode 100644
index 0000000000000..bc8598b65ca15
--- /dev/null
+++ b/docs/rfc-no-assemblies.md
@@ -0,0 +1,154 @@

# Replacing the Spark Assembly with good old jars

Spark, since the 1.0 release (at least), has used assemblies (or “fat jars”) to deliver the shared
Spark code to users. In this document I’ll discuss a few problems this approach causes, and how
avoiding assemblies makes development and deployment of Spark easier for everyone.

## What does it solve? And at what cost?

The first question to ask is: what problems is the assembly solving?

The assembly provides a convenient package including all the classes needed to run a Spark
application. In theory, you can easily move it from one place to another, easily cache it somewhere
like HDFS for reuse, and easily include it as a library in 3rd-party applications.

But that is a very shallow look at things, and it ignores a lot of problems the assembly causes.

Spark has suffered in the past from problems caused by such a large archive with so many files
(for example, pyspark incompatibilities with large archives created by newer JDKs). That particular
issue was solved by moving to JDK 7, though.

The assembly makes dependencies very opaque. When users add the Spark assembly to an application,
what exactly are they pulling in? Are they inadvertently overriding classes needed by their
application with ones included in the Spark assembly?

The assembly makes development slower. Many tests currently need an updated assembly to run
correctly (although SPARK-9284 aims to solve that). Updating a remote cluster is slower than
necessary - even rsyncing such a large archive is not terribly efficient. It slows down the build,
because repackaging the assemblies when files change is not exactly fast, and even deciding that
there is no need to rebuild the assembly takes time.

And it does not even solve the one problem it was meant to solve: it does not include all
dependencies needed by Spark, because the Datanucleus libraries do not work when included in the
assembly.

From the point of view of someone trying to embed Spark into their application, things become
trickier still. The assembly is not a published artifact, so what should the user pick up instead?
The recommendation has been to declare Spark as a “provided” dependency and somehow ship the
appropriate Spark assembly with the user application. But that runs into all the issues above
(dependency conflicts et al), aside from being a very unnatural way to consume dependencies when
compared to other maven-based projects.

Finally, as someone whose work involves packaging Spark as part of a larger distribution, the
assembly creates yet more problems. Because all dependencies are included in one big fat jar, it is
harder to share libraries that are shipped as part of the distribution. This means packages are
unnecessarily bloated because of the code duplication, and patching becomes harder since you now
have to patch every component that ships that code.
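To make the “provided” recommendation above concrete, the usual workaround today looks something
like the following in an application’s sbt build (a sketch only; the version is illustrative, and a
maven pom would use the equivalent `provided` scope):

```scala
// Hypothetical application build.sbt illustrating today's "provided"
// workaround; the Spark version is only illustrative.
name := "my-spark-app"

scalaVersion := "2.10.5"

// Compile against Spark, but do not package it: at runtime the classes are
// expected to come from a Spark assembly shipped separately with the app.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"
```

The application still has to arrange for a matching assembly to be available wherever it runs,
which is exactly the unnatural extra step described above.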
On the distribution side, hacks were added to the Spark build to filter out shared dependencies
(all the *-provided profiles), but those are brittle, require constant maintenance and policing,
and require non-trivial work to make sure Spark has all needed libraries at runtime. If you happen
to miss a shared dependency, and you are unlucky enough to have to patch it later on, you just made
your work more complicated, because now there are two things to patch.

## How to replace it?

Ignoring potential backwards compatibility issues due to code that expects the current layout of a
Spark distribution, getting rid of the Spark assembly should be rather easy.

With a couple of exceptions that I’ll cover below, there is no code in Spark that actually depends
on the assembly. Whether the code comes from one jar or two hundred jars, everything just works. So
from the packaging side, all that is needed is to use maven’s built-in functionality (and I assume
sbt has something similar) to create a directory with the Spark jars and all needed dependencies,
instead of producing a single jar file.

The two parts of the code base that depend on the assembly are:

* The launcher library; fixing it to include all jars in a directory instead of the assembly is
  trivial.
* YARN integration.

The YARN backend assumes, by default, that there is nothing Spark-related installed in the cluster.
So when you submit an app to YARN, it will try to upload the jar containing the Spark classes
(normally the assembly) to the cluster. There are config options that can be used to tell the YARN
backend where to find the assembly (e.g. somewhere in HDFS or on the local filesystem of cluster
nodes), but those configs assume that Spark is a single file. This is already an issue today when
trying to run a Spark application that needs Datanucleus jars in cluster mode.

Fixing this is not hard; it just requires a little more code. The YARN backend should be able to
handle directories / globs as well as the current “single jar” approach to uploading (or
referencing) dependencies; a rough sketch of what that could look like follows.
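The sketch below is hypothetical, not existing Spark code: the object and method names are made up,
and a real implementation would also have to deal with caching and the existing configuration
options. It only shows how a comma-separated setting of jars, directories and globs could be
expanded using Hadoop’s filesystem API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper, not existing Spark code: expands a comma-separated
// list of jar paths, directories and globs into the individual jar files
// that the YARN backend would upload to (or reference from) the cluster.
object YarnJarResolver {
  def expandSparkJars(conf: Configuration, spec: String): Seq[Path] = {
    spec.split(",").map(_.trim).filter(_.nonEmpty).toSeq.flatMap { entry =>
      val pattern = new Path(entry)
      val fs = pattern.getFileSystem(conf)
      // globStatus handles both plain paths and wildcards; it may return
      // null when a non-wildcard path does not exist.
      val matches = Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
      matches.flatMap { status =>
        if (status.isDirectory) {
          // A directory entry means "every jar inside it".
          fs.listStatus(status.getPath)
            .filter(_.getPath.getName.endsWith(".jar"))
            .map(_.getPath)
            .toSeq
        } else {
          Seq(status.getPath)
        }
      }
    }
  }
}
```

A single jar path falls out of the same code path, so keeping backwards compatibility with the
current “single file” configs should be easy.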
Spark has more than one assembly, though, so we need to look at how the other assemblies are used
too.

The examples assembly can receive a similar treatment. The run-example script might need some
tweaking to include the extra jars in the Spark command to run. And running the examples by using
spark-submit directly might become a little bit more complicated - although even that is fixable, in
certain cases. The dependencies can be added to the example jar’s manifest, and spark-submit could
read the manifest and automatically include the jars in the running application.

The streaming backend assemblies could potentially just be removed. With the ivy integration in
Spark, the original artifacts can be used instead, and dependencies will be handled automatically.
For those using maven to build their streaming applications, including the dependencies is also
easy. To help with tidiness, the streaming backends should declare Spark dependencies such as
spark-core and spark-streaming as provided. There might be some tweaking needed to get the pyspark
streaming tests to work, since they currently depend on the backend assemblies being built. One last
thing that needs to be covered is the python unit tests; they use the assemblies to avoid having to
deal with maven / sbt to build their classpath. This could also be easily supported by having the
dependencies copied to a known directory under the backend’s build directory - not much different
from how things work today.

That leaves the YARN shuffle service. This is the only module where I see an assembly really adding
some benefit - deploying / updating the shuffle service on YARN is just a matter of copying /
replacing a single file (aside from configuration).


## Summary of benefits

Removing the assembly brings the following benefits:

* Builds are faster
* Build code is simplified
* Spark behaves more like usual maven-based applications w.r.t. building, packaging and deployment
* Possibility of minor code cleanups in other parts of the code base (launch scripts, code that
  starts Spark processes)
* More flexibility when embedding Spark into other applications

The cons of such a move are:

* Backwards compatibility, in case someone really depends on the assembly being there. We can have a
  dummy jar, but that only solves a trivial part of the compatibility problem.
* Running examples via spark-submit directly might become a little more complicated.
* Slightly more complicated code in the YARN backend, to deal with uploading all dependencies
  (instead of just a single file).


## What about the “provided” profiles?

This change would allow most of the “provided” profiles to become obsolete. The only profile that
should be kept is “hadoop-provided”, since it allows users to easily deploy a Spark package on top
of any Hadoop distribution.

The other profiles mostly exist to avoid repackaging dependencies for the examples (which is not
crucial) and the streaming backends (which would be handled by the suggestions made above).

## Links to Assembly-related issues

* https://issues.apache.org/jira/browse/OOZIE-2277

Oozie needs to do a lot of gymnastics to get Spark deployed because of the way it needs to run apps.
Since the Spark assembly is not a maven artifact, it is unrealistic for Oozie to use it.

* https://issues.apache.org/jira/browse/PIG-4667

Similar to Oozie. When not using an assembly, the YARN backend does the wrong thing (since it will
just upload the spark-yarn jar). The result is that projects either depend on the non-existent
assembly artifact or do what Oozie does.

* https://issues.apache.org/jira/browse/HIVE-7292

The link is to the umbrella issue tracking the project, but Hive-on-Spark solves the same problem in
yet another way. It would be much simpler if Hive could just depend on Spark directly instead of
somehow having to embed a Spark installation; a sketch of what that could look like follows.
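As an illustration of what “depending on Spark directly” could mean for projects like Oozie, Pig or
Hive once the assembly is gone, here is a minimal, hypothetical sbt sketch; the group and artifact
ids follow Spark’s published modules (spark-core, plus the spark-yarn module mentioned above), and
the version is only illustrative:

```scala
// Hypothetical downstream build definition: consume Spark as ordinary maven
// artifacts instead of embedding a Spark assembly. Version is illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  // Needed to submit applications to a YARN cluster without shipping an assembly.
  "org.apache.spark" %% "spark-yarn" % "1.5.0"
)
```

Combined with the YARN backend change sketched earlier, the jars resolved this way (or the plain
jars shipped in a Spark distribution) are what would end up on the cluster, with no assembly
involved at any point.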