-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17418] Remove Kinesis artifacts from Spark release scripts #14981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@lresende I don't think we actually distribute the kinesis code, because that profile just enables the kinesis assembly, and that does not cause this stuff to get built into a Spark jar that's distributed. I traced through the POMs and indeed it doesn't seem to do anything except build this stand-alone assembly. I also downloaded the last release and grepped everything including JARs and found no kinesis code. What I don't know is why this (and the ganglia profile) are enabled at all here. They're a no-op in the best case, and, seems like an accident waiting to happen. I'd agree with removing both. Paging @pwendell who might remember a reason they're there, and/or @vanzin |
|
They're not in the assembly, but they're published to maven as a separate thing; can that be an issue? |
|
Aha, right. That one is probably OK as there's technically no third-party code distributed in those artifacts. However, this won't be OK: https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kinesis-asl-assembly_2.10 It's a bit of a gray area since this isn't an ASF release from apache.org, but a convenience binary. However I imagine that's still not quite OK. I think we'd have to stop publishing this. |
|
@srowen, The Kinesis assembly has been published by Spark releases for a while. Here is the link to the 2.0 release on repository.apache.org : |
|
As for the the Ganglia one, I will create another jira, to track that separately as this (Kinesis) one might involve more changes around the python and examples. |
|
Leave the ganglia one because that artifact isn't an assembly after all. Publishing the Spark part of that without redistributing the rest is OK. Really, here we need to only pull the kinesis assembly, and not the Spark kinesis artifact itself. Removing this profile does both. I think instead it has to be special-cased elsewhere as an artifact that can't be published? |
|
Test build #65008 has finished for PR 14981 at commit
|
|
Spark kinesis has dependency on the kinesis client which is category-x Thus I would say we should also avoid to automatically publish it in a release and require a user to understand the license and build it themselves, at least this was my interpretation around what was discussed in https://issues.apache.org/jira/browse/LEGAL-198 and also at http://www.apache.org/legal/resolved.html#optional. Based on this, excluding the profile from published-profiles avoid publishing both during a Spark release, but should not affect any of the other build pieces for now. Thoughts ? |
|
Good question. The Kinesis (non-assembly) artifact does not itself bundle any Amazon-licensed code. However it of course strongly depends on it. But, the Kinesis artifact itself is optional with respect to Spark itself. So is it OK for an Apache project to publish an optional component, which depends non-optionally on a component with prohibited license? The Flink part of LEGAL-198 concerned repackaging an alternate build of the Amazon Kinesis lib, which isn't quite the situation here. I'm not against just not publishing any of this if it's a gray area. Is anyone willing to take it up with legal@ for a definitive answer? if not then yes in the end I agree with the change here. |
375d62e to
6e3fec4
Compare
|
Test build #65107 has finished for PR 14981 at commit
|
|
@vanzin and possibly @koeninger -- so, if we can't publish the Kinesis assembly, is there a purpose in having the module exist in the repo? would someone want to build that assembly from source? pardon if I've overlooked the obvious here. If it exists only for publishing, then we could certainly just remove it to solve the immediate problem. Whether the rest can be published still is part of LEGAL-198, and I'll follow that up with legal@ as needed. |
|
My 2 cents are that we should make things as easy as possible for users, within the bounds of what ASF legal is willing to tolerate ;) Which probably means having it exist, but not published to mvn. |
|
Agree with that @koeninger , mostly asking if the assembly module has any use if it's not published? |
|
Isn't that mostly down to whether someone wants to just put the whole assembly on their classpath, vs install the project and depend on it in their build tool? I can see why someone would want the whole assembly, but don't think it's a big deal one way or the other. In either case, shouldn't the modified instructions say something like mvn package -Pkinesis-asl -pl external/kinesis-asl there's no reason to rebuild all of spark just for that jar |
|
I presume the best practice is certainly to build your deps into your app, and simplicity is good, all else equal. I suppose I'd remove this module but don't feel strongly about it. Removing it entirely solves this problem. If it stays, then it has to be something that isn't published, and I'm not sure there's a clean way to do it other than just special-casing it in the release script. It's a tiny reason to consider removing it. |
|
Yeah, if it's cleaner to remove the kinesis-asl-assembly module I don't think it's a serious hardship to users. |
|
The assembly is mostly a convenience; one jar to be provided in the command line vs. multiple (and figuring out what those are). For those packaging it with their app, maven takes care of things. For those not, the assembly provides some benefit. But that's mostly if it's easily available. If the users have to build it, then it's probably just easier to use |
|
I would still wait for the feedback from legal before removing anything. |
|
I personally think it's fine to go ahead and remove the Kinesis assembly module, because we know that can't be distributed. Separately, yes, the question is whether the non-assembly Ganglia and Kinesis modules are OK. It's interesting to compare to the situation with the 'netlib-lgpl' profile. This is actually OK because there is no code at all in the project that links to any LGPL dependency here. It's entirely optional and provided at runtime. |
|
@srowen should I update this PR with the removal of kinesis assembly then ? |
|
@lresende yes I think so. I think we can consider SPARK-17418 about the assembly, and SPARK-17422 about (possibly) removing the other artifacts depending on LEGAL-198. |
|
Test build #65345 has finished for PR 14981 at commit
|
| MVN="build/mvn --force" | ||
| PUBLISH_PROFILES="-Pmesos -Pyarn -Phive -Phive-thriftserver -Phadoop-2.2" | ||
| PUBLISH_PROFILES="$PUBLISH_PROFILES -Pspark-ganglia-lgpl -Pkinesis-asl" | ||
| PUBLISH_PROFILES="$PUBLISH_PROFILES -Pspark-ganglia-lgpl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this moment, I think you still need this profile, because it causes the non-assembly module to be built and published.
|
Ah. The assembly is needed for Pyspark Kinesis tests. Not sure if you would know @koeninger but is there any easy way around that? I'd say we can instead look at modifying the release script to just not upload it. I think that's still OK to the extent that we are allowed at all to distribute source that depends on a Category X license. |
|
Yeah, I don't know of an easier workaround. My understanding is also that the asf concern is tied to distribution, so On Sep 14, 2016 3:22 AM, "Sean Owen" [email protected] wrote:
|
07eb037 to
6e3fec4
Compare
|
Ok, reverting the commit to remove kinesis assembly as the python tests are relying on it for the transient dependencies. Note that I was also trying to overcome this requirement by appending all the required dependencies in jars, but there seems to have other issues, in any case, these other experiments are available in the following branch https://github.com/lresende/spark/tree/remove-kinesis-assembly. |
|
Yeah, it's either going to be turning off the profile entirely (if we can't distribute the non-assembly artifact), or leaving it on but manually excluding the assembly artifact. |
Kinesis license (Amazon Software License) is classified as category-x and is not allowed in Apache software projects releases.
6e3fec4 to
e892004
Compare
|
Test build #65486 has finished for PR 14981 at commit
|
|
Test build #65488 has finished for PR 14981 at commit
|
|
CC @rxin I think the direct but slightly hacky way to just address this issue is to modify I can't say I 100% know that works but it looks right. |
|
@srowen @rxin My understanding is that the mvn deploy is what takes care of actually publishing the files to maven staging repository : ./dev/change-scala-version.sh 2.10 So, the suggested fix to remove Kinesis from the $PUBLISH_PROFILES should take care or making sure Kinesis won't show up in the maven staging repository for the release. @srowen Do you have other concerns ? |
|
The issue is that this also removes the non assembly artifact from the release. That does not seem to be strictly needed license wise. It is easy and tidy though. |
|
Yes, and this is the intent. It's ok to have these in the source release (similar to ganglia) but we don't publish them in maven repository and it becomes available only if people goes and directly build them locally. |
|
That isn't the conclusion I took from the discussion on legal-discuss - do you have a pointer? I took that it was at best ambiguous but not obviously prohibited to distribute these because they are optional wrt Spark. |
|
The pointer is exactly your quote on the e-mail to legal-discuss: http://www.apache.org/legal/resolved.html#prohibited says:CAN APACHE PROJECTS RELY ON COMPONENTS UNDER PROHIBITED LICENSES? CAN APACHE PROJECTS RELY ON COMPONENTS WHOSE LICENSING AFFECTS THE APACHE PRODUCT? And I am being flexible here, and agreeing that that is ok to have the source distribution with the kinesis and ganglia modules, as long as we don't publish them into maven and require the users to build with the respective profiles in order to gain access to these modules in their application. |
|
I am referring to http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201609.mbox/<CAN5cbe4pD6XOeGmBJ35bpY-4yJsSTHKEpFEHBM8m8fsrP6B8hw%40mail.gmail.com> I don't think it is up to us being 'flexible' or not. I also don't actually see that a source vs binary distinction is drawn here either. Indeed there is a question whether even that is permitted. But I do not see any conclusive argument that this isn't permitted. |
|
Yeah #14981 (comment) is certainly the argument against including them, that's clear. But I also outlined the argument 'for': you could also view the publishing of Maven artifacts as just the 'instructions on how to obtain and install the non-included work'. On its face, it seems to be about something else; it seems to mean that ASF projects may simply provide instructions about getting the Kinesis client. But how Spark supposed to optionally work with Kinesis client with zero integration code? This is only true in quite rare cases (like, netlib-java!). This doesn't sound like the intent, to forbid any integration code in a project, as long as the integration itself is optional. And it is. I want to be pretty sure there's no decent argument for keeping the artifacts, because stopping publishing them affects users. The ongoing thread has no authoritative outcome. The 'safe' thing to do from a policy perspective is not to distribute these artifacts, I suppose. I admit, I myself am neutral, because I have no particular interest in these integrations. However, I also don't see anything here that suggests binary is different from source. Source is actually the authoritative artifact from the ASF, so "only" publishing source doesn't change things. That is, we'd be arguing that the whole module has to be deleted. If it must be, so be it, but I don't yet see it must be. What we do know is that the kinesis assembly has to go. If we're about to release 2.0.1, we must do that. I suggest we not block 2.0.1 on further removal of artifacts unless an authoritative position emerges before its release that says they must go. |
|
@srowen Please don't get me wrong, I don't have any interest on this extension either, but just want to make sure we start doing the right thing for Apache Spark. I will try to ping some of the long time Apache members on the legal-discussion and see if we can get a definitive/authoritative answer for this particular case. |
|
Has there been any authoritative answer on whether we are prohibited from publishing non-assembly Spark Kinesis artifacts to Maven? I read through the thread on |
|
Not that I know of. I believe it's reasonable to say that there's a valid open question, and a reasonable argument at this stage that the non-assembly artifacts are allowed. If a 2.0.1 release is happening shortly I don't think it should block. however the assembly still has to go at least, and the possibilities are the 'hack' I mentioned above, or somehow modifying the Python tests to not need the kinesis assembly, because that would mean the whole assembly module could go. |
|
In the past I came very close to being able to remove the Python tests' dependencies on the streaming assemblies, so I'm going to see whether I can revive any of those old patches to completely remove the assembly module. |
|
After some investigation I think that it's going to be pretty hard to remove the use of the assembly in PySpark streaming tests and definitely a change outside of the scope of what we'd want to do in a point release. I'm going to take a look to see how much work it would be to prevent the publication of only the assembly artifact in case it turns out that we're able to continue publishing the non-assembly module. |
|
I've gone ahead and opened #15167, which only prevents the publication of the assembly but continues to publish the non-assembly artifact. |
What changes were proposed in this pull request?
This PR removes Kinesis from the release scripts as Kinesis license (Amazon Software License) is classified as category-x by Apache Legal and thus not allowed in Apache software projects releases.
How was this patch tested?
Automated builds