Skip to content

Conversation

@lresende
Copy link
Member

@lresende lresende commented Sep 6, 2016

What changes were proposed in this pull request?

This PR removes Kinesis from the release scripts as Kinesis license (Amazon Software License) is classified as category-x by Apache Legal and thus not allowed in Apache software projects releases.

How was this patch tested?

Automated builds

@srowen
Copy link
Member

srowen commented Sep 6, 2016

@lresende I don't think we actually distribute the kinesis code, because that profile just enables the kinesis assembly, and that does not cause this stuff to get built into a Spark jar that's distributed.

I traced through the POMs and indeed it doesn't seem to do anything except build this stand-alone assembly. I also downloaded the last release and grepped everything including JARs and found no kinesis code.

What I don't know is why this (and the ganglia profile) are enabled at all here. They're a no-op in the best case, and, seems like an accident waiting to happen. I'd agree with removing both. Paging @pwendell who might remember a reason they're there, and/or @vanzin

@vanzin
Copy link
Contributor

vanzin commented Sep 6, 2016

They're not in the assembly, but they're published to maven as a separate thing; can that be an issue?
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kinesis-asl_2.10

@srowen
Copy link
Member

srowen commented Sep 6, 2016

Aha, right. That one is probably OK as there's technically no third-party code distributed in those artifacts. However, this won't be OK: https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kinesis-asl-assembly_2.10 It's a bit of a gray area since this isn't an ASF release from apache.org, but a convenience binary. However I imagine that's still not quite OK. I think we'd have to stop publishing this.

@lresende
Copy link
Member Author

lresende commented Sep 6, 2016

@srowen, The Kinesis assembly has been published by Spark releases for a while. Here is the link to the 2.0 release on repository.apache.org :
https://repository.apache.org/service/local/repositories/releases/content/org/apache/spark/spark-streaming-kinesis-asl-assembly_2.11/2.0.0/spark-streaming-kinesis-asl-assembly_2.11-2.0.0.jar

@lresende
Copy link
Member Author

lresende commented Sep 6, 2016

As for the the Ganglia one, I will create another jira, to track that separately as this (Kinesis) one might involve more changes around the python and examples.

@srowen
Copy link
Member

srowen commented Sep 6, 2016

Leave the ganglia one because that artifact isn't an assembly after all. Publishing the Spark part of that without redistributing the rest is OK.

Really, here we need to only pull the kinesis assembly, and not the Spark kinesis artifact itself. Removing this profile does both. I think instead it has to be special-cased elsewhere as an artifact that can't be published?

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #65008 has finished for PR 14981 at commit 375d62e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lresende
Copy link
Member Author

lresende commented Sep 6, 2016

Spark kinesis has dependency on the kinesis client which is category-x

com.amazonaws
amazon-kinesis-client
${aws.kinesis.client.version}

Thus I would say we should also avoid to automatically publish it in a release and require a user to understand the license and build it themselves, at least this was my interpretation around what was discussed in https://issues.apache.org/jira/browse/LEGAL-198 and also at http://www.apache.org/legal/resolved.html#optional.

Based on this, excluding the profile from published-profiles avoid publishing both during a Spark release, but should not affect any of the other build pieces for now.

Thoughts ?

@srowen
Copy link
Member

srowen commented Sep 7, 2016

Good question. The Kinesis (non-assembly) artifact does not itself bundle any Amazon-licensed code. However it of course strongly depends on it.

But, the Kinesis artifact itself is optional with respect to Spark itself. So is it OK for an Apache project to publish an optional component, which depends non-optionally on a component with prohibited license?

The Flink part of LEGAL-198 concerned repackaging an alternate build of the Amazon Kinesis lib, which isn't quite the situation here.

I'm not against just not publishing any of this if it's a gray area. Is anyone willing to take it up with legal@ for a definitive answer? if not then yes in the end I agree with the change here.

@SparkQA
Copy link

SparkQA commented Sep 8, 2016

Test build #65107 has finished for PR 14981 at commit 6e3fec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 9, 2016

@vanzin and possibly @koeninger -- so, if we can't publish the Kinesis assembly, is there a purpose in having the module exist in the repo? would someone want to build that assembly from source? pardon if I've overlooked the obvious here.

If it exists only for publishing, then we could certainly just remove it to solve the immediate problem.

Whether the rest can be published still is part of LEGAL-198, and I'll follow that up with legal@ as needed.

@koeninger
Copy link
Contributor

My 2 cents are that we should make things as easy as possible for users, within the bounds of what ASF legal is willing to tolerate ;) Which probably means having it exist, but not published to mvn.

@srowen
Copy link
Member

srowen commented Sep 9, 2016

Agree with that @koeninger , mostly asking if the assembly module has any use if it's not published?

@koeninger
Copy link
Contributor

Isn't that mostly down to whether someone wants to just put the whole assembly on their classpath, vs install the project and depend on it in their build tool? I can see why someone would want the whole assembly, but don't think it's a big deal one way or the other.

In either case, shouldn't the modified instructions say something like

mvn package -Pkinesis-asl -pl external/kinesis-asl

there's no reason to rebuild all of spark just for that jar

@srowen
Copy link
Member

srowen commented Sep 9, 2016

I presume the best practice is certainly to build your deps into your app, and simplicity is good, all else equal. I suppose I'd remove this module but don't feel strongly about it. Removing it entirely solves this problem.

If it stays, then it has to be something that isn't published, and I'm not sure there's a clean way to do it other than just special-casing it in the release script. It's a tiny reason to consider removing it.

@koeninger
Copy link
Contributor

Yeah, if it's cleaner to remove the kinesis-asl-assembly module I don't think it's a serious hardship to users.

@vanzin
Copy link
Contributor

vanzin commented Sep 9, 2016

The assembly is mostly a convenience; one jar to be provided in the command line vs. multiple (and figuring out what those are).

For those packaging it with their app, maven takes care of things. For those not, the assembly provides some benefit. But that's mostly if it's easily available. If the users have to build it, then it's probably just easier to use --packages and point at the non-assembly artifact.

@lresende
Copy link
Member Author

lresende commented Sep 9, 2016

I would still wait for the feedback from legal before removing anything.

@srowen
Copy link
Member

srowen commented Sep 11, 2016

I personally think it's fine to go ahead and remove the Kinesis assembly module, because we know that can't be distributed. Separately, yes, the question is whether the non-assembly Ganglia and Kinesis modules are OK.

It's interesting to compare to the situation with the 'netlib-lgpl' profile. This is actually OK because there is no code at all in the project that links to any LGPL dependency here. It's entirely optional and provided at runtime.

@lresende
Copy link
Member Author

@srowen should I update this PR with the removal of kinesis assembly then ?

@srowen
Copy link
Member

srowen commented Sep 13, 2016

@lresende yes I think so. I think we can consider SPARK-17418 about the assembly, and SPARK-17422 about (possibly) removing the other artifacts depending on LEGAL-198.

@SparkQA
Copy link

SparkQA commented Sep 14, 2016

Test build #65345 has finished for PR 14981 at commit 07eb037.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MVN="build/mvn --force"
PUBLISH_PROFILES="-Pmesos -Pyarn -Phive -Phive-thriftserver -Phadoop-2.2"
PUBLISH_PROFILES="$PUBLISH_PROFILES -Pspark-ganglia-lgpl -Pkinesis-asl"
PUBLISH_PROFILES="$PUBLISH_PROFILES -Pspark-ganglia-lgpl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this moment, I think you still need this profile, because it causes the non-assembly module to be built and published.

@srowen
Copy link
Member

srowen commented Sep 14, 2016

Ah. The assembly is needed for Pyspark Kinesis tests. Not sure if you would know @koeninger but is there any easy way around that?

I'd say we can instead look at modifying the release script to just not upload it. I think that's still OK to the extent that we are allowed at all to distribute source that depends on a Category X license.

@koeninger
Copy link
Contributor

Yeah, I don't know of an easier workaround.

My understanding is also that the asf concern is tied to distribution, so
not publishing to maven should be sufficient.

On Sep 14, 2016 3:22 AM, "Sean Owen" [email protected] wrote:

Ah. The assembly is needed for Pyspark Kinesis tests. Not sure if you
would know @koeninger https://github.com/koeninger but is there any
easy way around that?

I'd say we can instead look at modifying the release script to just not
upload it. I think that's still OK to the extent that we are allowed at all
to distribute source that depends on a Category X license.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#14981 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAGAB2ybdIdY2TjizJ1UzYF51v5Zf7oxks5qp67SgaJpZM4J2N4T
.

@lresende
Copy link
Member Author

Ok, reverting the commit to remove kinesis assembly as the python tests are relying on it for the transient dependencies. Note that I was also trying to overcome this requirement by appending all the required dependencies in jars, but there seems to have other issues, in any case, these other experiments are available in the following branch https://github.com/lresende/spark/tree/remove-kinesis-assembly.

@srowen
Copy link
Member

srowen commented Sep 16, 2016

Yeah, it's either going to be turning off the profile entirely (if we can't distribute the non-assembly artifact), or leaving it on but manually excluding the assembly artifact.

Kinesis license (Amazon Software License) is classified as category-x
and is not allowed in Apache software projects releases.
@SparkQA
Copy link

SparkQA commented Sep 16, 2016

Test build #65486 has finished for PR 14981 at commit 6e3fec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 16, 2016

Test build #65488 has finished for PR 14981 at commit e892004.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 17, 2016

CC @rxin I think the direct but slightly hacky way to just address this issue is to modify release-build.sh around here ...

  # Remove any extra files generated during install
  find . -type f |grep -v \.jar |grep -v \.pom | xargs rm

  # Specially remove the Kinesis assembly so that it's not published
  rm -r external/kinesis-asl-assembly/target/*

  echo "Creating hash and signature files"

I can't say I 100% know that works but it looks right.

@lresende
Copy link
Member Author

lresende commented Sep 17, 2016

@srowen @rxin My understanding is that the mvn deploy is what takes care of actually publishing the files to maven staging repository :
`
$MVN -DzincPort=$ZINC_PORT --settings $tmp_settings -DskipTests $PUBLISH_PROFILES deploy

./dev/change-scala-version.sh 2.10
$MVN -DzincPort=$ZINC_PORT -Dscala-2.10 --settings $tmp_settings -DskipTests $PUBLISH_PROFILES clean deploy
`

So, the suggested fix to remove Kinesis from the $PUBLISH_PROFILES should take care or making sure Kinesis won't show up in the maven staging repository for the release.

@srowen Do you have other concerns ?

@srowen
Copy link
Member

srowen commented Sep 17, 2016

The issue is that this also removes the non assembly artifact from the release. That does not seem to be strictly needed license wise. It is easy and tidy though.

@lresende
Copy link
Member Author

Yes, and this is the intent. It's ok to have these in the source release (similar to ganglia) but we don't publish them in maven repository and it becomes available only if people goes and directly build them locally.

@srowen
Copy link
Member

srowen commented Sep 17, 2016

That isn't the conclusion I took from the discussion on legal-discuss - do you have a pointer? I took that it was at best ambiguous but not obviously prohibited to distribute these because they are optional wrt Spark.

@lresende
Copy link
Member Author

The pointer is exactly your quote on the e-mail to legal-discuss:

http://www.apache.org/legal/resolved.html#prohibited says:

CAN APACHE PROJECTS RELY ON COMPONENTS UNDER PROHIBITED LICENSES?
Apache projects cannot distribute any such components. As with the previous question on platforms, the component can be relied on if the component's licence terms do not affect the Apache product's licensing. For example, using a GPL'ed tool during the build is OK.

CAN APACHE PROJECTS RELY ON COMPONENTS WHOSE LICENSING AFFECTS THE APACHE PRODUCT?
Apache projects cannot distribute any such components. However, if the component is only needed for optional features, a project can provide the user with instructions on how to obtain and install the non-included work. Optional means that the component is not required for standard use of the product or for the product to achieve a desirable level of quality. The question to ask yourself in this situation is:

And I am being flexible here, and agreeing that that is ok to have the source distribution with the kinesis and ganglia modules, as long as we don't publish them into maven and require the users to build with the respective profiles in order to gain access to these modules in their application.

@srowen
Copy link
Member

srowen commented Sep 17, 2016

I am referring to http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201609.mbox/<CAN5cbe4pD6XOeGmBJ35bpY-4yJsSTHKEpFEHBM8m8fsrP6B8hw%40mail.gmail.com>

I don't think it is up to us being 'flexible' or not. I also don't actually see that a source vs binary distinction is drawn here either. Indeed there is a question whether even that is permitted.

But I do not see any conclusive argument that this isn't permitted.

@srowen
Copy link
Member

srowen commented Sep 18, 2016

Yeah #14981 (comment) is certainly the argument against including them, that's clear.

But I also outlined the argument 'for': you could also view the publishing of Maven artifacts as just the 'instructions on how to obtain and install the non-included work'. On its face, it seems to be about something else; it seems to mean that ASF projects may simply provide instructions about getting the Kinesis client. But how Spark supposed to optionally work with Kinesis client with zero integration code? This is only true in quite rare cases (like, netlib-java!). This doesn't sound like the intent, to forbid any integration code in a project, as long as the integration itself is optional. And it is.

I want to be pretty sure there's no decent argument for keeping the artifacts, because stopping publishing them affects users.

The ongoing thread has no authoritative outcome. The 'safe' thing to do from a policy perspective is not to distribute these artifacts, I suppose. I admit, I myself am neutral, because I have no particular interest in these integrations. However, I also don't see anything here that suggests binary is different from source. Source is actually the authoritative artifact from the ASF, so "only" publishing source doesn't change things. That is, we'd be arguing that the whole module has to be deleted.

If it must be, so be it, but I don't yet see it must be.

What we do know is that the kinesis assembly has to go. If we're about to release 2.0.1, we must do that. I suggest we not block 2.0.1 on further removal of artifacts unless an authoritative position emerges before its release that says they must go.

@lresende
Copy link
Member Author

@srowen Please don't get me wrong, I don't have any interest on this extension either, but just want to make sure we start doing the right thing for Apache Spark. I will try to ping some of the long time Apache members on the legal-discussion and see if we can get a definitive/authoritative answer for this particular case.

@JoshRosen
Copy link
Contributor

Has there been any authoritative answer on whether we are prohibited from publishing non-assembly Spark Kinesis artifacts to Maven? I read through the thread on legal-discuss but it looks like there's still not a clear outcome as of Sunday the 18th.

@srowen
Copy link
Member

srowen commented Sep 20, 2016

Not that I know of. I believe it's reasonable to say that there's a valid open question, and a reasonable argument at this stage that the non-assembly artifacts are allowed. If a 2.0.1 release is happening shortly I don't think it should block.

however the assembly still has to go at least, and the possibilities are the 'hack' I mentioned above, or somehow modifying the Python tests to not need the kinesis assembly, because that would mean the whole assembly module could go.

@JoshRosen
Copy link
Contributor

In the past I came very close to being able to remove the Python tests' dependencies on the streaming assemblies, so I'm going to see whether I can revive any of those old patches to completely remove the assembly module.

@JoshRosen
Copy link
Contributor

After some investigation I think that it's going to be pretty hard to remove the use of the assembly in PySpark streaming tests and definitely a change outside of the scope of what we'd want to do in a point release.

I'm going to take a look to see how much work it would be to prevent the publication of only the assembly artifact in case it turns out that we're able to continue publishing the non-assembly module.

@JoshRosen
Copy link
Contributor

I've gone ahead and opened #15167, which only prevents the publication of the assembly but continues to publish the non-assembly artifact.

@lresende lresende closed this Sep 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants