@zsxwing
Member

@zsxwing zsxwing commented Jun 15, 2015

Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.

@srowen
Member

srowen commented Jun 15, 2015

This is built into the assembly though, right?

@zsxwing
Member Author

zsxwing commented Jun 15, 2015

This is built into the assembly though, right?

No. Spark Flume Sink does not assemble the dependencies. Actually, now we don't have an assembly jar for Flume.

@srowen
Member

srowen commented Jun 15, 2015

OK but surely it's easier to make an assembly target than tell people they have to piece together the dependencies and keep updating docs about it?

@zsxwing
Member Author

zsxwing commented Jun 15, 2015

OK but surely it's easier to make an assembly target than tell people they have to piece together the dependencies and keep updating docs about it?

ping @tdas about the assembly idea.

@SparkQA

SparkQA commented Jun 15, 2015

Test build #34936 has finished for PR 6829 at commit f8617f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Jun 17, 2015

I think the assembly is a good idea, though for that we will have to:

  1. Publish the assembly JAR instead of the package JAR. It would be cumbersome to add an additional flume-sink-assembly directory for this. Maybe we can make the existing flume-sink project generate and publish the assembly instead of the package.
  2. Update the instructions. I am not sure this will have much impact on existing deployments, because they are supposed to download and run the version of the sink that matches the version of Spark they are running.

@harishreedharan What do you think about this?

@JoshRosen
Contributor

Jenkins, retest this please.

@JoshRosen
Contributor

(I'm retesting this to see whether our new Jenkins PRB script is properly skipping the tests for doc-only changes)

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35074 has finished for PR 6829 at commit f8617f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@harishreedharan
Contributor

+1 on creating assembly. I am not entirely sure what it takes to generate the assembly, but if it is possible to add that to the current sink module, that would be great. I doubt this would affect any existing deployments in any way.

@tdas
Contributor

tdas commented Jun 18, 2015

@zsxwing Then let's try to build an assembly. For the benefit of branch-1.4, though, I am going to merge this PR to master and 1.4 (so that the docs are updated for 1.4.1). Let's then create a separate JIRA and PR for the assembly.

asfgit pushed a commit that referenced this pull request Jun 18, 2015
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.

Author: zsxwing <[email protected]>

Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:

f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc

(cherry picked from commit 24e5379)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit asfgit closed this in 24e5379 Jun 18, 2015
@zsxwing
Member Author

zsxwing commented Jun 19, 2015

@srowen do you have an example of publishing both the single jar and the assembly jar?

Two approaches I'm thinking about:

  1. Upload the assembly jar in the same artifact: flume-sink.
  2. Create a new artifact for the assembly jar just like the kafka-assembly module. And we need to generate a new pom.xml for the assembly jar.

I prefer 1 because I don't know how to implement 2 in maven. What do you think?

@zsxwing zsxwing deleted the flume-sink-dep branch June 19, 2015 01:58
@tdas
Contributor

tdas commented Jun 19, 2015

Actually, I don't think we can publish two artifacts from the same project. Nor
do we need to. No one needs the single jar for anything, only the
assembly jar. Rather, we should just publish the assembly. Maybe
keep the same name and publish the assembly jar instead.


@zsxwing
Member Author

zsxwing commented Jun 19, 2015

Publishing the single jar would be helpful if people run into dependency conflicts or want to upgrade the version of a dependency library, and want to resolve that on their own.

I think people won't use the assembly jar in a pom.xml. So I think we can publish the assembly jar under the same artifact.

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since apache#5703. This PR updates the doc for it.

Author: zsxwing <[email protected]>

Closes apache#6829 from zsxwing/flume-sink-dep and squashes the following commits:

f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc
@srowen
Member

srowen commented Jun 19, 2015

Oh, I didn't realize you wanted to publish the assembly JAR. I don't think it makes sense to publish assemblies as Maven artifacts, right? Anyone who uses it via Maven does not want an assembly, and it causes a bunch of problems. So no, please don't publish the assembly that way.

You just need a target to build the assembly, right? That's just a matter of adding a plugin.

(You can publish multiple artifacts under different classifiers for one group/artifact, but this isn't the situation that would be used for.)
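
For illustration, "adding a plugin" could look roughly like the fragment below. This is a minimal sketch, assuming the stock maven-assembly-plugin with its built-in `jar-with-dependencies` descriptor; it is not taken from the Spark build, which may use a different plugin, and the execution id `make-assembly` is an arbitrary name.

```xml
<!-- Hypothetical fragment for the module's pom.xml, inside <build><plugins> -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- Built-in descriptor: bundles the module's classes with its dependencies -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <!-- Build the assembly as part of the normal package phase -->
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With the plugin's default settings the resulting jar should be attached next to the regular jar under the `jar-with-dependencies` classifier, which is the classifier mechanism mentioned in the parenthetical above.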

@tdas
Contributor

tdas commented Jun 19, 2015

The idea behind publishing the assembly (which, BTW, is not that big, as it
should contain only the sink code + Scala + Commons Lang 3) is the same as
for the package JAR we publish today: people can directly download
the JAR and put it in Flume to run the sink. So the only difference
from the current behavior is that instead of publishing just the
sink code, we would publish sink code + Scala + Commons Lang 3, which makes
it more self-contained.


@srowen
Member

srowen commented Jun 19, 2015

Yeah, I'm not worried about size. Maven isn't really the right place to distribute assemblies, as an assembly is not something to depend on. Yes, it just needs to be downloadable. I get that Maven artifacts are still a pretty easy way to make it available. In that case maybe a new module makes sense after all: flume-assembly. Or if you're saying nobody would ever use the existing non-assembly artifact anyway, I can see changing it to the assembly. But that implies the existing module would never be used as a dependency.

@harishreedharan
Contributor

So the only reason for an assembly would be to add commons-lang3. I am more in favor of removing that dependency than making the build more complex. Flume already has scala in classpath (since that is pulled in by the Kafka dependency). I am inclined to keep this component as simple as possible and depend only on stuff already pulled in by Flume into its own classpath anyway.

@JoshRosen
Contributor

How extensive is our use of commons Lang 3 in flume sink? If we only use one class or method maybe we can just copy the source into our repository, depending on how complex or large it is.

@harishreedharan
Contributor

This is the only usage:
private val seqBase = RandomStringUtils.randomAlphanumeric(8)

I am inclined to just copy the method into the class.
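
A local replacement for that single call could be sketched as follows. This is only an illustration under stated assumptions, not the code from the follow-up PR: the object name `SeqBaseSketch` and the optional `rng` parameter are hypothetical, and the helper mirrors only the documented character set of `randomAlphanumeric` (ASCII letters and digits).

```scala
import scala.util.Random

// Hypothetical helper; the name SeqBaseSketch is not from the Spark code base.
object SeqBaseSketch {
  // Characters to draw from: a-z, A-Z and 0-9.
  private val alphabet: String =
    (('a' to 'z') ++ ('A' to 'Z') ++ ('0' to '9')).mkString

  /** Returns a random string of `count` ASCII letters and digits. */
  def randomAlphanumeric(count: Int, rng: Random = new Random()): String = {
    require(count >= 0, s"count must be non-negative, got $count")
    // Pick one character per position, uniformly from the alphabet.
    val chars = Array.fill(count)(alphabet.charAt(rng.nextInt(alphabet.length)))
    new String(chars)
  }
}
```

The existing line would then become `private val seqBase = SeqBaseSketch.randomAlphanumeric(8)`, with no Commons Lang 3 import needed.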

@harishreedharan
Contributor

I will open a PR later today for this one.

@tdas
Contributor

tdas commented Jun 19, 2015

Yep, let's just remove the dependency on Commons Lang. However, what is the
version of Scala that comes with Kafka? Scala 2.10 or 2.11? This is
something we need to document, since Spark artifacts are now published for
both 2.10 and 2.11.

On a related note, should we bump the Flume support to the latest Flume? It
seems to be stuck at 1.4.0 whereas 1.6.0 is out. Does it matter?


@harishreedharan
Contributor

Kafka brings in 2.10.

@tdas
Contributor

tdas commented Jun 19, 2015

I don't see a dependency on Kafka in Flume 1.4.0:
http://mvnrepository.com/artifact/org.apache.flume/flume-ng-sdk/1.4.0

What am I missing?


@srowen
Member

srowen commented Jun 19, 2015

That looks like just the API module. I suspect it comes via the actual implementation such as in http://mvnrepository.com/artifact/org.apache.flume/flume-ng-sources/1.6.0 but I don't know Flume well.

@tdas
Contributor

tdas commented Jun 19, 2015

I see. So Kafka is present only through the flume-kafka-source:
http://mvnrepository.com/artifact/org.apache.flume.flume-ng-sources/flume-kafka-source/1.6.0

Furthermore, this is not available for Flume 1.4.0, as the Kafka source was added
only in 1.6.0.

So here are two questions:

  1. Do installations of Flume always have all the sources loaded? If not,
    then it's an incorrect assumption that Scala will always be present.
  2. Even if 1 is true, we have to upgrade Flume in Spark Streaming to
    version 1.6.0 for this to be feasible. That's a whole different issue.

I don't know enough about Flume, but I would be very surprised if the Kafka
source is always loaded in the classpath in all Flume installations.

@harishreedharan please comment.


@harishreedharan
Contributor

Yes, all of the libs in the flume-ng/lib directory get added to the classpath, so Scala would be added to the classpath but loaded only as required (which is normal JVM behavior). We'd have to bump our dependency to 1.6.0 for Scala to be automagically available.

Even if we don't upgrade, we don't need to change the dependency set, as the behavior is the same as before (add Scala to the flume-ng/lib or plugins dir). Apart from the assembly part, nothing else changes. I am sending a PR soon to get rid of the commons-lang3 dependency anyway.
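
As an aside, whether a given library is actually resolvable on a JVM's classpath can be checked with a small probe like the one below. This is a hedged sketch, not part of Flume or Spark: `ClasspathProbe` and `isAvailable` are hypothetical names, and the probe only reports whether a class resolves from the given class loader.

```scala
import scala.util.Try

// Hypothetical utility for checking whether a class can be resolved.
object ClasspathProbe {
  /** True if `className` resolves from `loader`. Passing `initialize = false`
    * loads the class without running its static initializers. */
  def isAvailable(
      className: String,
      loader: ClassLoader = Thread.currentThread().getContextClassLoader): Boolean =
    Try(Class.forName(className, false, loader)).isSuccess
}
```

For example, `ClasspathProbe.isAvailable("scala.Option")` run inside the agent's JVM would indicate whether the Scala library is visible there.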

@zsxwing
Member Author

zsxwing commented Jun 23, 2015

I agree with @srowen that Maven isn't really the right place to distribute assemblies. For the assembly jars, all we need to do is provide download links for people.

I see that these jars are already assembled into spark-examples-....jar. How about adding separate assembly jars to the final distribution "tgz" file?

@harishreedharan
Contributor

Now that we don't have the commons-lang3 dependency in the flume-sink anymore, the assembly question for this module is moot. But if we want to have a more general discussion, we should perhaps move this discussion to a jira or the dev list?
