
Conversation

@fangshil

@fangshil fangshil commented May 15, 2018

What changes were proposed in this pull request?

We should extend Spark's --packages option to support:

  1. Turning off transitive dependency resolution for a given artifact (like spark-avro)
  2. Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
  3. Resolving a given artifact with a custom ivy conf
  4. Excluding particular transitive dependencies of a given artifact. This helps when artifacts have conflicting transitive dependencies; currently we only have a top-level exclusion rule that applies to all artifacts.

New artifact spec to be reviewed:

Coordinates are split by ','.
Each coordinate should be provided in the format groupId:artifactId:version?param1=value1\&param2=value2:.. or groupId/artifactId:version?param1=value1\&param2=value2:..

The param splitter & must be escaped as \&, or the entire coordinates string must be enclosed in double quotes, since & is the background-process character in the shell (see the examples after the param list).

Optional params are 'classifier', 'transitive', 'exclude', 'conf':
classifier: the classifier of the artifact
transitive: whether to resolve transitive deps for the artifact
exclude: a list of transitive artifacts to exclude for this artifact (e.g. "a#b#c")
conf: the ivy conf of the artifact
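
For example, hypothetical invocations under the proposed spec (the artifact is real, but the exclude list and conf name here are purely illustrative):

./bin/spark-shell --packages "org.apache.hive:hive-exec:3.1.1?classifier=core&transitive=false"

./bin/spark-shell --packages org.apache.hive:hive-exec:3.1.1?exclude=json#guava\&conf=runtime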

We have tested this patch internally, and it greatly increases the flexibility of the --packages option.

How was this patch tested?

added unit test

@taklwu

taklwu commented Jul 30, 2018

Hi there, we ran into the same problem as well. Is there any progress on this patch?

@srowen
Member

srowen commented Jul 31, 2018

Hoo boy, I remember trying to make this work a long time ago. See #17416 . I wasn't able to. Let me run tests to see how this goes.

@SparkQA

SparkQA commented Jul 31, 2018

Test build #4223 has finished for PR 21339 at commit 30f378f.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@fangshil
Author

Thanks for the follow-up. I will rebase this patch onto the latest master.

@wangyum
Member

wangyum commented Nov 10, 2018

ping @fangshil

@fangshil
Author

fangshil commented Nov 14, 2018

Let's start the review process. I've rebased onto the latest master. @srowen could you trigger a new test?

@HeartSaVioR
Contributor

I'm interested in this issue, and happy to see there's an existing PR to evaluate.
@fangshil Could you explain how you tested the patch to verify the provided functionality? That would help reviewers construct test scenarios of their own.

@HeartSaVioR
Contributor

HeartSaVioR commented Mar 25, 2019

I've pulled this patch and played with the functionality, and it works well.

I've used hive-exec, which is published with a default classifier as well as a core classifier: with the default classifier, hive-exec bundles some dependencies into its jar (org.json:json is included), whereas with the core classifier it declares org.json:json as a dependency instead of bundling it.

./bin/spark-shell

scala> import org.json.JSON
<console>:24: error: object json is not a member of package org
       import org.json.JSON

It is not able to load, since it's not on the default Spark classpath. (expected)

./bin/spark-shell --packages "org.apache.hive:hive-exec:3.1.1?classifier=core"

scala> import org.json.JSON
import org.json.JSON

With the classifier, it is properly pulled in as a transitive dependency. (works well)

./bin/spark-shell --packages "org.apache.hive:hive-exec:3.1.1?classifier=core&transitive=false"

scala> import org.json.JSON
<console>:23: error: object json is not a member of package org
       import org.json.JSON

It skips all transitive dependencies, hence org.json:json is not pulled. (works well)

./bin/spark-shell --packages "org.apache.hive:hive-exec:3.1.1?classifier=core&exclude=json"

scala> import org.json.JSON
<console>:23: error: object json is not a member of package org
       import org.json.JSON

It excludes org.json:json from transitive dependencies. (works well)

./bin/spark-shell --packages "org.apache.hive:hive-exec:3.1.1"

scala> import org.json.JSON
import org.json.JSON

The default jar itself bundles org.json:json, so it should be able to load. (This shows the classifier handling works as expected.)

./bin/spark-shell --packages "org.apache.hive:hive-exec:3.1.1?exclude=json"

scala> import org.json.JSON
import org.json.JSON

Even though we exclude org.json:json, it can still be loaded, since the default jar bundles it. (This again shows the classifier handling works as expected.)

FYI I had to fix scalastyle before building. I'll start reviewing the code and leave comments.

cc. @srowen @gaborgsomogyi You might be interested in this PR.

Contributor

@HeartSaVioR HeartSaVioR left a comment

Left major comments first. Let me continue taking a look later, maybe in a couple of days.

import java.security.PrivilegedExceptionAction
import java.text.ParseException
import java.util.UUID
import java.util.Collections
Contributor

This should be placed before import java.util.UUID. Please run ./dev/scalastyle before pushing to the branch.

// Exclude dependencies(name separated by #) for this artifact
case "exclude" => pvalue.split("#").foreach { ex =>
dd.addExcludeRule(ivyConfName,
createExclusion("*:*" + ex + "*:*", ivySettings, ivyConfName))}
Contributor

@HeartSaVioR HeartSaVioR Mar 25, 2019

For now, if we want to exclude org.json:json we have to pass json as the parameter, and it will exclude every artifact whose name contains json. This behavior is very misleading: could we just let end users specify the groupId and artifactId explicitly, and use * only when really needed?
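
As a minimal sketch of that alternative (reusing the patch's existing createExclusion helper, and assuming the exclude value carries an explicit groupId:artifactId pair):

case "exclude" => pvalue.split("#").foreach { ex =>
  // ex is expected to be "groupId:artifactId"; users write "*" explicitly if needed,
  // e.g. exclude=org.json:json#*:guava
  dd.addExcludeRule(ivyConfName,
    createExclusion(ex + ":*", ivySettings, ivyConfName))
}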

@HeartSaVioR
Contributor

Also ping @fangshil

def params: String = if (extraParams.isEmpty) {
""
} else {
"?" + extraParams.map{ case (k, v) => (k + "=" + v) }.mkString(":")
Member

This won't pass the style check I think. "?" + extraParams.map { case (k, v) => k + "=" + v }.mkString(":")
But shouldn't it be mkString("&")?

* @param coordinates Comma-delimited string of maven coordinates
* @return Sequence of Maven coordinates
* Extracts artifact coordinates from a comma-delimited string. Coordinates should be provided
* in the format `groupId:artifactId:version?param1=value1\&param2\&value2:..` or
Member

I kind of thought the coordinate was groupId:artifactId:classifier:version or something, but I can't find a reference for that. OK, I guess this is as much as we can do here to make up a new syntax.

Contributor

+1 on this, I would do something like groupId:artifactId:version:classifier. Having classifier as a param is weird.

* in the format `groupId:artifactId:version?param1=value1\&param2\&value2:..` or
* `groupId/artifactId:version?param1=value1\&param2=value2:..`
*
* Param splitter & is the background process char in cli, so when multiple params is used,
Member

I don't think this comment is clear. "Shells usually use & as a reserved character, so escape these characters with \& or double-quote the whole argument if used"?

* Optional params are 'classifier', 'transitive', 'exclude', 'conf':
* classifier: classifier of the artifact
* transitive: whether to resolve transitive deps for the artifact
* exlude: exclude list of transitive artifacts for this artifact(e.g. "a#b#c")
Member

exlude -> exclude
I think this is all becoming overkill. Just add support for classifier. If someone needs something more complex, they shouldn't be using this mechanism.

Author

@srowen In our production environment, we also use the 'transitive', 'exclude' and 'conf' options quite often; the most commonly used one is 'transitive'. We think these options are flexible and important for supporting complex scenarios. Why do you think the --packages option should not support complex scenarios?

Contributor

At first glance I think the additional params mainly serve testing purposes, and I tend to agree with Sean. I would personally use the added params (except classifier) only when jars are not properly built (different versions in the classpath, etc.).

@fangshil
Author

@HeartSaVioR thanks for your review! I will address your comments shortly.

I saw you tested the 'classifier', 'transitive' and 'exclude' options. @srowen suggested we should only add 'classifier' instead of making it more complex, but in our prod env (where we've used this patch for over 2 years now), we do see a lot of use cases for the other options. What do you think?

@srowen
Member

srowen commented Mar 25, 2019

That really sounds like a case where you want a proper build expressing the dependencies of your project, and to package it with your app.

@HeartSaVioR
Contributor

HeartSaVioR commented Mar 25, 2019

I'm not sure about transitive and conf (they might be good to have), but I've been using exclude quite often in another project (DISCLAIMER: I actually introduced similar functionality there), since transitive dependencies often bring conflicts.

IMHO, once we support the --packages option, which is a mechanism for dynamic dependency control (as opposed to an uber jar), 'exclude' seems good for completeness of the feature. We can still control dependencies by packaging an uber jar, but there are situations where we tend to rely on --packages (spark-shell is one example, but notebooks or Spark submitter tools may also want to deal with --packages).

@srowen
Member

srowen commented Mar 25, 2019

This was not meant to replicate the type of dependency handling you get in a build tool. It's a convenience mostly for spark-shell, where you don't necessarily want to package your dependencies into a JAR to run the shell. Its use, like the shell's, is more for ad hoc experimentation. If you're dealing with dependencies complicated enough that you need to manage exclusions, you need to have a build for an app.

classifier is a pretty niche use case, but it has come up. Example: corenlp publishes language models as artifacts with classifiers. I can see wanting to actually use them in a simple shell, and you can't right now. Same for (as I recall) some deeplearning4j artifacts that have per-architecture variants.
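
(Under this PR's proposed syntax, that would presumably look something like the following; the version and classifier here are illustrative:)

./bin/spark-shell --packages "edu.stanford.nlp:stanford-corenlp:3.9.2?classifier=models"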

Anything more though I'm really not sure about.

@HeartSaVioR
Contributor

HeartSaVioR commented Mar 26, 2019

Having thought about this more: while I still feel supporting exclude would not hurt much (in complexity or maintenance cost), it may not be necessary in simple cases, because the Spark libs are added to the classpath before the transitive dependencies pulled by --packages.

Suppose there's a conflict between the Spark libs and the transitive dependencies which would cause problems:

  1. From Spark's point of view: Spark will not be bothered by the transitive dependencies if everything works as expected.
  2. From the dependency's point of view: the dependency will run into problems, but excluding from the dependency artifact's side will not help, since the problematic dependency is retained on the Spark side.

So exclude (and the others) are more suitable when pulling multiple artifacts with --packages (and also --jars) whose transitive dependencies need to be arranged; see the sketch below. If we want --packages to handle only simple cases, that may be OK, but as we know it wouldn't work for cases where fine-grained control of transitive dependencies is required (though in my experience only exclude came up; the others didn't).
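
As a sketch of that multi-artifact case (hypothetical coordinates; liba's conflicting transitive dependency is excluded so that libb's version wins):

./bin/spark-shell --packages "com.example:liba:1.0?exclude=guava,com.example:libb:2.0"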

@HeartSaVioR
Contributor

BTW, I'm curious whether there was a specific reason/decision around picking the library (Aether vs Ivy), because we tend to build support around Maven coordinates (the classifier in this case), which are natively supported in Aether, so there might be less pain to struggle with. Maybe I'm missing something that only Ivy supports and Aether does not.

@fangshil
Author

fangshil commented Mar 26, 2019

We have been supporting --packages internally for more than 2 years, and have restricted --packages to ad-hoc testing in the shell and notebooks. As a typical use case, a user starts spark-shell as part of the development process for easy debugging of pieces of the job logic, before assembling a whole prod job through a build tool. In this scenario, supporting a --packages option as powerful as a build tool was a requirement, especially for us (a Java company). In our experience, the four options we added have covered all the corner cases so far.

My thoughts are mostly aligned with @HeartSaVioR's; in addition, I think all 4 options we proposed are 'good for completeness of this feature'.

@srowen
Member

srowen commented Mar 26, 2019

I'm not strongly against this, as the extra complexity isn't high. Mostly I'm concerned that we'll continue to find corner cases and exceptions in this 'support', when it was never meant to be a full package manager.

Why Ivy? I am guessing because that's what Maven uses, or used, and we wanted to align the behavior as much as possible.

@HeartSaVioR
Contributor

HeartSaVioR commented Mar 26, 2019

Why Ivy? I am guessing because that's what Maven uses, or used, and we wanted to align the behavior as much as possible.

The reason I said Aether vs Ivy is that Aether has been much closer to Maven, and Maven pulled Aether into its codebase: it was EPLv1, and it has since been renamed to maven-artifact. (Whereas Ivy is integrated with Ant, and hence now lives in ant-ivy.)

https://github.com/apache/maven/tree/master/maven-artifact
https://issues.apache.org/jira/browse/MNG-6007
http://incubator.apache.org/ip-clearance/maven-aether.html

@srowen
Member

srowen commented Mar 26, 2019

Maven used Ivy in the past, right? Probably at the time this was implemented.

@HeartSaVioR
Contributor

HeartSaVioR commented Mar 26, 2019

Honestly, I don't know about the past, and it looks like Maven has been providing both as plugins. Maybe it's not that important.

I just feel that we wouldn't have to know how to represent Maven concepts in the Ivy API, like the special handling for the classifier (as I said, maven-artifact, previously Aether, can understand the Maven coordinate format natively), if we just migrated to maven-artifact; see the sketch below. Yes, it would require a lot of code change, but there would be less knowledge to be aware of: so it's less code change vs. less knowledge to maintain.
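
For illustration, a minimal sketch of what I mean (assuming the org.eclipse.aether artifact API that maven-artifact descends from):

import org.eclipse.aether.artifact.DefaultArtifact

// Aether-style coordinates carry the classifier natively:
// <groupId>:<artifactId>[:<extension>[:<classifier>]]:<version>
val artifact = new DefaultArtifact("org.apache.hive:hive-exec:jar:core:3.1.1")
artifact.getGroupId    // org.apache.hive
artifact.getArtifactId // hive-exec
artifact.getClassifier // core
artifact.getVersion    // 3.1.1

No custom ?classifier= extension to the coordinate syntax would be needed.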

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
