
Conversation

@jey (Contributor) commented Aug 11, 2013

Initial patch to allow one Spark binary to target multiple Hadoop versions (see the sketch after this list). Has a few bugs that I'll submit fixes for shortly:

  • sbt clean doesn't work
  • unnecessarily recompiles some source files when $SPARK_HADOOP_VERSION is changed
  • doesn't work with hadoop-0.23.x
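
A minimal sketch of the env-driven approach, assuming the build reads $SPARK_HADOOP_VERSION to select the Hadoop dependency; the coordinates and the 1.2.1 default are illustrative assumptions, not the patch's actual build code:

    // Hypothetical SparkBuild.scala fragment: choose the Hadoop client artifact
    // from the SPARK_HADOOP_VERSION environment variable at build time.
    val hadoopVersion = sys.env.getOrElse("SPARK_HADOOP_VERSION", "1.2.1")

    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion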

Member (inline review comment):

Small code organization suggestion -- factor this into spark.Utils, because you have another copy of this method in HadoopMapReduceUtil.

Contributor Author (inline review reply):

Just gave that a shot, but it turns out that spark.Utils is private to the package. Do we want to increase its visibility enough to be available from org.apache.hadoop?
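
For context, a minimal sketch of the visibility issue under discussion, assuming the duplicated helper is a reflection-based class lookup; all names here are illustrative, not the PR's actual code:

    package spark

    // Package-private, so code under org.apache.hadoop (like HadoopMapReduceUtil)
    // cannot call it without widening the visibility.
    private[spark] object Utils {
      // Hypothetical shared helper: try the Hadoop 2 class name, fall back to Hadoop 1.
      def firstAvailableClass(first: String, second: String): Class[_] =
        try Class.forName(first)
        catch { case _: ClassNotFoundException => Class.forName(second) }
    }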

@jey (Contributor Author) commented Aug 11, 2013

Another known bug: some unit tests won't pass when running against YARN

@mridulm (Contributor) commented Aug 11, 2013

Any reason to move yarn from inside core to outside? I don't think it makes sense outside of core...

CC'ing @tgravescs

@jey (Contributor Author) commented Aug 11, 2013

We are working to unify the Spark binaries so they won't have to be rebuilt for each Hadoop version. This involved moving YARN support out of core, because the YARN APIs are not available when building against MapReduce v1 versions of Hadoop. The only effect this has on YARN users is that the SPARK_JAR environment variable now has to point to the spark-yarn assembly JAR instead of the spark-core assembly used previously.
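
A minimal sketch of the launch-side change, assuming the YARN client resolves the assembly through SPARK_JAR; the fragment is illustrative, not the actual client code:

    // Hypothetical YARN-client fragment: SPARK_JAR must now point at the
    // spark-yarn assembly rather than the spark-core assembly.
    val sparkJar = sys.env.getOrElse("SPARK_JAR",
      sys.error("SPARK_JAR must be set to the spark-yarn assembly JAR"))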

@AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/531/

@AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/533/

@AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/542/

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/543/

@mateiz (Member) commented Aug 12, 2013

Hey Jey, for the default Hadoop version, I actually suggested 1.2.1, not 1.1.2.

@jey (Contributor Author) commented Aug 12, 2013

Oops. Fixed.

@AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/554/

@jey (Contributor Author) commented Aug 12, 2013

Here's a pretty strange error from Jenkins. It looks like possibly a bug in Scala's JPropertiesWrapper.toMap method? Or maybe some other thread is calling System.setProperty at around the same time?

[info] DriverSuite:
Exception in thread "main" java.util.ConcurrentModificationException
    at java.util.Hashtable$Enumerator.next(Hashtable.java:1200)
    at scala.collection.JavaConversions$JPropertiesWrapper$$anon$3.next(JavaConversions.scala:933)
    at scala.collection.JavaConversions$JPropertiesWrapper$$anon$3.next(JavaConversions.scala:930)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.collection.JavaConversions$JPropertiesWrapper$$anon$3.foreach(JavaConversions.scala:930)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:73)
    at scala.collection.JavaConversions$JPropertiesWrapper.foreach(JavaConversions.scala:903)
    at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:256)
    at scala.collection.JavaConversions$JPropertiesWrapper.toMap(JavaConversions.scala:903)
    at spark.SparkContext.<init>(SparkContext.scala:254)
    at spark.DriverWithoutCleanup$.main(DriverSuite.scala:53)
    at spark.DriverWithoutCleanup.main(DriverSuite.scala)
[info] - driver should exit after finishing *** FAILED ***
[info]   SparkException was thrown during property evaluation. (DriverSuite.scala:37)
[info]   Message: Process List(./run, spark.DriverWithoutCleanup, local-cluster[2,1,512]) exited with code 1
[info]   Occurred at table row 1 (zero based, not counting headings), which had values (
[info]     master = local-cluster[2,1,512]
[info]   )
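
To illustrate the suspected race, a minimal sketch (not Spark code): iterating java.util.Properties — a Hashtable — through the Scala wrapper while another thread mutates it can fail exactly as in the trace above, and snapshotting with clone() first avoids it, since Hashtable.clone() is synchronized:

    import scala.collection.JavaConversions._

    val props = System.getProperties

    // Throws ConcurrentModificationException if another thread calls
    // System.setProperty while the wrapper is being iterated:
    // val snapshot = props.toMap

    // Safer: copy under the Hashtable's own lock, then convert the copy.
    val snapshot = props.clone().asInstanceOf[java.util.Properties].toMap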

Jenkins, retest this please.

@jey (Contributor Author) commented Aug 12, 2013

BTW, unit tests pass under YARN too after 1a0607a fixed a silly mistake in bin/update-classpath.sh

@jey (Contributor Author) commented Aug 12, 2013

Jenkins, retest this please.

@AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/555/

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/598/

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/599/

@mridulm (Contributor) commented Aug 15, 2013

Moving the yarn code out of core does not look like a good decision - it does not make sense as a top-level module.
It would be better to simply have a different profile that adds the yarn-specific targets and source subdirectories to the build, with the default non-yarn build excluding the yarn subdirectory.
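
A minimal sketch of the profile-style alternative being proposed, expressed in sbt (a Maven profile would be analogous); the flag name and directory layout are illustrative assumptions:

    // Hypothetical build fragment: compile the YARN sources inside core only
    // when a flag is set, instead of hosting them as a separate top-level module.
    lazy val yarnEnabled = sys.env.contains("SPARK_WITH_YARN")

    lazy val core = (project in file("core")).settings(
      unmanagedSourceDirectories in Compile ++= {
        if (yarnEnabled) Seq(baseDirectory.value / "src" / "hadoop2-yarn" / "scala")
        else Seq.empty
      }
    )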

@jey (Contributor Author) commented Aug 15, 2013

It's simpler to have a different module, especially due to Maven insanity. The only way this materially affects YARN users is that they need to provide the path to the spark-yarn assembly instead of the spark-core assembly when launching a YARN job.

@jey (Contributor Author) commented Aug 15, 2013

I'm happy to consider the profile-based approach, but please list the benefits, as they are not obvious to me.

@mateiz (Member) commented Aug 15, 2013

Yeah, the core problem is that the YARN code won't work unless you are linking against a YARN-enabled Hadoop, and we don't want to publish separate spark-core artifacts for every Hadoop version. YARN users will just have to add spark-yarn as an additional dependency until either Hadoop improves its packaging (e.g. adds some kind of yarn-client package that still works with old Hadoop versions) or YARN is widely enough deployed that we no longer need to run on non-YARN clusters (unlikely in the near future).
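
Concretely, the arrangement described above might look like this from a YARN user's build; the coordinates and version are placeholders, not published artifacts:

    // Hypothetical sbt dependencies: spark-core plus the extra spark-yarn module.
    libraryDependencies ++= Seq(
      "org.spark-project" %% "spark-core" % "0.8.0-SNAPSHOT",
      "org.spark-project" %% "spark-yarn" % "0.8.0-SNAPSHOT"
    )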

@jey (Contributor Author) commented Aug 15, 2013

I'm closing this PR and will submit a new one targeted against mesos/master instead of mesos/branch-0.8.

@jey closed this Aug 15, 2013
@jey mentioned this pull request Aug 15, 2013
@mridulm (Contributor) commented Aug 16, 2013

I am not commenting on including the yarn artifacts within the same spark-core jar, but on where the code is hosted within the Spark source tree. Hosting it as a top-level directory does not make much sense when it has no standalone value, since it is very closely tied to spark core (and specializes it in an implementation-dependent way).
Compare with the other top-level directories - bagel, ml, shark, doc, etc.: it makes sense for them to exist there right now, but not for a yarn directory.

Having two jars makes sense, since the expected functionality is different, and we can generate the different artifacts by driving the build with different profiles. Btw, we would not need to include the mesos jars when building the spark-yarn jar if it is getting separated out.

@mateiz (Member) commented Aug 16, 2013

The folder structure in SBT/Maven just reflects modules, not anything about standalone value. I don't see any reason to complicate the build file to do this. We've actually fought a lot with the profiles in Maven and SBT to try to do conditional builds, and it's really painful. This stuff will still get compiled, tested, etc. together.

xiajunluan pushed a commit to xiajunluan/spark that referenced this pull request May 30, 2014
Author: Patrick Wendell <[email protected]>

Closes mesos#803 from pwendell/mapr-support and squashes the following commits:

8df60e4 [Patrick Wendell] SPARK-1862: Support for MapR in the Maven build.