
Conversation

@steveloughran (Contributor) commented Aug 26, 2016

What changes were proposed in this pull request?

Increment the hadoop.version value in the hadoop-2.7 profile from 2.7.2 to 2.7.3.

This switches to the latest release in the 2.7.x line, picking up bug fixes while keeping compatibility with Java 7.
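For reference, a minimal sketch of what the change amounts to in the root pom.xml, assuming the usual layout of Spark's hadoop-2.7 profile (other properties the profile carries are omitted here):

    <!-- Sketch only; Spark's actual hadoop-2.7 profile defines more than this. -->
    <profile>
      <id>hadoop-2.7</id>
      <properties>
        <!-- previously 2.7.2; this PR bumps it to the latest 2.7.x release -->
        <hadoop.version>2.7.3</hadoop.version>
      </properties>
    </profile>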

How was this patch tested?

Spark unit tests, plus system tests performed on the version of Spark built with this profile enabled.

…is is to see if the jenkins builds pick this up; I'm not proposing it as part of the final patch
@steveloughran (Contributor, Author) commented:

This patch tries to set the default version to 2.7; I'll see if SBT picks it up.

This is not something I'm proposing for the final merge; there I expect people to still go -Phadoop-2.7. What I'm trying to do is get SBT to run all its tests against Hadoop 2.7.3, so that Jenkins can assess the validity of the proposal.

@srowen (Member) commented Aug 26, 2016

I'd rather jump straight to the question: is there much value in separately supporting Hadoop 2.2 -> 2.5? And then, if we're on 2.6+, is there even any difference with 2.7 that requires a separate profile? These Hadoop profiles are an annoyance, but they were needed when Hadoop 1.x was in the picture. They're barely needed now.

@SparkQA commented Aug 26, 2016

Test build #64464 has finished for PR 14827 at commit 515b9ce.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Review comment on pom.xml (the hadoop-2.7 profile):

    <profile>
      <id>hadoop-2.7</id>
      <activation>
(Member) commented:

IIRC as soon as you set any profile at all, the active by default ones are disabled. This could be problematic.
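To illustrate the concern, a sketch of the pattern under discussion (not necessarily the exact diff): Maven drops activeByDefault profiles as soon as any other profile in the same POM is enabled explicitly, e.g. with -P, so a default-active hadoop-2.7 profile only helps builds that name no profile at all.

    <profile>
      <id>hadoop-2.7</id>
      <activation>
        <!-- ignored the moment any other profile is activated on the command line -->
        <activeByDefault>true</activeByDefault>
      </activation>
    </profile>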

(Contributor, Author) replied:

How can I set this profile for an SBT build/test run? I just want to have SBT doing it, as otherwise the patch is unverified by the machinery, and nobody wants that, do they?

@steveloughran (Contributor, Author) commented:

Sean, the reason for a 2.7 profile is more significant with SPARK-7481 and cloud support, as it can explicitly pull in hadoop-azure (2.7+ only) and hadoop-aws (2.6+ only).
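A hedged sketch of what that could look like inside the hadoop-2.7 profile; the exact coordinates, versions, and any exclusions in the SPARK-7481 patch may differ:

    <profile>
      <id>hadoop-2.7</id>
      <dependencies>
        <!-- S3A support; the artifact exists from Hadoop 2.6 onwards -->
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-aws</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
        <!-- WASB support; the artifact exists from Hadoop 2.7 onwards -->
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-azure</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
      </dependencies>
    </profile>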

@srowen (Member) commented Sep 2, 2016

OK that makes sense as a reason to have 2.7 vs <2.6. We already have a profile for 2.7 anyway. I don't know if it will help to make it active by default here given how profile activation works.

Bumping to 2.7.3 is fine. But do you perhaps mean to suggest we bump the default Hadoop profile up? To 2.6, or even 2.7? That would really be a change to how the release is built, and how the PR builders run.

If we're doing that, it's worth asking: what's the cost/benefit of supporting <2.6 anyway? I think all the major distros have been on at least 2.6 for about two years. EMR is on 2.7. CDH is the laggard, if anything, being on 2.6 plus a large number of patches towards 2.7.

It would let us undo a mild bit of reflection hackery in the code and more freely use Hadoop APIs. We'd get rid of loads of build profiles too, hey.

I would not mind hijacking your issue and turning it into this question instead.

Whether it makes sense to start pulling in hadoop-aws etc is a different question.

@steveloughran (Contributor, Author) commented Sep 15, 2016

I don't know what the default Hadoop version should be; that's the kind of thing to discuss on the mailing lists.

Personally, I'd rush to make 2.6 the bare minimum version; nobody should be using anything below it, especially given that the JVM requirements mean you can't easily go below that. (Twitter are still using 2.6 and leading the 2.6.5 release, BTW; they are the main 2.6 user that I know of.)

One thing that would be good would be for Jenkins to test on a later profile alongside the bare minimum version considered supportable. Testing with the old version ensures that you don't accidentally code against later APIs; testing with the newer version ensures that any modules built only for the later versions work, and catches regressions in Hadoop itself.

  1. I don't know what Hadoop APIs MapR codes against.
  2. Yeah, reflection is bad. It makes it hard to identify when methods are being used and when things change.

Regarding pulling in hadoop-aws &c, the WiP patch pulls things in automatically in the 2.7 profile. I could add a cloud option which would only build the module if set, and only then include the JARs in the spark-assembly. I had had the module pull in the Hadoop cloud JARs but not any of their dependent JARs; this would keep the spark-assembly JAR small, but on Hadoop < 2.7.3 it would cause problems at service load.

Anyway, how about this: you start the discussion on Hadoop versions, this profile goes in, and I make the Spark cloud module a specific profile which only compiles/runs if the Hadoop version is 2.7+. (You'd need to set both; you already need -Phive for the dataframe-on-cloud tests anyway.)
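A rough sketch of the kind of opt-in profile being proposed; the profile and module names here are placeholders for illustration, not what any final patch uses:

    <!-- built only when explicitly requested, e.g. alongside -Phadoop-2.7 -->
    <profile>
      <id>cloud</id>
      <modules>
        <module>cloud</module>
      </modules>
    </profile>

It would then be enabled with something like -Phadoop-2.7 -Pcloud, mirroring the way -Phive is already required for the dataframe-on-cloud tests.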

@steveloughran changed the title from "[SPARK-17259] [build] [WiP] Hadoop 2.7 profile to depend on Hadoop 2.7.3" to "[SPARK-17259] [build] Hadoop 2.7 profile to depend on Hadoop 2.7.3" on Sep 16, 2016
@srowen (Member) commented Sep 19, 2016

Go ahead and close this one but I think you deserve 'credit' for the JIRA change, if that makes any difference.
