Member

@sunchao sunchao commented Jan 15, 2021

What changes were proposed in this pull request?

  1. Add back the Maven enforcer check for duplicate dependencies.
  2. Apply a stricter check in IsolatedClientLoader on which Hadoop versions support the shaded client. To compare versions properly, this adds a utility function, majorMinorPatchVersion, that extracts the major, minor, and patch versions from a version string.
  3. Clean up unnecessary code.

Why are the changes needed?

The Maven enforcer was removed as part of #30556. This proposes to add it back.

Also, the Hadoop shaded client doesn't work in certain cases (see these comments on #30701 for details). This PR strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support for the shaded client, and otherwise falls back to the old unshaded clients.
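
For illustration only, here is a minimal sketch of the strict check with fallback described above. The object and method names are hypothetical, the version parsing is deliberately simplistic, and the artifact coordinates (hadoop-client-api and hadoop-client-runtime for the shaded client, hadoop-client for the unshaded one) are an assumption about which jars get selected, not a copy of the code in IsolatedClientLoader.

```scala
import scala.util.Try

// Hypothetical sketch, not the code from this PR.
object HadoopClientArtifactsSketch {
  // Strict rule: only Hadoop 3.2.x with patch >= 2 counts as supporting the shaded client.
  // Simplistic parsing: split on '.' or '-', keep the first three parts, parse each as Int.
  def supportsShadedClient(hadoopVersion: String): Boolean =
    hadoopVersion.split("[.-]").take(3).map(part => Try(part.toInt).toOption) match {
      case Array(Some(3), Some(2), Some(patch)) => patch >= 2
      case _ => false
    }

  // Assumed Maven coordinates: shaded client jars when supported, unshaded client otherwise.
  def hadoopClientArtifacts(hadoopVersion: String): Seq[String] =
    if (supportsShadedClient(hadoopVersion)) {
      Seq(
        s"org.apache.hadoop:hadoop-client-api:$hadoopVersion",
        s"org.apache.hadoop:hadoop-client-runtime:$hadoopVersion")
    } else {
      Seq(s"org.apache.hadoop:hadoop-client:$hadoopVersion")
    }
}
```

With this sketch, a version such as 3.3.0 or 2.7.4 falls back to the single unshaded hadoop-client artifact, while 3.2.2 resolves to the two shaded ones.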

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Contributor

@xkrogen xkrogen left a comment

Thanks for the fast follow-on Chao!

SparkQA commented Jan 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38713/

SparkQA commented Jan 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38713/

SparkQA commented Jan 16, 2021

Test build #134130 has finished for PR 31203 at commit 062b7bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title from "[SPARK-33212][FOLLOW-UP][BUILD] Bring back duplicate dependency check and add more strict Hadoop version check" to "[SPARK-33212][FOLLOW-UP][test-maven][test-hadoop2.7] Bring back duplicate dependency check and add more strict Hadoop version check" on Jan 16, 2021

SparkQA commented Jan 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38729/

SparkQA commented Jan 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38729/

SparkQA commented Jan 17, 2021

Test build #134146 has finished for PR 31203 at commit b5d82b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title from "[SPARK-33212][FOLLOW-UP][test-maven][test-hadoop2.7] Bring back duplicate dependency check and add more strict Hadoop version check" to "[SPARK-33212][FOLLOW-UP][BUILD] Bring back duplicate dependency check and add more strict Hadoop version check" on Jan 22, 2021
@sunchao sunchao force-pushed the SPARK-33212-followup branch from b5d82b7 to 8650a73 Compare January 22, 2021 20:15
Member Author

This is probably out of the scope of this PR. I'll open a new one if we agree this is the right thing to do.

Contributor

Seems like a good improvement to me; that table is pretty unsightly as-is.

@sunchao sunchao force-pushed the SPARK-33212-followup branch from 8650a73 to 6657c87 Compare January 22, 2021 20:18
Contributor

@xkrogen xkrogen left a comment

I wonder if this should go into VersionUtils? That does claim to be specifically for working with Spark version strings, but it seems relevant...

def supportHadoopShadedClient(hadoopVersion: String): Boolean = {
  getVersionParts(hadoopVersion).exists {
    case (3, 2, v) if v >= 2 => true
    case _ => false
  }
}
Contributor

maybe

case (maj, _, _) if maj > 3 => true
case (3, min, _) if min > 2 => true
case (3, 2, patch) if patch >= 2 => true

Seems like we can reasonably assume that future versions of Hadoop will support the shaded client?

Member Author

I think we'd better wait until the future versions come out before changing this (so that we can verify first). For instance, Hadoop 3.3.0 currently doesn't support the shaded client (due to the hadoop-aws issue). But yes, Hadoop 3.2.2+ should support the shaded client, assuming there's no regression.

Contributor

@xkrogen xkrogen Jan 22, 2021

Interesting... I just worry that we'll forget to update this if/when we bump the Hadoop version in the future, causing a regression. Is the hadoop-aws fix targeted for Hadoop 3.3.1? If so, can we reasonably assume that 3.2.2+, 3.3.1+, and 3.4.0+ will have it?

It seems you're more tied into what's happening in the Hadoop world than I am these days so I'll take your word in either direction. If we decide not to future-proof it, can we create a follow-up JIRA to revisit it once some future release is out at which time we would be confident in putting a wildcard?

Member Author

I think your concern is valid. One thing we could do is add a test to make sure that the built-in Hadoop version is always compatible with the shaded client, so that if we upgrade the Hadoop version in the future and forget to update this check, the test will break.

Member Author

And yes we can assume that 3.2.2+, 3.3.1+ and 3.4.0+ will all have the fix.
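
For reference, the future-proofed variant discussed in this thread could look roughly like the sketch below. It is hypothetical and is not what this PR merges (the PR keeps the strict 3.2.2+ rule). The object and method names are made up, the input is assumed to be already parsed into (major, minor, patch), and treating every major version above 3 as supported is an assumption taken from the suggestion above rather than something that has been verified.

```scala
// Hypothetical sketch of a future-proofed check, not the code from this PR.
object FutureProofShadedClientSketch {
  /** True for Hadoop 3.2.2+, 3.3.1+, 3.4.0+ and, by assumption, any later major version. */
  def supportsShadedClient(parts: (Int, Int, Int)): Boolean = parts match {
    case (major, _, _) if major > 3 => true   // assumption: later majors keep support
    case (3, minor, _) if minor >= 4 => true  // 3.4.0+
    case (3, 3, patch) if patch >= 1 => true  // 3.3.1+, since 3.3.0 has the hadoop-aws issue
    case (3, 2, patch) if patch >= 2 => true  // 3.2.2+
    case _ => false
  }
}
```

Unlike the suggested `case (3, min, _) if min > 2`, this version keeps 3.3.0 excluded, which matches the point about the hadoop-aws issue.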

Contributor

Excellent idea on adding a compatibility test for the built-in Hadoop version!

Member Author

sunchao commented Jan 22, 2021

> I wonder if this should go into VersionUtils? That does claim to be specifically for working with Spark version strings, but it seems relevant...

Oops I wasn't even aware of VersionUtils. Yes I agree it seems a better place to put this code. Let me update the PR later.

SparkQA commented Jan 22, 2021

Test build #134382 has finished for PR 31203 at commit 6657c87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- move version parsing to `VersionUtils`
- add compatibility test
- use `Option` for null
@github-actions github-actions bot added the CORE label Jan 23, 2021

SparkQA commented Jan 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38991/

SparkQA commented Jan 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38991/

SparkQA commented Jan 24, 2021

Test build #134405 has finished for PR 31203 at commit 9b8a911.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Retrieves the major, minor and patch parts from the input `version`. Returns `None` if the
* input is not of a valid format.
Contributor

I think we should mention that minor/patch versions are filled in as 0 if they're not found. This is different from the behavior of other methods in this class (e.g. majorMinor will give an error if minor is not present)

Member Author

Yeah good point.
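
To make the zero-fill behavior concrete, here is a minimal sketch of a parser with the semantics described in this thread: `None` for invalid input, and a missing minor or patch part filled in as 0. It is not the implementation added to VersionUtils; the object name, method name, and regular expression are hypothetical, and capping each part at nine digits is a choice made here so that `toInt` cannot overflow.

```scala
// Hypothetical sketch, not the VersionUtils implementation from this PR.
object VersionParsingSketch {
  // Accepts "major", "major.minor" or "major.minor.patch", optionally followed by a
  // suffix that starts with '.' or '-' (for example "-SNAPSHOT"). Each numeric part is
  // limited to 9 digits so that it always fits in an Int.
  private val versionPattern =
    """^(\d{1,9})(?:\.(\d{1,9})(?:\.(\d{1,9}))?)?(?:[.-].*)?$""".r

  /** Returns (major, minor, patch), filling a missing minor or patch with 0. */
  def majorMinorPatch(version: String): Option[(Int, Int, Int)] = version match {
    case versionPattern(major, minor, patch) =>
      // Optional groups that did not participate in the match are bound to null.
      Some((major.toInt, Option(minor).fold(0)(_.toInt), Option(patch).fold(0)(_.toInt)))
    case _ => None
  }
}
```

For example, `VersionParsingSketch.majorMinorPatch("3.2")` yields `Some((3, 2, 0))`, whereas a strict major/minor parser would be expected to reject an input such as "3".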

@sunchao sunchao marked this pull request as ready for review January 26, 2021 10:43

SparkQA commented Jan 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39085/

SparkQA commented Jan 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39085/

SparkQA commented Jan 26, 2021

Test build #134499 has finished for PR 31203 at commit 396130b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

xkrogen commented Jan 26, 2021

Changes LGTM thanks @sunchao !

Member Author

sunchao commented Jan 26, 2021

Thanks @xkrogen for the thorough review! @dongjoon-hyun @viirya could you take a look?

def supportsHadoopShadedClient(hadoopVersion: String): Boolean = {
  VersionUtils.majorMinorPatchVersion(hadoopVersion).exists {
    case (3, 2, v) if v >= 2 => true
    case _ => false
  }
}
Member

Got it. I agree that we need to change this again.

And yes we can assume that 3.2.2+, 3.3.1+ and 3.4.0+ will all have the fix.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (with only one minor comment about comment typo)

Comment on lines -126 to -130
val extraExclusions = if (hadoopVersion.startsWith("3")) {
  // this introduced from lower version of Hive could conflict with jars in Hadoop 3.2+, so
  // exclude here in favor of the ones in Hadoop 3.2+
  Seq("org.apache.hadoop:hadoop-auth")
} else {
Member

No need to exclude hadoop-auth anymore?

Member Author

Yea it's no longer needed. Please see here and here for the context.

Member

@viirya viirya left a comment

looks good with one question.

@dongjoon-hyun
Member

Since the last question is addressed, I'll merge this. Thanks!

HyukjinKwon pushed a commit that referenced this pull request Jan 29, 2021
…uld support shaded client" for hadoop-2.7

### What changes were proposed in this pull request?
We added test "built-in Hadoop version should support shaded client" in #31203, but it fails when profile hadoop-2.7 is activated. This change fixes the test by skipping the assertion when Hadoop version is 2.

### Why are the changes needed?
The test fails in master branch when profile hadoop-2.7 is activated.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran the test with hadoop-2.7 profile.

Closes #31391 from bozhang2820/fix-hadoop-2-version-test.

Authored-by: Bo Zhang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
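
A rough sketch of the skip logic described in this commit message, under stated assumptions: the real test reads the built-in Hadoop version from the build, whereas this hypothetical helper takes already-parsed (major, minor, patch) parts, and all names here are made up. It returns `None` when the check is skipped (Hadoop 2) and `Some(result)` otherwise.

```scala
// Hypothetical sketch, not the test code from this commit.
object BuiltInHadoopCheckSketch {
  // Same strict rule as in #31203: only Hadoop 3.2.x with patch >= 2 is supported.
  def supportsShadedClient(parts: (Int, Int, Int)): Boolean = parts match {
    case (3, 2, patch) if patch >= 2 => true
    case _ => false
  }

  /** None means "skipped" (Hadoop 2 build); Some(ok) carries the check result. */
  def checkBuiltInHadoop(parts: (Int, Int, Int)): Option[Boolean] = parts match {
    case (2, _, _) => None
    case other => Some(supportsShadedClient(other))
  }
}
```

A test could then run `BuiltInHadoopCheckSketch.checkBuiltInHadoop(parts).foreach(ok => assert(ok))`, which asserts nothing when the hadoop-2.7 profile is active.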
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
… and add more strict Hadoop version check

### What changes were proposed in this pull request?

1. Add back Maven enforcer for duplicate dependencies check
2. More strict check on Hadoop versions which support shaded client in `IsolatedClientLoader`. To do proper version check, this adds a util function `majorMinorPatchVersion` to extract major/minor/patch version from a string.
3. Cleanup unnecessary code

### Why are the changes needed?

The Maven enforcer was removed as part of apache#30556. This proposes to add it back.

Also, Hadoop shaded client doesn't work in certain cases (see [these comments](apache#30701 (comment)) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support of shaded client or otherwise fallback to old unshaded ones.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#31203 from sunchao/SPARK-33212-followup.

Lead-authored-by: Chao Sun <[email protected]>
Co-authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…uld support shaded client" for hadoop-2.7

pan3793 pushed a commit to pan3793/spark that referenced this pull request Aug 30, 2021
… and add more strict Hadoop version check

pan3793 pushed a commit to pan3793/spark that referenced this pull request Aug 30, 2021
…uld support shaded client" for hadoop-2.7
