@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Nov 25, 2020

What changes were proposed in this pull request?

This PR mostly reverts SPARK-33212 (commit cb3fa6c), with three exceptions:

  1. SparkSubmitUtils was recently updated by SPARK-33580.
  2. resource-managers/yarn/pom.xml was recently updated by SPARK-33104 to add the hadoop-yarn-server-resourcemanager test dependency.
  3. The com.fasterxml.jackson.module:jackson-module-jaxb-annotations dependency in the K8s module, which was recently updated by SPARK-33471, is adjusted.

Why are the changes needed?

According to HADOOP-16080, hadoop-aws has not worked with hadoop-client-api since Apache Hadoop 3.1.1. Write operations fail as shown below.

1. Spark distribution with -Phadoop-cloud

$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+


scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V

2. Spark distribution without -Phadoop-cloud

$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
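The descriptor in the NoSuchMethodError above can be read mechanically. The following is a minimal Python sketch, not part of this PR, that decodes a JVM method descriptor into readable parameter types. It also illustrates one plausible reading of HADOOP-16080: the caller expects an unshaded Guava type, while a shaded client jar would expose the same constructor with a relocated type. The `relocate` helper and its `org.apache.hadoop.shaded.` prefix are assumptions made for illustration, not something this thread confirms.

```python
# Sketch: decode a JVM method descriptor such as "(Lcom/foo/Bar;IZ)V".
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def decode_type(desc, i):
    """Decode one field type starting at desc[i]; return (name, next_index)."""
    dims = 0
    while desc[i] == "[":          # each '[' adds one array dimension
        dims += 1
        i += 1
    if desc[i] == "L":             # object type: L<binary/name>;
        end = desc.index(";", i)
        name = desc[i + 1:end].replace("/", ".")
        i = end + 1
    else:                          # single-letter primitive
        name = PRIMITIVES[desc[i]]
        i += 1
    return name + "[]" * dims, i


def decode_method_descriptor(desc):
    """Return (parameter_types, return_type) for a method descriptor."""
    if not desc.startswith("("):
        raise ValueError("not a method descriptor: " + desc)
    close = desc.index(")")
    params = []
    i = 1
    while i < close:
        name, i = decode_type(desc, i)
        params.append(name)
    ret, _ = decode_type(desc, close + 1)
    return params, ret


def relocate(type_name, prefix="org.apache.hadoop.shaded."):
    # ASSUMPTION: hypothetical relocation rule for Guava classes in a
    # shaded client jar; the prefix is illustrative only.
    if type_name.startswith("com.google.common."):
        return prefix + type_name
    return type_name


# Descriptor copied from the error message in this PR description.
expected = "(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V"
params, ret = decode_method_descriptor(expected)
shaded = [relocate(p) for p in params]

print("caller expects :", ret, "<init>(" + ", ".join(params) + ")")
print("shaded variant :", ret, "<init>(" + ", ".join(shaded) + ")")
```

If the two printed signatures differ, the JVM treats them as different constructors, which is consistent with the failure appearing only when `S3AFileSystem.create` is reached on the write path.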

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CI.

@dongjoon-hyun
Member Author

cc @sunchao

@dongjoon-hyun
Member Author

This is to test feasibility as one of the options.

@HyukjinKwon
Member

I am good with reverting this first. I will look at SPARK-33104 separately. Presumably the tests will fail with Hadoop 2.

@HyukjinKwon
Member

@dongjoon-hyun do you mind fixing the PR title and description to contain SPARK-33104 and 10bd42c?

@sunchao
Member

sunchao commented Nov 26, 2020

Yes, I'm fine with reverting this first while we search for other solutions. Let's hope we can still ship this in the Spark 3.1 release.

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon and @sunchao .
This PR is still being tested to check the feasibility of the revert. It will wait until next Monday. :)

BTW, I'll update the PR title and description.

@dongjoon-hyun dongjoon-hyun changed the title from 'Revert "[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile"' to 'Revert SPARK-33212 and SPARK-33104 to recover hadoop-aws for Hadoop 3.x' on Nov 26, 2020
@dongjoon-hyun dongjoon-hyun changed the title from 'Revert SPARK-33212 and SPARK-33104 to recover hadoop-aws for Hadoop 3.x' to 'Revert "[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile"' on Nov 30, 2020
@dongjoon-hyun dongjoon-hyun marked this pull request as draft December 1, 2020 07:37
…ofile"

This reverts commit cb3fa6c.

(cherry picked from commit a7dc7f92a392328bcbc95800f09d467a89d18dfe)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun changed the title from 'Revert "[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile"' to '[WIP][SPARK-33618][CORE] Fix hadoop-aws to work' on Dec 1, 2020
@dongjoon-hyun
Member Author

Hi, All.
To investigate this more during the Apache Spark 3.1 QA timeframe, I filed a new JIRA.
We have a few approaches, including this one and #30556.

@dongjoon-hyun dongjoon-hyun changed the title from '[WIP][SPARK-33618][CORE] Fix hadoop-aws to work' to '[SPARK-33618][CORE] Fix hadoop-aws to work' on Dec 2, 2020
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review December 2, 2020 02:56
@dongjoon-hyun dongjoon-hyun changed the title from '[SPARK-33618][CORE] Fix hadoop-aws to work' to '[SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work' on Dec 2, 2020
@SparkQA

SparkQA commented Dec 2, 2020

Test build #132008 has finished for PR 30508 at commit 806aa85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2020

Test build #132012 has finished for PR 30508 at commit 8bbde84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 2, 2020

Hi, @HyukjinKwon .
Could you review this PR, please? I will reopen SPARK-33212 after merging this PR.
This will recover hadoop-aws functionality in Apache Spark 3.1.

@dongjoon-hyun
Member Author

Also, cc @viirya , @dbtsai , @sunchao , @srowen , @AngersZhuuuu , @mridulm , @tgravescs .

Member

@viirya viirya left a comment

Looks okay as I compared with SPARK-33212 (cb3fa6c).

Member

@HyukjinKwon HyukjinKwon left a comment

I already reviewed this actually. Was wondering which one you guys prefer. LGTM

@HyukjinKwon
Member

Merged to master.

hadoop-annotations/3.2.0//hadoop-annotations-3.2.0.jar
hadoop-auth/3.2.0//hadoop-auth-3.2.0.jar
hadoop-client/3.2.0//hadoop-client-3.2.0.jar
hadoop-common/3.2.0//hadoop-common-3.2.0.jar
Member

This pulls a lot more code into Spark now, like the whole client. Hm, is this going to affect the hadoop-provided distro? It also downgrades some versions above, which may be harmless. Do we really need this just for hadoop-aws?

Member

@HyukjinKwon HyukjinKwon Dec 2, 2020

Oh, @srowen, this is basically a revert. An issue was found with shading the Hadoop client, so it was reverted here as the safe choice. A proper fix is in progress.

Member

Ah OK, nevermind. I am not following closely.
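
The concern raised in this thread (more artifacts pulled in, some versions changed) can be checked by diffing the dependency manifest before and after the change. The following is a rough Python sketch, assuming each manifest line has the form `artifact/version/classifier/jar` as in the lines quoted above. The field names, the helper functions, and the "before" sample are illustrative assumptions, not taken from the actual diff.

```python
from typing import Dict, List, NamedTuple, Tuple


class Dep(NamedTuple):
    artifact: str
    version: str
    classifier: str   # empty when the jar has no classifier
    jar: str


Key = Tuple[str, str]  # (artifact, classifier)


def parse_manifest(text: str) -> Dict[Key, Dep]:
    """Parse 'artifact/version/classifier/jar' lines into a dict."""
    deps: Dict[Key, Dep] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        artifact, version, classifier, jar = line.split("/")
        deps[(artifact, classifier)] = Dep(artifact, version, classifier, jar)
    return deps


def diff_manifests(before: Dict[Key, Dep], after: Dict[Key, Dep]):
    """Return (added, removed, changed) between two parsed manifests."""
    added: List[Key] = sorted(set(after) - set(before))
    removed: List[Key] = sorted(set(before) - set(after))
    changed = sorted(
        (key, before[key].version, after[key].version)
        for key in set(before) & set(after)
        if before[key].version != after[key].version
    )
    return added, removed, changed


# ASSUMPTION: illustrative "before" entries for a shaded-client build.
before_text = """
hadoop-client-api/3.2.0//hadoop-client-api-3.2.0.jar
hadoop-client-runtime/3.2.0//hadoop-client-runtime-3.2.0.jar
"""

# "After" entries copied from the manifest lines quoted in this thread.
after_text = """
hadoop-annotations/3.2.0//hadoop-annotations-3.2.0.jar
hadoop-auth/3.2.0//hadoop-auth-3.2.0.jar
hadoop-client/3.2.0//hadoop-client-3.2.0.jar
hadoop-common/3.2.0//hadoop-common-3.2.0.jar
"""

added, removed, changed = diff_manifests(
    parse_manifest(before_text), parse_manifest(after_text)
)
print("added:", [name for name, _ in added])
print("removed:", [name for name, _ in removed])
print("version changes:", changed)
```

Keying on `(artifact, classifier)` rather than the artifact name alone keeps jars that differ only by classifier from overwriting each other.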

@dongjoon-hyun
Member Author

Thank you, @viirya , @HyukjinKwon and @srowen !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-33212-REVERT branch December 2, 2020 16:32
kerb-crypto/1.0.1//kerb-crypto-1.0.1.jar
kerb-identity/1.0.1//kerb-identity-1.0.1.jar
kerb-server/1.0.1//kerb-server-1.0.1.jar
kerb-simplekdc/1.0.1//kerb-simplekdc-1.0.1.jar
Member

Just out of curiosity, does Spark have a chance to play the role of a KDC at runtime?

Member

looks like the original PR does not handle any transitive artifact exclusion at all 😸


7 participants