[SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5 #39124

LuciferYang · 2022-12-19T07:57:36Z

What changes were proposed in this pull request?

This PR aims to upgrade Hadoop from 3.3.4 to 3.3.5.

Why are the changes needed?

Hadoop 3.3.5 brings many bug fixes as well as CVE fixes, such as:

HADOOP-18333 hadoop-client-runtime impact by CVE-2022-2047 CVE-2022-2048 due to shaded jetty
HADOOP-18468: upgrade jettison json jar due to fix CVE-2022-40149
HADOOP-18493 update jackson-databind 2.12.7.1 due to CVE fixes
HADOOP-18497 Upgrade commons-text version to fix CVE-2022-42889
HADOOP-18484 upgrade hsqldb to v2.7.1 due to CVE
HADOOP-18561 CVE-2021-37533 on commons-net is included in hadoop common and hadoop-client-runtime
HADOOP-18587 upgrade to jettison 1.5.3 to fix CVE-2022-40150

At the same time, this version brings a high-performance vectored read API (HADOOP-18103), which may be used by future versions of ORC and Parquet to improve read performance.

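The release notes and change log are as follows:

https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html
https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html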

Does this PR introduce any user-facing change?

Yes, jaxb-api-2.2.11.jar is no longer in spark-deps-hadoop-3-hive-2.3 due to HADOOP-18641

How was this patch tested?

Pass GitHub Actions

LuciferYang · 2022-12-20T03:33:38Z

Many tests failed with errors similar to the following:

2022-12-20T03:15:37.0609530Z [info] org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite *** ABORTED *** (28 milliseconds)
2022-12-20T03:15:37.0701184Z [info]   java.lang.reflect.InvocationTargetException:
2022-12-20T03:15:37.0701846Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0702983Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0703732Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0704398Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0705400Z [info]   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
2022-12-20T03:15:37.0706077Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:514)
2022-12-20T03:15:37.0706751Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:374)
2022-12-20T03:15:37.0707378Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.$anonfun$client$1(TestHive.scala:90)
2022-12-20T03:15:37.0707917Z [info]   at scala.Option.getOrElse(Option.scala:189)
2022-12-20T03:15:37.0708804Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client$lzycompute(TestHive.scala:90)
2022-12-20T03:15:37.0709589Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client(TestHive.scala:88)
2022-12-20T03:15:37.0710320Z [info]   at org.apache.spark.sql.hive.test.TestHiveSingleton.$init$(TestHiveSingleton.scala:33)
2022-12-20T03:15:37.0711253Z [info]   at org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite.<init>(AlterTableAddColumnsSuite.scala:27)
2022-12-20T03:15:37.0712160Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0712844Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0713829Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0714480Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0714972Z [info]   at java.lang.Class.newInstance(Class.java:442)
2022-12-20T03:15:37.0715625Z [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:454)
2022-12-20T03:15:37.0716141Z [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2022-12-20T03:15:37.0716638Z [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2022-12-20T03:15:37.0717222Z [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-12-20T03:15:37.0718079Z [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-12-20T03:15:37.0718637Z [info]   at java.lang.Thread.run(Thread.java:750)
2022-12-20T03:15:37.0719260Z [info]   Cause: java.lang.RuntimeException: Failed to initialize default Hive configuration variables!
2022-12-20T03:15:37.0719939Z [info]   at org.apache.hadoop.hive.conf.HiveConf.getConfVarInputStream(HiveConf.java:3638)
2022-12-20T03:15:37.0720558Z [info]   at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:4057)
2022-12-20T03:15:37.0721115Z [info]   at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:4014)
2022-12-20T03:15:37.0721873Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl$.newHiveConf(HiveClientImpl.scala:1309)
2022-12-20T03:15:37.0722615Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:176)
2022-12-20T03:15:37.0723562Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:141)
2022-12-20T03:15:37.0724265Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0725154Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0815583Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0816308Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0817005Z [info]   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
2022-12-20T03:15:37.0817691Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:514)
2022-12-20T03:15:37.0818294Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:374)
2022-12-20T03:15:37.0818947Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.$anonfun$client$1(TestHive.scala:90)
2022-12-20T03:15:37.0819658Z [info]   at scala.Option.getOrElse(Option.scala:189)
2022-12-20T03:15:37.0820254Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client$lzycompute(TestHive.scala:90)
2022-12-20T03:15:37.0820931Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client(TestHive.scala:88)
2022-12-20T03:15:37.0821578Z [info]   at org.apache.spark.sql.hive.test.TestHiveSingleton.$init$(TestHiveSingleton.scala:33)
2022-12-20T03:15:37.0822321Z [info]   at org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite.<init>(AlterTableAddColumnsSuite.scala:27)
2022-12-20T03:15:37.0823043Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0823728Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0824474Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0825300Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0825805Z [info]   at java.lang.Class.newInstance(Class.java:442)
2022-12-20T03:15:37.0826341Z [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:454)
2022-12-20T03:15:37.0826959Z [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2022-12-20T03:15:37.0827461Z [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2022-12-20T03:15:37.0832346Z [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-12-20T03:15:37.0838605Z [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-12-20T03:15:37.0844439Z [info]   at java.lang.Thread.run(Thread.java:750)
2022-12-20T03:15:37.0851150Z [info]   Cause: java.lang.IllegalArgumentException: Not supported: http://javax.xml.XMLConstants/property/accessExternalDTD
2022-12-20T03:15:37.0857679Z [info]   at org.apache.xalan.processor.TransformerFactoryImpl.setAttribute(TransformerFactoryImpl.java:571)
2022-12-20T03:15:37.0863755Z [info]   at org.apache.hadoop.util.XMLUtils.newSecureTransformerFactory(XMLUtils.java:141)
2022-12-20T03:15:37.0869737Z [info]   at org.apache.hadoop.conf.Configuration.writeXml(Configuration.java:3584)
2022-12-20T03:15:37.0875703Z [info]   at org.apache.hadoop.conf.Configuration.writeXml(Configuration.java:3550)
2022-12-20T03:15:37.0881683Z [info]   at org.apache.hadoop.conf.Configuration.writeXml(Configuration.java:3546)
2022-12-20T03:15:37.0887575Z [info]   at org.apache.hadoop.hive.conf.HiveConf.getConfVarInputStream(HiveConf.java:3634)
2022-12-20T03:15:37.0893660Z [info]   at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:4057)
2022-12-20T03:15:37.0898428Z [info]   at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:4014)
2022-12-20T03:15:37.0904308Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl$.newHiveConf(HiveClientImpl.scala:1309)
2022-12-20T03:15:37.0910423Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:176)
2022-12-20T03:15:37.0916293Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:141)
2022-12-20T03:15:37.0921497Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0927701Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0932171Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0938174Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0943319Z [info]   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
2022-12-20T03:15:37.0992641Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:514)
2022-12-20T03:15:37.1065786Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:374)
2022-12-20T03:15:37.1066478Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.$anonfun$client$1(TestHive.scala:90)
2022-12-20T03:15:37.1067041Z [info]   at scala.Option.getOrElse(Option.scala:189)
2022-12-20T03:15:37.1067646Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client$lzycompute(TestHive.scala:90)
2022-12-20T03:15:37.1068489Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client(TestHive.scala:88)
2022-12-20T03:15:37.1069148Z [info]   at org.apache.spark.sql.hive.test.TestHiveSingleton.$init$(TestHiveSingleton.scala:33)
2022-12-20T03:15:37.1069906Z [info]   at org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite.<init>(AlterTableAddColumnsSuite.scala:27)
2022-12-20T03:15:37.1070634Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.1071314Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.1072059Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.1072709Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.1073209Z [info]   at java.lang.Class.newInstance(Class.java:442)
2022-12-20T03:15:37.1073822Z [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:454)
2022-12-20T03:15:37.1074354Z [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2022-12-20T03:15:37.1074847Z [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2022-12-20T03:15:37.1075432Z [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-12-20T03:15:37.1076054Z [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-12-20T03:15:37.1076558Z [info]   at java.lang.Thread.run(Thread.java:750)

The tests failed because the Hive conf failed to initialize after upgrading to Hadoop 3.3.5. It seems that Spark needs to wait for Hive to support Hadoop 3.3.5 first?

cc @sunchao @dongjoon-hyun FYI

LuciferYang · 2022-12-20T03:48:42Z

also cc @wangyum

sunchao · 2022-12-20T18:56:49Z

cc @steveloughran any idea what could have caused the above error?

LuciferYang · 2022-12-21T08:05:51Z

Maybe due to https://github.com/apache/hadoop/pull/4940/files? Some XML parser features are disabled, possibly to fix CVE-2022-34169?

LuciferYang · 2022-12-21T08:37:14Z

hmm..., maybe there is some conflict. The attribute ACCESS_EXTERNAL_DTD is not recognized by TransformerFactory

https://github.com/apache/hadoop/blob/5f08e51b72330b2dd2405896b39179a64a3a7efe/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/XMLUtils.java#L141

LuciferYang · 2022-12-21T09:40:51Z

@steveloughran Do you know the correct class type that XMLUtils.newSecureTransformerFactory should return? I want to try to configure javax.xml.transform.TransformerFactory.

LuciferYang · 2022-12-21T09:45:42Z

efec8ce merges with master; after that, org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite passes locally. Let us retry with GA.

steveloughran · 2022-12-21T11:24:51Z

this should have been fixed by "HADOOP-18575. Make XML transformer factory more lenient (#5224)." which is in the "real" rc0 I'm going to put up...that little one we did last week was really an attempt at debugging the process of getting a release built where the x86 code is done on an EC2 VM, arm64 on my laptop, making sure only the x86 artifacts are the ones we publish as staging, rename/resign the arm stuff etc (that bit still needs automation in https://github.com/steveloughran/validate-hadoop-client-artifacts ...)

doing the rc0 release process today

thank you for doing this branch! I'd verified the compile was good, but hadn't run the tests.
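The "more lenient" behaviour mentioned for HADOOP-18575 can be illustrated with a minimal sketch. This is not the Hadoop implementation: the class `LenientTransformerFactory` and the helper `trySetAttribute` are hypothetical names, and the sketch assumes only the JDK's JAXP API. The idea is to request the secure settings but tolerate a `TransformerFactory` on the classpath (such as the older Xalan in the stack trace above) that rejects an attribute with `IllegalArgumentException`.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

public class LenientTransformerFactory {

    /**
     * Hypothetical helper: tries to set an attribute and reports whether the
     * factory accepted it, instead of letting IllegalArgumentException escape.
     */
    static boolean trySetAttribute(TransformerFactory factory, String name, Object value) {
        try {
            factory.setAttribute(name, value);
            return true;
        } catch (IllegalArgumentException e) {
            // Implementations that do not know the attribute reject it this way,
            // e.g. "Not supported: http://javax.xml.XMLConstants/property/accessExternalDTD".
            return false;
        }
    }

    /** Sketch of a factory method that locks down what it can and skips the rest. */
    static TransformerFactory newSecureTransformerFactory() {
        TransformerFactory factory = TransformerFactory.newInstance();
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Secure processing is not supported", e);
        }
        // Empty string means "no external access allowed" for these JAXP properties.
        trySetAttribute(factory, XMLConstants.ACCESS_EXTERNAL_DTD, "");
        trySetAttribute(factory, XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        return factory;
    }

    public static void main(String[] args) {
        TransformerFactory factory = newSecureTransformerFactory();
        System.out.println("Created " + factory.getClass().getName());
    }
}
```

Whether silently skipping an unsupported attribute is acceptable is a design trade-off: it keeps applications with an old XSLT implementation on the classpath working, at the cost of weaker guarantees for that factory.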

steveloughran · 2022-12-21T11:29:45Z

oh, and the change isn't related to that xalan cve -more that we wanted to put all xml parser/xsl transformer creation into one place and lock them down so as to avoid any risk of some instances being created without secure settings (HADOOP-18469 Add XMLUtils methods to centralise code that creates secure XML parsers)

ironically, sonatype security scans are already warning on hadoop versions without the change...if we hadn't done the lockdown it wouldn't be complaining. Makes you want to not bother, doesn't it?

LuciferYang · 2022-12-21T11:38:32Z

Re-triggering GA showed that the Hadoop 3.3.5 dependencies could not be downloaded. Let's wait until they can be downloaded again and then re-analyze the test failures.

steveloughran · 2022-12-22T13:59:18Z

the real rc0 is up. announcement below. I suspect it will be the transitive jar updates and other lockdown options which create issues...we had to downgrade jackson for tez in HADOOP-18332, then there's jetty. left that alone.

i'd like to see if i can get apache/hadoop#4996 ready for an rc1 so we can cut protobuf 2.5 (which was removed, then reinstated as a dependency). once cut only those apps which need it can add it themselves.

arm binaries too. I'm also wondering if we should do a lean build without the fat shaded aws sdk. we need that for classpath reasons, it's just so huge as it contains everything, even though nobody is trying to control aws satellite groundstations from big data apps. (analysis, yes. but control. nope, yet it's in com.amazonaws.services.groundstation hence the eternal bloat). cut that jar and the distro is half the size

From: Steve Loughran
Date: Wed, 21 Dec 2022 at 19:28
Subject: [VOTE] Release Apache Hadoop 3.3.5

Mukund and I have put together a release candidate (RC0) for Hadoop 3.3.5.

Given the time of year it's a bit unrealistic to run a 5 day vote and expect people to be able to test it thoroughly enough to make this the one we can ship.

What we would like is for anyone who can to verify the tarballs, and test the binaries, especially anyone who can try the arm64 binaries. We've got the building of those done and now the build file will incorporate them into the release -but neither of us have actually tested it yet. Maybe I should try it on my pi400 over xmas.

The maven artifacts are up on the apache staging repo -they are the ones from x86 build. Building and testing downstream apps will be incredibly helpful.

The RC is available at:
https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.5-RC0/

The git tag is release-3.3.5-RC0, commit 3262495904d

The maven artifacts are staged at
https://repository.apache.org/content/repositories/orgapachehadoop-1365/

You can find my public key at:
https://dist.apache.org/repos/dist/release/hadoop/common/KEYS

Change log
https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.5-RC0/CHANGELOG.md

Release notes
https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.5-RC0/RELEASENOTES.md

This is off branch-3.3 and is the first big release since 3.3.2.

Key changes include

- Big update of dependencies to try and keep those reports of transitive CVEs under control -both genuine and false positive.
- HDFS RBF enhancements
- Critical fix to ABFS input stream prefetching for correct reading.
- Vectored IO API for all FSDataInputStream implementations, with high-performance versions for file:// and s3a:// filesystems.
  - file:// through java native io
  - s3a:// parallel GET requests.
- This release includes Arm64 binaries. Please can anyone with compatible systems validate these.

Please try the release and vote on it, even though i don't know what is a good timeline here...i'm actually going on holiday in early jan. Mukund is around and so can drive the process while I'm offline.

Assuming we do have another iteration, the RC1 will not be before mid jan for that reason

Steve (and mukund)

LuciferYang · 2022-12-23T02:12:11Z

@sunchao @steveloughran @dongjoon-hyun Now all GA tasks have passed except the Spark on Kubernetes integration test, but I think it will also pass once Hadoop 3.3.5 can be downloaded from https://repo1.maven.org/maven2/

LuciferYang · 2022-12-23T07:18:13Z

I will keep this pr open to test the next rc or release in time

steveloughran · 2022-12-23T14:41:43Z

so the k8s integration test doesn't pick up any -Psnapshots-and-staging profile?

LuciferYang · 2022-12-26T02:41:29Z

> so the k8s integration test doesn't pick up any -Psnapshots-and-staging profile?

yes

steveloughran · 2023-01-31T10:06:07Z

trying to come out with a new RC; few remaining blockers (hdfs IPC regression, some yarn thing and javadocs not getting in to site)

dongjoon-hyun · 2023-01-31T17:50:25Z

Thank you so much, @steveloughran .

LuciferYang · 2023-02-01T12:08:21Z

Thanks @steveloughran

steveloughran · 2023-02-08T17:30:40Z

BTW, I've been testing #39185 on 3.3.5, switching to the new manifest committer added for abfs/gcs commit performance; works well. That change doesn't depend on this PR, it just chooses the new committer if found on the classpath

LuciferYang · 2023-03-13T13:34:29Z

As before, there are no more failed cases

steveloughran · 2023-03-15T20:44:22Z

got a new RC up to play with...hopefully RC3 will ship. main changes are fixes to some HDFS cases which can trigger NPEs

dongjoon-hyun

Hi, @LuciferYang . Could you update this PR to the official release? :)

dongjoon-hyun · 2023-03-23T17:27:02Z

dev/deps/spark-deps-hadoop-3-hive-2.3

 commons-pool/1.5.4//commons-pool-1.5.4.jar
 commons-text/1.10.0//commons-text-1.10.0.jar
 compress-lzf/1.1.2//compress-lzf-1.1.2.jar
+cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar


BTW, is this introduced back?

yes, but it's the version with the updated suffix list. apache/hadoop#4444

It's a little magical: apache/hadoop#4444 is a version upgrade. I think it would be easier to understand if the change looked like

cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar -> cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar

but when using Hadoop 3.3.4, this dependency does not appear in spark-deps-hadoop-3-hive-2.3

Is this caused by #39124 (comment)?

it was removed in Hadoop 3.3.4 (via https://issues.apache.org/jira/browse/HADOOP-18307) but added back in Hadoop 3.3.5
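Questions like "is this introduced back?" come down to comparing two dependency manifests. Below is a minimal sketch of such a comparison, assuming each manifest line has the `artifact/version/classifier/file.jar` shape seen in the quoted diff lines. The class `DepsManifestDiff` and its methods are hypothetical and not part of Spark's tooling, and the sample entries in `main` are illustrative, pieced together from lines quoted in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class DepsManifestDiff {

    /** Parses "artifact/version/classifier/file.jar" lines into artifact -> version. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> versions = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // limit -1 keeps the empty classifier field between the two slashes
            String[] parts = trimmed.split("/", -1);
            if (parts.length >= 2) {
                versions.put(parts[0], parts[1]);
            }
        }
        return versions;
    }

    /** Reports added (+), removed (-) and version-changed (~) artifacts, sorted by name. */
    static List<String> diff(Map<String, String> before, Map<String, String> after) {
        TreeSet<String> names = new TreeSet<>(before.keySet());
        names.addAll(after.keySet());
        List<String> changes = new ArrayList<>();
        for (String name : names) {
            String oldVersion = before.get(name);
            String newVersion = after.get(name);
            if (oldVersion == null) {
                changes.add("+ " + name + " " + newVersion);
            } else if (newVersion == null) {
                changes.add("- " + name + " " + oldVersion);
            } else if (!oldVersion.equals(newVersion)) {
                changes.add("~ " + name + " " + oldVersion + " -> " + newVersion);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Illustrative entries only, not the full manifests.
        List<String> with334 = Arrays.asList(
            "hadoop-client-api/3.3.4//hadoop-client-api-3.3.4.jar",
            "jaxb-api/2.2.11//jaxb-api-2.2.11.jar");
        List<String> with335 = Arrays.asList(
            "cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar",
            "hadoop-client-api/3.3.5//hadoop-client-api-3.3.5.jar");
        diff(parse(with334), parse(with335)).forEach(System.out::println);
    }
}
```

An artifact that shows up as `+` rather than `~` was absent from the earlier manifest, which is the case discussed here for `cos_api-bundle`.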

dongjoon-hyun · 2023-03-23T17:27:15Z

dev/deps/spark-deps-hadoop-3-hive-2.3

+hadoop-client-api/3.3.5//hadoop-client-api-3.3.5.jar
+hadoop-client-runtime/3.3.5//hadoop-client-runtime-3.3.5.jar
+hadoop-cloud-storage/3.3.5//hadoop-cloud-storage-3.3.5.jar
+hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar


Ditto. Is this added back?

mmm. provided it doesn't interfere with everyone else, then getting it means spark will work out the box with that storage.

I'm just worried about the HADOOP-18307 (Remove hadoop-cos as a dependency of hadoop-cloud-storage) situation.

@dongjoon-hyun Do we need any additional checks for this dependency?

I'm just curious whether an issue similar to the one described in https://issues.apache.org/jira/browse/HADOOP-18159 could happen again if we include hadoop-cos and cos_api-bundle in Spark's class path. We actually just ran into this exact issue recently :)

It'd be nice if there is an easy way to make this optional.

steveloughran · 2023-03-23T20:28:45Z

the hadoop 3.3.5 release is now officially out.

dongjoon-hyun · 2023-03-23T20:43:13Z

Ya, I saw the official Hadoop release and want to resume this, @steveloughran and @LuciferYang . :)

LuciferYang · 2023-03-24T00:41:17Z

Removed the ASF staging repository; re-testing.

LuciferYang · 2023-03-24T04:29:53Z

All GA tasks passed with the official release.

dongjoon-hyun

Shall we exclude the following dependencies from our side and let the user add them if they need?

cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar

LuciferYang · 2023-03-25T00:42:49Z

> Shall we exclude the following dependencies from our side and let the user add them if they need?
> cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
> hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar

Excluded them from the hadoop-cloud module.

LuciferYang · 2023-03-25T01:15:12Z

dev/deps/spark-deps-hadoop-3-hive-2.3

 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
-jaxb-api/2.2.11//jaxb-api-2.2.11.jar


Do we need to manually add back this dependency? It disappeared from hadoop-aliyun's dependency chain:

3.3.4

[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.4:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.4:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |     +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] |  |     +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |     |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |     |  +- javax.xml.bind:jaxb-api:jar:2.2.11:compile
[INFO] |  |     |  +- org.ini4j:ini4j:jar:0.5.4:compile
[INFO] |  |     |  +- io.opentracing:opentracing-api:jar:0.33.0:compile
[INFO] |  |     |  \- io.opentracing:opentracing-util:jar:0.33.0:compile
[INFO] |  |     |     \- io.opentracing:opentracing-noop:jar:0.33.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-ram:jar:3.1.0:compile
[INFO] |  |     \- com.aliyun:aliyun-java-sdk-kms:jar:2.11.0:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.4:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile

3.3.5

[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.5:compile
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:3.3.5:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.5:compile
[INFO] |  |  +- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |  |  +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] |  |  |  +- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |  |  |  +- org.ini4j:ini4j:jar:0.5.4:compile
[INFO] |  |  |  |  +- io.opentracing:opentracing-api:jar:0.33.0:compile
[INFO] |  |  |  |  \- io.opentracing:opentracing-util:jar:0.33.0:compile
[INFO] |  |  |  |     \- io.opentracing:opentracing-noop:jar:0.33.0:compile
[INFO] |  |  |  +- com.aliyun:aliyun-java-sdk-ram:jar:3.1.0:compile
[INFO] |  |  |  \- com.aliyun:aliyun-java-sdk-kms:jar:2.11.0:compile
[INFO] |  |  \- org.codehaus.jettison:jettison:jar:1.5.3:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.5:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile

apache/hadoop@72f8c2a

exclude jaxb-api from aliyun-sdk-oss

steveloughran · 2023-03-27T11:13:39Z

what version of jettison has come in from hadoop-common?

HADOOP-18676 has gone in this weekend to exclude transitive jettison dependencies which don't get into a hadoop tarball, but will come in from pom imports.

LuciferYang · 2023-03-27T12:06:18Z

> what version of jettison has come in from hadoop-common?
>
> HADOOP-18676 has gone in this weekend to exclude transitive jettison dependencies which don't get into a hadoop tarball, but will come in from pom imports.

1.5.3

dongjoon-hyun

+1, LGTM for Apache Spark 3.5.0 from my side. Thank you, @LuciferYang .

LuciferYang · 2023-03-27T15:49:47Z

Thanks @dongjoon-hyun @sunchao @steveloughran

dongjoon-hyun · 2023-03-27T15:53:11Z

Ya, right. I forgot to say that. Thank you so much, @steveloughran and @sunchao too. 😄

This pr aims to upgrade Hadoop from 3.3.4 to 3.3.5.

Hadoop 3.3.5 brings many bug fixes as well as CVE fixes, such as
- HADOOP-18333 hadoop-client-runtime impact by CVE-2022-2047 CVE-2022-2048 due to shaded jetty
- HADOOP-18468: upgrade jettison json jar due to fix CVE-2022-40149
- HADOOP-18493 update jackson-databind 2.12.7.1 due to CVE fixes
- HADOOP-18497 Upgrade commons-text version to fix CVE-2022-42889
- HADOOP-18484 upgrade hsqldb to v2.7.1 due to CVE
- HADOOP-18561 CVE-2021-37533 on commons-net is included in hadoop common and hadoop-client-runtime
- HADOOP-18587 upgrade to jettison 1.5.3 to fix CVE-2022-40150

At the same time, this version brings a high performance vectored read API: HADOOP-18103, this may be used by future versions of `Orc` and `Parquet` to improve read performance.

The release notes and change log as follows:
- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html
- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html

Yes, `jaxb-api-2.2.11.jar` is no longer in `spark-deps-hadoop-3-hive-2.3` due to HADOOP-18641

Pass GitHub Actions

Closes apache#39124 from LuciferYang/test-hadoop-335.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

### What changes were proposed in this pull request?

This PR aims to downgrade the Apache Hadoop dependency to 3.3.4 in `Apache Spark 3.5` in order to prevent any regression from `Apache Spark 3.4.x`. In other words, although `Apache Spark 3.5.x` will lose many bug fixes of Apache Hadoop 3.3.5 and 3.3.6, it will be in the same situation with `Apache Spark 3.4.x`.
- SPARK-44197 Upgrade Hadoop to 3.3.6 (#41744)
- SPARK-42913 Upgrade Hadoop to 3.3.5 (#39124)
- SPARK-43448 Remove dummy dependency `hadoop-openstack` (#41133)

On top of reverting SPARK-44197 and SPARK-42913, this PR has additional dependency exclusion change due to the following.
- SPARK-43880 Organize `hadoop-cloud` in standard maven project structure (#41380)

### Why are the changes needed?

There is a community report on S3A committer performance regression. Although it's one liner fix, there is no available Hadoop release with that fix at this time.
- HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer (apache/hadoop#5706)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #42345 from dongjoon-hyun/SPARK-44678.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

LuciferYang added 2 commits December 19, 2022 15:23

upgrade

128a1a1

update deps file

20738ab

github-actions bot added BUILD SQL labels Dec 19, 2022

Merge branch 'upmaster' into test-hadoop-335

efec8ce

LuciferYang added 3 commits December 22, 2022 13:33

tmp change ADDITIONAL_REMOTE_REPOSITORIES

7f2e23b

Merge branch 'upmaster' into test-hadoop-335

1392558

tmp change k8s it

2f53c26

github-actions bot added the KUBERNETES label Dec 22, 2022

Merge branch 'apache:master' into test-hadoop-335

35320cb

Merge branch 'upmaster' into test-hadoop-335

9872cf2

LuciferYang added 2 commits March 20, 2023 12:58

Merge branch 'apache:master' into test-hadoop-335

16a70d1

Merge branch 'apache:master' into test-hadoop-335

c3ecb5b

dongjoon-hyun reviewed Mar 23, 2023


remove asf staging

38169b9

github-actions bot removed the KUBERNETES label Mar 24, 2023

LuciferYang changed the title ~~[DON'T MERGE] Test build and test with hadoop 3.3.5-RC2~~ [DON'T MERGE] Test build and test with hadoop 3.3.5 Mar 24, 2023

LuciferYang changed the title ~~[DON'T MERGE] Test build and test with hadoop 3.3.5~~ [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5 Mar 24, 2023

dongjoon-hyun reviewed Mar 24, 2023


exclucde hadoop-cos

f826237

Merge branch 'apache:master' into test-hadoop-335

6169895

LuciferYang commented Mar 25, 2023


Merge branch 'apache:master' into test-hadoop-335

ef8f3aa

dongjoon-hyun approved these changes Mar 27, 2023


dongjoon-hyun closed this in de4bc29 Mar 27, 2023

dongjoon-hyun mentioned this pull request Aug 4, 2023

[SPARK-44678][BUILD][3.5] Downgrade Hadoop to 3.3.4 #42345

Closed
