
Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Dec 19, 2022

What changes were proposed in this pull request?

This PR aims to upgrade Hadoop from 3.3.4 to 3.3.5.

Why are the changes needed?

Hadoop 3.3.5 brings many bug fixes as well as CVE fixes, such as:

  • HADOOP-18333: hadoop-client-runtime impacted by CVE-2022-2047 and CVE-2022-2048 due to shaded Jetty
  • HADOOP-18468: upgrade Jettison JSON jar to fix CVE-2022-40149
  • HADOOP-18493: update jackson-databind to 2.12.7.1 due to CVE fixes
  • HADOOP-18497: upgrade commons-text version to fix CVE-2022-42889
  • HADOOP-18484: upgrade HSQLDB to v2.7.1 due to a CVE
  • HADOOP-18561: CVE-2021-37533 on commons-net is included in hadoop-common and hadoop-client-runtime
  • HADOOP-18587: upgrade to Jettison 1.5.3 to fix CVE-2022-40150

At the same time, this version brings a high-performance vectored read API (HADOOP-18103), which may be used by future versions of ORC and Parquet to improve read performance.
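The vectored read API lets a caller hand the filesystem a batch of (offset, length) ranges; implementations can coalesce nearby ranges and fetch them in parallel. A minimal, self-contained sketch of the coalescing idea behind the performance win (`Range` here is an illustrative stand-in, not Hadoop's `FileRange` class):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class VectoredReadSketch {
    // Stand-in for Hadoop's FileRange: one (offset, length) read request.
    record Range(long offset, long length) {
        long end() { return offset + length; }
    }

    // Merge ranges whose gaps are no bigger than maxGap, so one larger
    // pread/GET can serve several small requests; this mirrors the
    // coalescing a vectored-read implementation performs before issuing
    // the reads in parallel.
    static List<Range> coalesce(List<Range> ranges, long maxGap) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(Range::offset));
        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()
                    && r.offset() - merged.get(merged.size() - 1).end() <= maxGap) {
                Range last = merged.remove(merged.size() - 1);
                long end = Math.max(last.end(), r.end());
                merged.add(new Range(last.offset(), end - last.offset()));
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Range> requests = List.of(
            new Range(0, 100), new Range(120, 80), new Range(10_000, 500));
        // The first two ranges are close together and merge; the third stays separate.
        System.out.println(coalesce(requests, 64));
    }
}
```

Columnar readers like ORC and Parquet issue exactly this kind of many-small-ranges pattern, which is why the API matters for them.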

The release notes and change log are as follows:

  • https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html
  • https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html

Does this PR introduce any user-facing change?

Yes, jaxb-api-2.2.11.jar is no longer in spark-deps-hadoop-3-hive-2.3 due to HADOOP-18641

How was this patch tested?

Pass GitHub Actions

@LuciferYang
Contributor Author

LuciferYang commented Dec 20, 2022

Many tests failed similarly to the following:

2022-12-20T03:15:37.0609530Z [info] org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite *** ABORTED *** (28 milliseconds)
2022-12-20T03:15:37.0701184Z [info]   java.lang.reflect.InvocationTargetException:
2022-12-20T03:15:37.0701846Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0702983Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0703732Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0704398Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0705400Z [info]   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
2022-12-20T03:15:37.0706077Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:514)
2022-12-20T03:15:37.0706751Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:374)
2022-12-20T03:15:37.0707378Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.$anonfun$client$1(TestHive.scala:90)
2022-12-20T03:15:37.0707917Z [info]   at scala.Option.getOrElse(Option.scala:189)
2022-12-20T03:15:37.0708804Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client$lzycompute(TestHive.scala:90)
2022-12-20T03:15:37.0709589Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client(TestHive.scala:88)
2022-12-20T03:15:37.0710320Z [info]   at org.apache.spark.sql.hive.test.TestHiveSingleton.$init$(TestHiveSingleton.scala:33)
2022-12-20T03:15:37.0711253Z [info]   at org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite.<init>(AlterTableAddColumnsSuite.scala:27)
2022-12-20T03:15:37.0712160Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0712844Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0713829Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0714480Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0714972Z [info]   at java.lang.Class.newInstance(Class.java:442)
2022-12-20T03:15:37.0715625Z [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:454)
2022-12-20T03:15:37.0716141Z [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2022-12-20T03:15:37.0716638Z [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2022-12-20T03:15:37.0717222Z [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-12-20T03:15:37.0718079Z [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-12-20T03:15:37.0718637Z [info]   at java.lang.Thread.run(Thread.java:750)
2022-12-20T03:15:37.0719260Z [info]   Cause: java.lang.RuntimeException: Failed to initialize default Hive configuration variables!
2022-12-20T03:15:37.0719939Z [info]   at org.apache.hadoop.hive.conf.HiveConf.getConfVarInputStream(HiveConf.java:3638)
2022-12-20T03:15:37.0720558Z [info]   at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:4057)
2022-12-20T03:15:37.0721115Z [info]   at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:4014)
2022-12-20T03:15:37.0721873Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl$.newHiveConf(HiveClientImpl.scala:1309)
2022-12-20T03:15:37.0722615Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:176)
2022-12-20T03:15:37.0723562Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:141)
2022-12-20T03:15:37.0724265Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0725154Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0815583Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0816308Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0817005Z [info]   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
2022-12-20T03:15:37.0817691Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:514)
2022-12-20T03:15:37.0818294Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:374)
2022-12-20T03:15:37.0818947Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.$anonfun$client$1(TestHive.scala:90)
2022-12-20T03:15:37.0819658Z [info]   at scala.Option.getOrElse(Option.scala:189)
2022-12-20T03:15:37.0820254Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client$lzycompute(TestHive.scala:90)
2022-12-20T03:15:37.0820931Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client(TestHive.scala:88)
2022-12-20T03:15:37.0821578Z [info]   at org.apache.spark.sql.hive.test.TestHiveSingleton.$init$(TestHiveSingleton.scala:33)
2022-12-20T03:15:37.0822321Z [info]   at org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite.<init>(AlterTableAddColumnsSuite.scala:27)
2022-12-20T03:15:37.0823043Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0823728Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0824474Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0825300Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0825805Z [info]   at java.lang.Class.newInstance(Class.java:442)
2022-12-20T03:15:37.0826341Z [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:454)
2022-12-20T03:15:37.0826959Z [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2022-12-20T03:15:37.0827461Z [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2022-12-20T03:15:37.0832346Z [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-12-20T03:15:37.0838605Z [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-12-20T03:15:37.0844439Z [info]   at java.lang.Thread.run(Thread.java:750)
2022-12-20T03:15:37.0851150Z [info]   Cause: java.lang.IllegalArgumentException: Not supported: http://javax.xml.XMLConstants/property/accessExternalDTD
2022-12-20T03:15:37.0857679Z [info]   at org.apache.xalan.processor.TransformerFactoryImpl.setAttribute(TransformerFactoryImpl.java:571)
2022-12-20T03:15:37.0863755Z [info]   at org.apache.hadoop.util.XMLUtils.newSecureTransformerFactory(XMLUtils.java:141)
2022-12-20T03:15:37.0869737Z [info]   at org.apache.hadoop.conf.Configuration.writeXml(Configuration.java:3584)
2022-12-20T03:15:37.0875703Z [info]   at org.apache.hadoop.conf.Configuration.writeXml(Configuration.java:3550)
2022-12-20T03:15:37.0881683Z [info]   at org.apache.hadoop.conf.Configuration.writeXml(Configuration.java:3546)
2022-12-20T03:15:37.0887575Z [info]   at org.apache.hadoop.hive.conf.HiveConf.getConfVarInputStream(HiveConf.java:3634)
2022-12-20T03:15:37.0893660Z [info]   at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:4057)
2022-12-20T03:15:37.0898428Z [info]   at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:4014)
2022-12-20T03:15:37.0904308Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl$.newHiveConf(HiveClientImpl.scala:1309)
2022-12-20T03:15:37.0910423Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:176)
2022-12-20T03:15:37.0916293Z [info]   at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:141)
2022-12-20T03:15:37.0921497Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.0927701Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.0932171Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.0938174Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.0943319Z [info]   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
2022-12-20T03:15:37.0992641Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:514)
2022-12-20T03:15:37.1065786Z [info]   at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:374)
2022-12-20T03:15:37.1066478Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.$anonfun$client$1(TestHive.scala:90)
2022-12-20T03:15:37.1067041Z [info]   at scala.Option.getOrElse(Option.scala:189)
2022-12-20T03:15:37.1067646Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client$lzycompute(TestHive.scala:90)
2022-12-20T03:15:37.1068489Z [info]   at org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client(TestHive.scala:88)
2022-12-20T03:15:37.1069148Z [info]   at org.apache.spark.sql.hive.test.TestHiveSingleton.$init$(TestHiveSingleton.scala:33)
2022-12-20T03:15:37.1069906Z [info]   at org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite.<init>(AlterTableAddColumnsSuite.scala:27)
2022-12-20T03:15:37.1070634Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-20T03:15:37.1071314Z [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-20T03:15:37.1072059Z [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-20T03:15:37.1072709Z [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2022-12-20T03:15:37.1073209Z [info]   at java.lang.Class.newInstance(Class.java:442)
2022-12-20T03:15:37.1073822Z [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:454)
2022-12-20T03:15:37.1074354Z [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2022-12-20T03:15:37.1074847Z [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2022-12-20T03:15:37.1075432Z [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-12-20T03:15:37.1076054Z [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-12-20T03:15:37.1076558Z [info]   at java.lang.Thread.run(Thread.java:750)

The tests failed because the Hive conf failed to initialize after upgrading to Hadoop 3.3.5; it seems that Spark needs to wait for Hive to support Hadoop 3.3.5 first?

cc @sunchao @dongjoon-hyun FYI

@LuciferYang
Contributor Author

also cc @wangyum

@sunchao
Member

sunchao commented Dec 20, 2022

cc @steveloughran any idea on what could have caused the above error?

@LuciferYang
Contributor Author

LuciferYang commented Dec 21, 2022

Maybe due to https://github.com/apache/hadoop/pull/4940/files? Some XML parser features were disabled, possibly to fix CVE-2022-34169?

@LuciferYang
Contributor Author

Hmm, maybe there is some conflict: the attribute ACCESS_EXTERNAL_DTD is not recognized by the TransformerFactory

https://github.com/apache/hadoop/blob/5f08e51b72330b2dd2405896b39179a64a3a7efe/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/XMLUtils.java#L141
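The conflict can be reproduced without Hadoop at all: XMLUtils applies the JAXP 1.5 lockdown attributes, and the standalone Xalan `TransformerFactoryImpl` on the test classpath rejects them with exactly the `IllegalArgumentException` in the stack trace above. A minimal sketch (class and method names here are illustrative, not Hadoop's):

```java
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerFactory;

public class SecureTransformerCheck {

    // Try to apply the JAXP 1.5 lockdown attributes. The JDK's built-in
    // factory accepts them, while the old standalone Xalan
    // TransformerFactoryImpl throws IllegalArgumentException instead.
    static boolean supportsLockdownAttributes(TransformerFactory factory) {
        try {
            factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
            return true;
        } catch (IllegalArgumentException e) {
            // This is the failure mode hit here; HADOOP-18575 later made
            // Hadoop tolerate such factories instead of failing.
            return false;
        }
    }

    public static void main(String[] args) {
        TransformerFactory factory = TransformerFactory.newInstance();
        System.out.println(factory.getClass().getName()
            + " supports lockdown attributes: " + supportsLockdownAttributes(factory));
    }
}
```

Which branch you hit depends entirely on which `TransformerFactory` implementation service-loading picks up from the classpath.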

@LuciferYang
Contributor Author

@steveloughran Do you know the correct class type that XMLUtils.newSecureTransformerFactory should return? I want to try configuring javax.xml.transform.TransformerFactory.

@LuciferYang
Contributor Author

efec8ce merged with master, and org.apache.spark.sql.hive.execution.command.AlterTableAddColumnsSuite now passes locally; let's retry with GA.

@steveloughran
Contributor

this should have been fixed by "HADOOP-18575. Make XML transformer factory more lenient (#5224)", which is in the "real" rc0 I'm going to put up. That little one we did last week was really an attempt at debugging the process of getting a release built where the x86 code is done on an EC2 VM and arm64 on my laptop, making sure only the x86 artifacts are the ones we publish as staging, renaming/re-signing the arm stuff, etc. (that bit still needs automation in https://github.com/steveloughran/validate-hadoop-client-artifacts ...)

doing the rc0 release process today

thank you for doing this branch! I'd verified the compile was good, but hadn't run the tests.

@steveloughran
Contributor

oh, and the change isn't related to that Xalan CVE; it's more that we wanted to put all XML parser/XSL transformer creation into one place and lock it down, so as to avoid any risk of some instances being created without secure settings (HADOOP-18469: Add XMLUtils methods to centralise code that creates secure XML parsers).

ironically, Sonatype security scans are already warning on Hadoop versions without the change; if we hadn't done the lockdown it wouldn't be complaining. Makes you want to not bother, doesn't it?

@LuciferYang
Contributor Author

Re-triggering GA found that the Hadoop 3.3.5 dependencies could not be downloaded. Let's wait until they can be downloaded again to re-analyze the test failures.

@steveloughran
Contributor

the real rc0 is up; announcement below. I suspect it will be the transitive jar updates and other lockdown options which create issues. We had to downgrade jackson for Tez in HADOOP-18332; then there's jetty, which we left alone.

I'd like to see if I can get apache/hadoop#4996 ready for an rc1 so we can cut protobuf 2.5 (which was removed, then reinstated as a dependency). Once cut, only those apps which need it can add it themselves.

arm binaries too. I'm also wondering if we should do a lean build without the fat shaded aws sdk. We need that for classpath reasons; it's just so huge as it contains everything, even though nobody is trying to control aws satellite groundstations from big data apps (analysis, yes; but control, nope, yet it's in com.amazonaws.services.groundstation, hence the eternal bloat). Cut that jar and the distro is half the size.


From: Steve Loughran
Date: Wed, 21 Dec 2022 at 19:28
Subject: [VOTE] Release Apache Hadoop 3.3.5

Mukund and I have put together a release candidate (RC0) for Hadoop 3.3.5.

Given the time of year it's a bit unrealistic to run a 5-day vote and expect people to be able to test it thoroughly enough to make this the one we can ship.

What we would like is for anyone who can to verify the tarballs and test the binaries, especially anyone who can try the arm64 binaries. We've got the building of those done, and now the build file will incorporate them into the release, but neither of us has actually tested it yet. Maybe I should try it on my pi400 over xmas.

The maven artifacts are up on the apache staging repo; they are the ones from the x86 build. Building and testing downstream apps will be incredibly helpful.

The RC is available at:
https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.5-RC0/

The git tag is release-3.3.5-RC0, commit 3262495904d

The maven artifacts are staged at
https://repository.apache.org/content/repositories/orgapachehadoop-1365/

You can find my public key at:
https://dist.apache.org/repos/dist/release/hadoop/common/KEYS

Change log
https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.5-RC0/CHANGELOG.md

Release notes
https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.5-RC0/RELEASENOTES.md

This is off branch-3.3 and is the first big release since 3.3.2.

Key changes include

  • Big update of dependencies to try and keep those reports of
    transitive CVEs under control, both genuine and false positive.
  • HDFS RBF enhancements
  • Critical fix to ABFS input stream prefetching for correct reading.
  • Vectored IO API for all FSDataInputStream implementations, with
    high-performance versions for the file:// and s3a:// filesystems:
    file:// through java native io,
    s3a:// through parallel GET requests.
  • This release includes Arm64 binaries. Please can anyone with
    compatible systems validate these.

Please try the release and vote on it, even though I don't know what a good timeline is here; I'm actually going on holiday in early jan. Mukund is around and so can drive the process while I'm offline.

Assuming we do have another iteration, the RC1 will not be before mid-jan for that reason.

Steve (and mukund)

@LuciferYang
Contributor Author

@sunchao @steveloughran @dongjoon-hyun All GA tasks have now passed except the Spark on Kubernetes integration test, but I think that will also pass once Hadoop 3.3.5 can be downloaded from https://repo1.maven.org/maven2/

@LuciferYang
Contributor Author

I will keep this PR open to test the next RC or the release in time

@steveloughran
Contributor

so the k8s integration test doesn't pick up any -Psnapshots-and-staging profile?

@LuciferYang
Contributor Author

so the k8s integration test doesn't pick up any -Psnapshots-and-staging profile?

yes

@steveloughran
Contributor

trying to come out with a new RC; a few remaining blockers (an HDFS IPC regression, some YARN thing, and javadocs not getting into the site)

@dongjoon-hyun
Member

Thank you so much, @steveloughran .

@LuciferYang
Contributor Author

Thanks @steveloughran

@steveloughran
Contributor

BTW, I've been testing #39185 on 3.3.5, switching to the new manifest committer added for ABFS/GCS commit performance; it works well. That change doesn't depend on this PR, it just chooses the new committer if it's found on the classpath.

@LuciferYang
Contributor Author

As before, there are no more failed cases

@steveloughran
Contributor

got a new RC up to play with...hopefully RC3 will ship. The main changes are fixes to some HDFS cases which can trigger NPEs.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @LuciferYang . Could you update this PR to the official release? :)

commons-pool/1.5.4//commons-pool-1.5.4.jar
commons-text/1.10.0//commons-text-1.10.0.jar
compress-lzf/1.1.2//compress-lzf-1.1.2.jar
cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
Member

BTW, is this introduced back?

Contributor

yes, but it's the version with the updated suffix list. apache/hadoop#4444

Contributor Author

It's a little magical: apache/hadoop#4444 is a version upgrade, so it would be easier to understand if a change like

cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar  -> cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar

occurred here, but when using Hadoop 3.3.4, this dependency does not appear in spark-deps-hadoop-3-hive-2.3 at all
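Questions like this come down to diffing the spark-deps manifest files between two Hadoop versions. A small hypothetical helper (`DepsDiff` is not part of Spark's dev tooling; the manifest line format is taken from the snippets quoted in this thread) makes such changes explicit:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DepsDiff {
    // Manifest lines look like "cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar":
    // the artifact name is everything before the first '/'.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> deps = new LinkedHashMap<>();
        for (String line : lines) {
            if (!line.isBlank()) {
                deps.put(line.substring(0, line.indexOf('/')), line);
            }
        }
        return deps;
    }

    // Classify every artifact as ADDED, CHANGED (version bump), or REMOVED.
    static List<String> report(List<String> oldManifest, List<String> newManifest) {
        Map<String, String> before = parse(oldManifest);
        Map<String, String> after = parse(newManifest);
        List<String> diff = new ArrayList<>();
        after.forEach((name, line) -> {
            if (!before.containsKey(name)) diff.add("ADDED   " + line);
            else if (!before.get(name).equals(line))
                diff.add("CHANGED " + before.get(name) + " -> " + line);
        });
        before.forEach((name, line) -> {
            if (!after.containsKey(name)) diff.add("REMOVED " + line);
        });
        return diff;
    }

    public static void main(String[] args) {
        List<String> withHadoop334 = List.of("jaxb-api/2.2.11//jaxb-api-2.2.11.jar");
        List<String> withHadoop335 = List.of("cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar");
        report(withHadoop334, withHadoop335).forEach(System.out::println);
    }
}
```

On the two example lines this prints one ADDED entry (cos_api-bundle appearing) and one REMOVED entry (jaxb-api disappearing), the two surprises discussed in this review.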

Contributor Author

Is this caused by #39124 (comment)?

Member

it was removed in Hadoop 3.3.4 (via https://issues.apache.org/jira/browse/HADOOP-18307) but added back in Hadoop 3.3.5

hadoop-client-api/3.3.5//hadoop-client-api-3.3.5.jar
hadoop-client-runtime/3.3.5//hadoop-client-runtime-3.3.5.jar
hadoop-cloud-storage/3.3.5//hadoop-cloud-storage-3.3.5.jar
hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar
Member

Ditto. Is this added back?

Contributor

mmm. provided it doesn't interfere with everyone else, then including it means spark will work out of the box with that storage.

Member

I'm just worried about the HADOOP-18307 (Remove hadoop-cos as a dependency of hadoop-cloud-storage) situation.

Contributor Author

@dongjoon-hyun Do we need additional checks for this dependency?

Member

I'm just curious whether a similar issue to the one described in https://issues.apache.org/jira/browse/HADOOP-18159 could happen again if we include hadoop-cos and cos_api-bundle on Spark's class path. We actually just ran into this exact issue recently :)

It'd be nice if there were an easy way to make this optional.

@steveloughran
Contributor

the hadoop 3.3.5 release is now officially out.

@dongjoon-hyun
Member

Ya, I saw the official Hadoop release and want to resume this, @steveloughran and @LuciferYang . :)

@LuciferYang LuciferYang changed the title [DON'T MERGE] Test build and test with hadoop 3.3.5-RC2 [DON'T MERGE] Test build and test with hadoop 3.3.5 Mar 24, 2023
@LuciferYang
Contributor Author

Removed the ASF staging repository and re-tested

@LuciferYang LuciferYang changed the title [DON'T MERGE] Test build and test with hadoop 3.3.5 [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5 Mar 24, 2023
@LuciferYang
Contributor Author

All GA tasks passed with the official release

Member

@dongjoon-hyun dongjoon-hyun left a comment


Shall we exclude the following dependencies from our side and let the user add them if they need?

cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar

@LuciferYang
Contributor Author

Shall we exclude the following dependencies from our side and let the user add them if they need?

cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar

Excluded them from the hadoop-cloud module
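For reference, such an exclusion in the hadoop-cloud module's pom.xml could look roughly like the sketch below. This is an assumption about the shape of the change, not the actual diff; in particular the cos_api-bundle groupId (com.qcloud) should be checked against the real dependency tree.

```xml
<!-- Hypothetical sketch: pull in hadoop-cloud-storage but keep the COS
     connector and its bundled SDK off Spark's classpath; users who need
     COS support can add these two artifacts themselves. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-cloud-storage</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-cos</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.qcloud</groupId>
      <artifactId>cos_api-bundle</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```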

javassist/3.25.0-GA//javassist-3.25.0-GA.jar
javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
javolution/5.5.1//javolution-5.5.1.jar
jaxb-api/2.2.11//jaxb-api-2.2.11.jar
Contributor Author

Do we need to manually add back this dependency? It disappeared from hadoop-aliyun's dependency chain:

3.3.4

[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.4:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.4:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |     +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] |  |     +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |     |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |     |  +- javax.xml.bind:jaxb-api:jar:2.2.11:compile
[INFO] |  |     |  +- org.ini4j:ini4j:jar:0.5.4:compile
[INFO] |  |     |  +- io.opentracing:opentracing-api:jar:0.33.0:compile
[INFO] |  |     |  \- io.opentracing:opentracing-util:jar:0.33.0:compile
[INFO] |  |     |     \- io.opentracing:opentracing-noop:jar:0.33.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-ram:jar:3.1.0:compile
[INFO] |  |     \- com.aliyun:aliyun-java-sdk-kms:jar:2.11.0:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.4:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile

3.3.5

[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.5:compile
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:3.3.5:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.5:compile
[INFO] |  |  +- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |  |  +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] |  |  |  +- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |  |  |  +- org.ini4j:ini4j:jar:0.5.4:compile
[INFO] |  |  |  |  +- io.opentracing:opentracing-api:jar:0.33.0:compile
[INFO] |  |  |  |  \- io.opentracing:opentracing-util:jar:0.33.0:compile
[INFO] |  |  |  |     \- io.opentracing:opentracing-noop:jar:0.33.0:compile
[INFO] |  |  |  +- com.aliyun:aliyun-java-sdk-ram:jar:3.1.0:compile
[INFO] |  |  |  \- com.aliyun:aliyun-java-sdk-kms:jar:2.11.0:compile
[INFO] |  |  \- org.codehaus.jettison:jettison:jar:1.5.3:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.5:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile

Contributor Author

apache/hadoop@72f8c2a

exclude jaxb-api from aliyun-sdk-oss

@steveloughran
Contributor

what version of jettison has come in from hadoop-common?

HADOOP-18676 has gone in this weekend to exclude transitive jettison dependencies which don't get into a hadoop tarball, but will come in from pom imports.

@LuciferYang
Contributor Author

what version of jettison has come in from hadoop-common?

HADOOP-18676 has gone in this weekend to exclude transitive jettison dependencies which don't get into a hadoop tarball, but will come in from pom imports.

1.5.3

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM for Apache Spark 3.5.0 from my side. Thank you, @LuciferYang .

@LuciferYang
Contributor Author

Thanks @dongjoon-hyun @sunchao @steveloughran

@dongjoon-hyun
Member

Ya, right. I forgot to say that. Thank you so much, @steveloughran and @sunchao too. 😄

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
This pr aims to upgrade Hadoop from 3.3.4 to 3.3.5.

Hadoop 3.3.5 brings many bug fixes as well as CVE fixes, such as

- HADOOP-18333  hadoop-client-runtime impact by CVE-2022-2047 CVE-2022-2048 due to shaded jetty
- HADOOP-18468:  upgrade jettison json jar due to fix CVE-2022-40149
- HADOOP-18493  update jackson-databind 2.12.7.1 due to CVE fixes
- HADOOP-18497  Upgrade commons-text version to fix CVE-2022-42889
- HADOOP-18484  upgrade hsqldb to v2.7.1 due to CVE
- HADOOP-18561  CVE-2021-37533 on commons-net is included in hadoop common and hadoop-client-runtime
- HADOOP-18587  upgrade to jettison 1.5.3 to fix CVE-2022-40150

At the same time, this version brings a high performance vectored read API: HADOOP-18103, this may be used by future versions of `Orc` and `Parquet` to improve read performance.

The release notes and change log as follows:

- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html
- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html

Yes,  `jaxb-api-2.2.11.jar` is no longer in `spark-deps-hadoop-3-hive-2.3` due to HADOOP-18641

Pass GitHub Actions

Closes apache#39124 from LuciferYang/test-hadoop-335.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Aug 4, 2023
### What changes were proposed in this pull request?

This PR aims to downgrade the Apache Hadoop dependency to 3.3.4 in `Apache Spark 3.5` in order to prevent any regression from `Apache Spark 3.4.x`. In other words, although `Apache Spark 3.5.x` will lose many bug fixes of Apache Hadoop 3.3.5 and 3.3.6, it will be in the same situation with `Apache Spark 3.4.x`.
- SPARK-44197 Upgrade Hadoop to 3.3.6 (#41744)
- SPARK-42913 Upgrade Hadoop to 3.3.5 (#39124)
- SPARK-43448 Remove dummy dependency `hadoop-openstack` (#41133)

On top of reverting SPARK-44197 and SPARK-42913, this PR has additional dependency exclusion change due to the following.
- SPARK-43880 Organize `hadoop-cloud` in standard maven project structure (#41380)

### Why are the changes needed?

There is a community report on S3A committer performance regression. Although it's one liner fix, there is no available Hadoop release with that fix at this time.
- HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer (apache/hadoop#5706)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #42345 from dongjoon-hyun/SPARK-44678.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>