[SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5 #39124
|
Many tests failed, similar to the following: The tests failed because the Hive conf failed to initialize after upgrading to Hadoop 3.3.5. It seems that Spark needs to wait for Hive to support Hadoop 3.3.5 first? cc @sunchao @dongjoon-hyun FYI |
|
also cc @wangyum |
|
cc @steveloughran any idea what could have caused the above error? |
|
Maybe due to https://github.com/apache/hadoop/pull/4940/files? Some XML parser features are disabled, possibly to fix CVE-2022-34169? |
|
hmm..., maybe there is some conflict. The attribute ACCESS_EXTERNAL_DTD is not recognized by TransformerFactory |
|
@steveloughran Do you know the correct class type that |
|
efec8ce merge with master, then |
|
This should have been fixed by "HADOOP-18575. Make XML transformer factory more lenient (#5224)", which is in the "real" rc0 I'm going to put up. That little one we did last week was really an attempt at debugging the process of getting a release built where the x86 code is done on an EC2 VM and arm64 on my laptop: making sure only the x86 artifacts are the ones we publish as staging, rename/resign the arm stuff, etc. (that bit still needs automation in https://github.com/steveloughran/validate-hadoop-client-artifacts ...). Doing the rc0 release process today. Thank you for doing this branch! I'd verified the compile was good, but hadn't run the tests. |
|
Oh, and the change isn't related to that xalan CVE; it's more that we wanted to put all XML parser/XSL transformer creation into one place and lock them down, so as to avoid any risk of some instances being created without secure settings (HADOOP-18469). Ironically, Sonatype security scans are already warning on Hadoop versions without the change... if we hadn't done the lockdown it wouldn't be complaining. Makes you want to not bother, doesn't it? |
|
Re-triggering GA showed that the dependencies of Hadoop 3.3.5 could not be downloaded. Let's wait until they are downloadable again before re-analyzing the test failures. |
|
The real rc0 is up; announcement below. I suspect it will be the transitive jar updates and other lockdown options which create issues... we had to downgrade jackson for tez in HADOOP-18332, then there's jetty. Left that alone.

I'd like to see if I can get apache/hadoop#4996 ready for an rc1 so we can cut protobuf 2.5 (which was removed, then reinstated as a dependency). Once cut, only those apps which need it can add it themselves. arm binaries too.

I'm also wondering if we should do a lean build without the fat shaded aws sdk. We need that for classpath reasons; it's just so huge as it contains everything, even though nobody is trying to control aws satellite groundstations from big data apps. (analysis, yes. but control. nope, yet it's in

From: Steve Loughran

Mukund and I have put together a release candidate (RC0) for Hadoop 3.3.5.

Given the time of year it's a bit unrealistic to run a 5 day vote and expect people to be able to test it thoroughly enough to make this the one we can ship. What we would like is for anyone who can to verify the tarballs, and test the binaries, especially anyone who can try the arm64 binaries. We've got the building of those done and now the build file will incorporate them into the release, but neither of us has actually tested it yet. Maybe I should try it on my pi400 over xmas.

The maven artifacts are up on the apache staging repo; they are the ones from the x86 build. Building and testing downstream apps will be incredibly helpful.

The RC is available at:
The git tag is release-3.3.5-RC0, commit 3262495904d
The maven artifacts are staged at
You can find my public key at:
Change log
Release notes

This is off branch-3.3 and is the first big release since 3.3.2. Key changes include

Please try the release and vote on it, even though I don't know what is a good timeline here... I'm actually going on holiday in early Jan. Mukund is around and so can drive the process while I'm offline. Assuming we do have another iteration, the RC1 will not be before mid Jan for that reason.

Steve (and Mukund) |
|
@sunchao @steveloughran @dongjoon-hyun Now all GA tasks have passed, except |
|
I will keep this pr open to test the next rc or release in time |
|
so the k8s integration test doesn't pick up any -Psnapshots-and-staging profile? |
yes |
|
trying to come out with a new RC; few remaining blockers (hdfs IPC regression, some yarn thing and javadocs not getting in to site) |
|
Thank you so much, @steveloughran . |
|
Thanks @steveloughran |
|
BTW, I've been testing #39185 on 3.3.5, switching to the new manifest committer added for abfs/gcs commit performance; it works well. That change doesn't depend on this PR; it just chooses the new committer if it is found on the classpath |
|
As before, there are no more failed cases |
|
got a new RC up to play with...hopefully RC3 will ship. main changes are fixes to some HDFS cases which can trigger NPEs |
dongjoon-hyun
left a comment
Hi, @LuciferYang . Could you update this PR to the official release? :)
| commons-pool/1.5.4//commons-pool-1.5.4.jar | ||
| commons-text/1.10.0//commons-text-1.10.0.jar | ||
| compress-lzf/1.1.2//compress-lzf-1.1.2.jar | ||
| cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar |
BTW, is this introduced back?
yes, but it's the version with the updated suffix list. apache/hadoop#4444
It's a little magical: apache/hadoop#4444 is a version upgrade, so I would have expected to see a change like
cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar -> cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
but when using Hadoop 3.3.4, this dependency does not appear in spark-deps-hadoop-3-hive-2.3 at all.
Is this caused by #39124 (comment)?
it was removed in Hadoop 3.3.4 (via https://issues.apache.org/jira/browse/HADOOP-18307) but added back in Hadoop 3.3.5
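To make this kind of change easier to spot, a small script can diff two dependency manifests. The following is only a rough sketch, not part of Spark's tooling: it assumes manifest lines shaped like the ones quoted in this thread (`name/version/classifier/jar`), and the `hadoop-client-api/3.3.4` entry in the old excerpt is an assumed value inferred from the 3.3.4 to 3.3.5 upgrade. Both excerpts are abbreviated and purely illustrative.

```python
def parse_manifest(text):
    """Map artifact name -> version for lines like `name/version//name-version.jar`."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, version = line.split("/")[:2]
        deps[name] = version
    return deps


def diff_manifests(old_text, new_text):
    """Return (added, removed, changed) between two manifest texts."""
    old, new = parse_manifest(old_text), parse_manifest(new_text)
    added = {k: new[k] for k in sorted(new.keys() - old.keys())}
    removed = {k: old[k] for k in sorted(old.keys() - new.keys())}
    changed = {
        k: (old[k], new[k])
        for k in sorted(old.keys() & new.keys())
        if old[k] != new[k]
    }
    return added, removed, changed


# Illustrative excerpts only, not the full manifests.
OLD = """\
hadoop-client-api/3.3.4//hadoop-client-api-3.3.4.jar
jaxb-api/2.2.11//jaxb-api-2.2.11.jar
javolution/5.5.1//javolution-5.5.1.jar
"""
NEW = """\
cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
hadoop-client-api/3.3.5//hadoop-client-api-3.3.5.jar
hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar
javolution/5.5.1//javolution-5.5.1.jar
"""

added, removed, changed = diff_manifests(OLD, NEW)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

With these excerpts, `cos_api-bundle` and `hadoop-cos` show up as added rather than changed, which matches the observation that they were absent under Hadoop 3.3.4.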
| hadoop-client-api/3.3.5//hadoop-client-api-3.3.5.jar | ||
| hadoop-client-runtime/3.3.5//hadoop-client-runtime-3.3.5.jar | ||
| hadoop-cloud-storage/3.3.5//hadoop-cloud-storage-3.3.5.jar | ||
| hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar |
Ditto. Is this added back?
mmm. provided it doesn't interfere with everyone else, then getting it means spark will work out the box with that storage.
I'm just worried about the HADOOP-18307 (Remove hadoop-cos as a dependency of hadoop-cloud-storage) situation.
@dongjoon-hyun Do we need any additional checks for this dependency?
I'm just curious whether an issue similar to the one described in https://issues.apache.org/jira/browse/HADOOP-18159 could happen again if we include hadoop-cos and cos_api-bundle in Spark's classpath. We actually just ran into this exact issue recently :)
It'd be nice if there is an easy way to make this optional.
|
the hadoop 3.3.5 release is now officially out. |
|
Ya, I saw the official Hadoop release and want to resume this, @steveloughran and @LuciferYang . :) |
|
Removed the ASF staging repository and re-tested |
|
All GA tasks passed with the official release
Shall we exclude the following dependencies on our side and let users add them if they need them?
cos_api-bundle/5.6.69//cos_api-bundle-5.6.69.jar
hadoop-cos/3.3.5//hadoop-cos-3.3.5.jar
exclude them from |
| javassist/3.25.0-GA//javassist-3.25.0-GA.jar | ||
| javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar | ||
| javolution/5.5.1//javolution-5.5.1.jar | ||
| jaxb-api/2.2.11//jaxb-api-2.2.11.jar |
Do we need to manually add back this dependency? It disappeared from hadoop-aliyun's dependency chain:
3.3.4
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.4:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.4:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |     +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] |  |     +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |     |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |     |  +- javax.xml.bind:jaxb-api:jar:2.2.11:compile
[INFO] |  |     |  +- org.ini4j:ini4j:jar:0.5.4:compile
[INFO] |  |     |  +- io.opentracing:opentracing-api:jar:0.33.0:compile
[INFO] |  |     |  \- io.opentracing:opentracing-util:jar:0.33.0:compile
[INFO] |  |     |     \- io.opentracing:opentracing-noop:jar:0.33.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-ram:jar:3.1.0:compile
[INFO] |  |     \- com.aliyun:aliyun-java-sdk-kms:jar:2.11.0:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.4:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile
3.3.5
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.5:compile
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:3.3.5:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.5:compile
[INFO] |  |  +- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |  |  +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] |  |  |  +- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |  |  |  +- org.ini4j:ini4j:jar:0.5.4:compile
[INFO] |  |  |  |  +- io.opentracing:opentracing-api:jar:0.33.0:compile
[INFO] |  |  |  |  \- io.opentracing:opentracing-util:jar:0.33.0:compile
[INFO] |  |  |  |     \- io.opentracing:opentracing-noop:jar:0.33.0:compile
[INFO] |  |  |  +- com.aliyun:aliyun-java-sdk-ram:jar:3.1.0:compile
[INFO] |  |  |  \- com.aliyun:aliyun-java-sdk-kms:jar:2.11.0:compile
[INFO] |  |  \- org.codehaus.jettison:jettison:jar:1.5.3:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.5:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile
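A quick way to confirm which chain pulls in (or drops) an artifact is to parse the two trees and compare the ancestor paths. This is a rough sketch under stated assumptions, not an existing tool: it assumes the usual `mvn dependency:tree` text layout with three characters of indentation per level, and the embedded trees are abbreviated from the output above (some siblings trimmed).

```python
import re


def parse_tree(text):
    """Map 'groupId:artifactId' -> list of ancestors (root first, itself last)."""
    paths = {}
    stack = []  # (depth, key) pairs along the current branch
    for line in text.splitlines():
        body = re.sub(r"^\[INFO\] ?", "", line)
        match = re.search(r"[A-Za-z0-9]", body)
        if not match:
            continue
        # Assumption: each nesting level adds three characters of prefix.
        depth = match.start() // 3
        group, artifact = body[match.start():].split(":")[:2]
        key = f"{group}:{artifact}"
        # Drop siblings and deeper nodes left over from the previous branch.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, key))
        paths[key] = [k for _, k in stack]
    return paths


# Abbreviated from the trees quoted above.
TREE_334 = r"""
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.4:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.4:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |     |  +- javax.xml.bind:jaxb-api:jar:2.2.11:compile
[INFO] |  |     |  \- org.ini4j:ini4j:jar:0.5.4:compile
"""
TREE_335 = r"""
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.5:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.5:compile
[INFO] |  |  +- com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO] |  |  |  \- com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO] |  |  |     \- org.ini4j:ini4j:jar:0.5.4:compile
[INFO] |  |  \- org.codehaus.jettison:jettison:jar:1.5.3:compile
"""

target = "javax.xml.bind:jaxb-api"
for label, tree in (("3.3.4", TREE_334), ("3.3.5", TREE_335)):
    path = parse_tree(tree).get(target)
    print(label, " -> ".join(path) if path else "not present")
```

On the abbreviated 3.3.4 tree this reports jaxb-api arriving via hadoop-aliyun, aliyun-sdk-oss and aliyun-java-sdk-core, and reports it as not present for 3.3.5, consistent with the question above.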
exclude jaxb-api from aliyun-sdk-oss
|
what version of jettison has come in from hadoop-common? HADOOP-18676 has gone in this weekend to exclude transitive jettison dependencies which don't get into a hadoop tarball, but will come in from pom imports. |
1.5.3 |
+1, LGTM for Apache Spark 3.5.0 from my side. Thank you, @LuciferYang .
|
Thanks @dongjoon-hyun @sunchao @steveloughran |
|
Ya, right. I forgot to say that. Thank you so much, @steveloughran and @sunchao too. 😄 |
This pr aims to upgrade Hadoop from 3.3.4 to 3.3.5.

Hadoop 3.3.5 brings many bug fixes as well as CVE fixes, such as
- HADOOP-18333 hadoop-client-runtime impact by CVE-2022-2047 CVE-2022-2048 due to shaded jetty
- HADOOP-18468: upgrade jettison json jar due to fix CVE-2022-40149
- HADOOP-18493 update jackson-databind 2.12.7.1 due to CVE fixes
- HADOOP-18497 Upgrade commons-text version to fix CVE-2022-42889
- HADOOP-18484 upgrade hsqldb to v2.7.1 due to CVE
- HADOOP-18561 CVE-2021-37533 on commons-net is included in hadoop common and hadoop-client-runtime
- HADOOP-18587 upgrade to jettison 1.5.3 to fix CVE-2022-40150

At the same time, this version brings a high performance vectored read API: HADOOP-18103, this may be used by future versions of `Orc` and `Parquet` to improve read performance.

The release notes and change log as follows:
- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html
- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html

Yes, `jaxb-api-2.2.11.jar` is no longer in `spark-deps-hadoop-3-hive-2.3` due to HADOOP-18641

Pass GitHub Actions

Closes apache#39124 from LuciferYang/test-hadoop-335.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to downgrade the Apache Hadoop dependency to 3.3.4 in `Apache Spark 3.5` in order to prevent any regression from `Apache Spark 3.4.x`. In other words, although `Apache Spark 3.5.x` will lose many bug fixes of Apache Hadoop 3.3.5 and 3.3.6, it will be in the same situation with `Apache Spark 3.4.x`.
- SPARK-44197 Upgrade Hadoop to 3.3.6 (#41744)
- SPARK-42913 Upgrade Hadoop to 3.3.5 (#39124)
- SPARK-43448 Remove dummy dependency `hadoop-openstack` (#41133)

On top of reverting SPARK-44197 and SPARK-42913, this PR has additional dependency exclusion change due to the following.
- SPARK-43880 Organize `hadoop-cloud` in standard maven project structure (#41380)

### Why are the changes needed?

There is a community report on S3A committer performance regression. Although it's one liner fix, there is no available Hadoop release with that fix at this time.
- HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer (apache/hadoop#5706)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #42345 from dongjoon-hyun/SPARK-44678.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This pr aims to upgrade Hadoop from 3.3.4 to 3.3.5.
Why are the changes needed?
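Hadoop 3.3.5 brings many bug fixes as well as CVE fixes, such as
- HADOOP-18333 hadoop-client-runtime impact by CVE-2022-2047 CVE-2022-2048 due to shaded jetty
- HADOOP-18468: upgrade jettison json jar due to fix CVE-2022-40149
- HADOOP-18493 update jackson-databind 2.12.7.1 due to CVE fixes
- HADOOP-18497 Upgrade commons-text version to fix CVE-2022-42889
- HADOOP-18484 upgrade hsqldb to v2.7.1 due to CVE
- HADOOP-18561 CVE-2021-37533 on commons-net is included in hadoop common and hadoop-client-runtime
- HADOOP-18587 upgrade to jettison 1.5.3 to fix CVE-2022-40150
At the same time, this version brings a high performance vectored read API: HADOOP-18103, this may be used by future versions of `Orc` and `Parquet` to improve read performance.
The release notes and change log are as follows:
- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html
- https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html
Does this PR introduce any user-facing change?
Yes, `jaxb-api-2.2.11.jar` is no longer in `spark-deps-hadoop-3-hive-2.3` due to HADOOP-18641
How was this patch tested?
Pass GitHub Actions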