[SPARK-36669][BUILD] Revert to non-shaded Hadoop client library #33913
Conversation
when the Hadoop profile is hadoop-2.7, because these are only available in 3.x. Note that,
as result we have to include the same hadoop-client dependency multiple times in hadoop-2.7.
-->
<hadoop-client-api.artifact>hadoop-client-api</hadoop-client-api.artifact>
This is open to hear the reviewers' opinions, so please don't add comments here yet.
|
Test build #142993 has finished for PR 33913 at commit
|
|
Can we have test coverage for your example, @viirya? |
|
Kubernetes integration test starting |
|
Yea, found this issue while adding codec tests in #33912. |
|
Kubernetes integration test status failure |
|
Well this is a bummer. We can't go back to 3.3.0 since it supports neither the shaded client nor the non-shaded client. The only option is 3.2.2. I think in theory we can use the non-shaded client for 3.3.1, but I haven't tried it. You may need to revert more PRs, for instance #33053. |
|
Hmm, yea, it looks like more trouble than I thought... I only made this change to test the codec tests (SQL). For the entire Spark build, it seems we'd need to revert more stuff. |
|
+1 for a new test case for the issue |
|
@gengliangwang if we run the lz4 codec test in #33912 against the current master branch, it throws the exception shown in the description. |
|
cc @bozhang2820 who's also interested in this FYI |
|
Moving to the Hadoop shaded client is a big improvement for Spark 3.2; how about implementing an |
|
Thanks @pan3793. I tried to add lz4 wrapper classes in #33912. Fortunately, only a few lz4 APIs are used internally by the Hadoop Lz4 codec, so the wrapper classes are simple. They pass the tests locally. Let me know what you think about this idea. @cloud-fan @sunchao @dongjoon-hyun Basically it sounds good, as we don't need to revert the shaded Hadoop client related changes. |
|
+1 with the workaround. @viirya does it mean that for snappy we will have two copies of snappy-java? One from Spark, and another in the shaded Hadoop lib? |
On the Hadoop side, snappy-java is not a provided but a compile dependency, so it is relocated and included in the shaded client libraries. Yes, Spark includes its own snappy-java, but they don't conflict since Hadoop relocates its copy, I think. |
lz4-java APIs are only used internally by Hadoop's Lz4Compressor and Lz4Decompressor, not by Lz4Codec itself. The added wrapper classes already implement all the lz4-java APIs used there, so Hadoop usage should be fine for both Parquet and sequence files. I will also run a test to verify it. Actually, maybe we also need to add some e2e tests for sequence files in Spark too. |
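For illustration, here is a minimal sketch of the delegating-wrapper idea discussed above. It assumes the relocated Hadoop codec classes look up lz4 classes under the org.apache.hadoop.shaded.net.jpountz.lz4 package; the class and method selection is hypothetical and trimmed, not the exact contents of #33912.

// Sketch only: classes placed at the package name the shaded Hadoop client
// expects after relocation, forwarding every call to the real lz4-java
// classes already on Spark's classpath. Shown in one file for brevity; in a
// real build each wrapper would be a public class in its own file.
package org.apache.hadoop.shaded.net.jpountz.lz4;

public final class LZ4Factory {

  private final net.jpountz.lz4.LZ4Factory delegate;

  private LZ4Factory(net.jpountz.lz4.LZ4Factory delegate) {
    this.delegate = delegate;
  }

  // Hadoop's relocated Lz4Compressor/Lz4Decompressor obtain the factory this way.
  public static LZ4Factory fastestInstance() {
    return new LZ4Factory(net.jpountz.lz4.LZ4Factory.fastestInstance());
  }

  public LZ4Compressor fastCompressor() {
    return new LZ4Compressor(delegate.fastCompressor());
  }
}

// Wrapper type handed out by the factory above; delegates to the real
// lz4-java compressor.
final class LZ4Compressor {

  private final net.jpountz.lz4.LZ4Compressor delegate;

  LZ4Compressor(net.jpountz.lz4.LZ4Compressor delegate) {
    this.delegate = delegate;
  }

  // Hadoop's Lz4Compressor compresses ByteBuffer to ByteBuffer.
  public void compress(java.nio.ByteBuffer src, java.nio.ByteBuffer dest) {
    delegate.compress(src, dest);
  }
}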
|
+1 on adding the wrapper as a workaround |
|
Does this help for snappy-java?
https://github.com/xerial/snappy-java/releases/tag/1.1.8.2 |
I think it doesn't. Even though the classes are relocated, the snappy library can still find and load the native library. But when JNI resolves native methods, it cannot resolve the declared ones, because relocation doesn't apply to native method symbols. BTW, Hadoop 3.3.1 already uses snappy-java 1.1.8.2. |
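To make the JNI point concrete: the JVM derives the native symbol name from the fully-qualified Java class name, so renaming the class via shading changes the symbol the JVM looks up. The class and method below are illustrative, not the actual snappy-java internals.

// Illustrative only: how JNI binds a native method by symbol name.
package org.xerial.snappy;

public class SnappyNative {
  // The JVM resolves this against the symbol
  //   Java_org_xerial_snappy_SnappyNative_rawCompressLen
  // exported by the loaded native library (method name is hypothetical).
  public native int rawCompressLen(long inAddr, int inLen, long outAddr);
}

// After shading relocates the class to
// org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative, the JVM instead
// looks for
//   Java_org_apache_hadoop_shaded_org_xerial_snappy_SnappyNative_rawCompressLen,
// which the prebuilt native library does not export, so the call fails with
// UnsatisfiedLinkError.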
|
What if force set |
What changes were proposed in this pull request?
This patch proposes to use non-shaded Hadoop client libraries.
Why are the changes needed?
Currently we use Hadoop 3.3.1's shaded client libraries. Lz4 is a provided dependency in Hadoop Common 3.3.1 for Lz4Codec, but it isn't excluded from relocation in these libraries. So to use lz4 as a Parquet codec, we hit the exception even if we include lz4 as a dependency. I already submitted a PR (HADOOP-17891) to Hadoop to fix it. Until that is released, on the Spark side we can either downgrade to 3.3.0 or revert to the non-shaded Hadoop client library.
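A minimal sketch of how to reproduce the failure described above (the app name and output path are illustrative): writing Parquet with the lz4 codec routes through Hadoop's Lz4Codec, which fails under the shaded client.

import org.apache.spark.sql.SparkSession;

// Hypothetical repro sketch: with the shaded Hadoop client on master at the
// time of this PR, the write below throws because the relocated Lz4Compressor
// cannot find the (relocated) lz4-java classes.
public class Lz4ParquetRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("lz4-parquet-repro")
        .getOrCreate();

    spark.range(10)
        .write()
        .option("compression", "lz4")   // selects Hadoop's Lz4Codec via Parquet
        .parquet("/tmp/lz4-parquet-repro");

    spark.stop();
  }
}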
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually tested.