
Conversation

@viirya
Member

@viirya viirya commented Sep 4, 2021

What changes were proposed in this pull request?

This patch proposes to use non-shaded Hadoop client libraries.

Why are the changes needed?

Currently we use Hadoop 3.3.1's shaded client libraries. lz4 is a provided dependency in Hadoop Common 3.3.1, used by Lz4Codec, but it isn't excluded from relocation in these libraries. So when using lz4 as the Parquet codec, we hit the exception below even if we include lz4 as a dependency.

[info]   Cause: java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/net/jpountz/lz4/LZ4Factory                                                                                            
[info]   at org.apache.hadoop.io.compress.lz4.Lz4Compressor.<init>(Lz4Compressor.java:66)
[info]   at org.apache.hadoop.io.compress.Lz4Codec.createCompressor(Lz4Codec.java:119)                                                                                                         
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152)                                                                                                          
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)                                                                                                          
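For reference, the failure can be hit with something like the following (a minimal sketch; the class name and output path are made up for illustration, and it assumes a local master):

import org.apache.spark.sql.SparkSession;

public class Lz4ParquetRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[1]").getOrCreate();
    // Writing Parquet with the lz4 codec triggers the NoClassDefFoundError
    // above when the shaded Hadoop 3.3.1 client libraries are on the classpath.
    spark.range(10).write().option("compression", "lz4").parquet("/tmp/lz4-parquet-repro");
    spark.stop();
  }
}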

I already submitted a PR (HADOOP-17891) to Hadoop to fix it. Before that fix is released, on the Spark side we can either downgrade to 3.3.0 or revert to the non-shaded Hadoop client libraries.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested.

@github-actions github-actions bot added the BUILD label Sep 4, 2021
@viirya viirya changed the title [SPARK-36669][SQL] Revert to non-shaded Hadoop client library [SPARK-36669][SQL][BUILD] Revert to non-shaded Hadoop client library Sep 4, 2021
@viirya viirya changed the title [SPARK-36669][SQL][BUILD] Revert to non-shaded Hadoop client library [SPARK-36669][BUILD] Revert to non-shaded Hadoop client library Sep 4, 2021
@viirya
Member Author

viirya commented Sep 4, 2021

when the Hadoop profile is hadoop-2.7, because these are only available in 3.x. Note that,
as result we have to include the same hadoop-client dependency multiple times in hadoop-2.7.
-->
<hadoop-client-api.artifact>hadoop-client-api</hadoop-client-api.artifact>
Member Author


This is open for feedback from the reviewers, so I haven't added a comment here yet.

@SparkQA

SparkQA commented Sep 4, 2021

Test build #142993 has finished for PR 33913 at commit 117a07c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Can we have test coverage for your example, @viirya?

@SparkQA

SparkQA commented Sep 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47495/

@viirya
Member Author

viirya commented Sep 4, 2021

Yeah, I found this issue while adding codec tests in #33912.

@SparkQA

SparkQA commented Sep 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47495/

@sunchao
Member

sunchao commented Sep 5, 2021

Well, this is a bummer. We can't go back to 3.3.0 since it supports neither the shaded client nor the non-shaded client. The only option is 3.2.2.

I think in theory we can use the non-shaded client for 3.3.1, but I haven't tried it. You may need to revert more PRs, for instance #33053.

@viirya
Member Author

viirya commented Sep 5, 2021

Hmm, yeah, it looks like more trouble than I thought... I only made this change to run the codec tests (SQL). For the whole of Spark, it seems more would need to be reverted.

@gengliangwang
Member

+1 for a new test case for the issue

@viirya
Member Author

viirya commented Sep 5, 2021

@gengliangwang if we run the lz4 codec test from #33912 against the current master branch, it throws the exception shown in the description.

@HyukjinKwon
Member

cc @bozhang2820, who's also interested in this, FYI

@pan3793
Member

pan3793 commented Sep 6, 2021

Moving to the Hadoop shaded client is a big improvement for Spark 3.2. How about implementing an org.apache.hadoop.shaded.net.jpountz.lz4.LZ4Factory that delegates to net.jpountz.lz4.LZ4Factory as a workaround?
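A minimal sketch of that delegate idea (hypothetical; the actual wrapper classes were added in #33912, and only the entry points that Hadoop's Lz4Compressor/Lz4Decompressor call need mirroring):

// Lives at the shaded name Hadoop's codec classes expect, but forwards to
// the unshaded lz4-java that Spark already ships. LZ4Compressor and
// LZ4SafeDecompressor wrappers (not shown) get the same treatment in their
// own files.
package org.apache.hadoop.shaded.net.jpountz.lz4;

public final class LZ4Factory {

  private final net.jpountz.lz4.LZ4Factory delegate;

  private LZ4Factory(net.jpountz.lz4.LZ4Factory delegate) {
    this.delegate = delegate;
  }

  // Hadoop's Lz4Compressor/Lz4Decompressor obtain the factory this way.
  public static LZ4Factory fastestInstance() {
    return new LZ4Factory(net.jpountz.lz4.LZ4Factory.fastestInstance());
  }

  public LZ4Compressor fastCompressor() {
    return new LZ4Compressor(delegate.fastCompressor());
  }

  public LZ4Compressor highCompressor() {
    return new LZ4Compressor(delegate.highCompressor());
  }

  public LZ4SafeDecompressor safeDecompressor() {
    return new LZ4SafeDecompressor(delegate.safeDecompressor());
  }
}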

@viirya
Member Author

viirya commented Sep 6, 2021

Thanks @pan3793.

I tried adding lz4 wrapper classes in #33912. Fortunately, only a few lz4-java APIs are used internally by the Hadoop lz4 codec, so the wrapper classes are simple.

It passes the tests locally. Let me know what you think about this idea. @cloud-fan @sunchao @dongjoon-hyun

Basically it sounds good, as we don't need to revert the shaded Hadoop client related changes.

@dbtsai
Member

dbtsai commented Sep 6, 2021

+1 with the workaround.

@viirya does it mean that, for snappy, we will have two copies of snappy-java? One from Spark, and another in the shaded Hadoop lib?

@dbtsai
Member

dbtsai commented Sep 6, 2021

> Thanks @pan3793.
>
> I tried adding lz4 wrapper classes in #33912. Fortunately, only a few lz4-java APIs are used internally by the Hadoop lz4 codec, so the wrapper classes are simple.

It works for Parquet, but does it work for lz4-compressed Hadoop sequence files, which need full Lz4Codec support?

Thanks,

@viirya
Member Author

viirya commented Sep 6, 2021

> +1 with the workaround.
>
> @viirya does it mean that, for snappy, we will have two copies of snappy-java? One from Spark, and another in the shaded Hadoop lib?

On the Hadoop side, snappy-java is not a provided but a compile dependency, so it is relocated and included in the shaded client libraries. Spark includes its own snappy-java, yes. But they don't conflict, I think, since Hadoop relocates its copy.
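As a quick sanity check, the two copies live under different fully-qualified names, so they can coexist on one classpath (a hypothetical sketch; it assumes both Spark's snappy-java and hadoop-client-runtime are present):

public class SnappyCoexistCheck {
  public static void main(String[] args) throws ClassNotFoundException {
    // Spark's unshaded copy:
    Class.forName("org.xerial.snappy.Snappy");
    // Hadoop's relocated copy inside hadoop-client-runtime:
    Class.forName("org.apache.hadoop.shaded.org.xerial.snappy.Snappy");
    System.out.println("Both copies load without conflict.");
  }
}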

@viirya
Member Author

viirya commented Sep 6, 2021

> It works for Parquet, but does it work for lz4-compressed Hadoop sequence files, which need full Lz4Codec support?
>
> Thanks,

lz4-java APIs are used only internally in Hadoop's Lz4Compressor and Lz4Decompressor, not by Lz4Codec directly. The added wrapper classes already implement all the lz4-java APIs used there, so Hadoop usage should be fine for both Parquet and sequence files. I will also run a test to verify it.

Actually, maybe we also need to add some e2e tests for sequence files in Spark too, along the lines of the sketch below.
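(A hypothetical sketch; the class name and output path are made up, and it assumes a local master.)

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class Lz4SeqFileCheck {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setMaster("local[1]").setAppName("lz4-seq-check"));
    String path = "/tmp/lz4-seq-check";
    // Round-trip an lz4-compressed sequence file; with the broken relocation
    // this fails the same way as the Parquet path in the description.
    jsc.parallelize(Arrays.asList("a", "bb", "ccc"))
       .mapToPair(s -> new Tuple2<>(new Text(s), new IntWritable(s.length())))
       .saveAsHadoopFile(path, Text.class, IntWritable.class,
           SequenceFileOutputFormat.class, Lz4Codec.class);
    System.out.println(jsc.sequenceFile(path, Text.class, IntWritable.class).count());
    jsc.stop();
  }
}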

@sunchao
Member

sunchao commented Sep 6, 2021

+1 on adding the wrapper as a workaround

@pan3793
Member

pan3793 commented Sep 8, 2021

Does this help for snappy-java?

> Fixed the pure-java Snappy fallback logic when no native library for your platform is found.

https://github.com/xerial/snappy-java/releases/tag/1.1.8.2

@viirya
Member Author

viirya commented Sep 8, 2021

> Does this help for snappy-java?
>
> Fixed the pure-java Snappy fallback logic when no native library for your platform is found.
>
> https://github.com/xerial/snappy-java/releases/tag/1.1.8.2

I think it doesn't. Actually, the relocated snappy classes can still find and load the native library. But when JNI resolves a native method, it looks up a symbol mangled from the class's package name (e.g. Java_org_xerial_snappy_SnappyNative_<method>), and relocation rewrites the Java packages without touching the symbols exported by the native library, so the defined native methods cannot be resolved.

BTW, Hadoop 3.3.1 already uses snappy-java 1.1.8.2.

@pan3793
Member

pan3793 commented Sep 8, 2021

What if we force-set org.apache.hadoop.shaded.org.xerial.snappy.purejava = true? The loading logic in the relocated snappy-java falls back to the pure-Java implementation:

try {
    if (Boolean.parseBoolean(System.getProperty("org.apache.hadoop.shaded.org.xerial.snappy.purejava", "false"))) {
        setSnappyApi(new PureJavaSnappy());
    } else {
        loadNativeLibrary();
        setSnappyApi(new SnappyNative());
    }
} catch (Throwable var1) {
    setSnappyApi(new PureJavaSnappy());
}
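In code, the suggestion amounts to setting the property before the codec is first used, something like this (a sketch; it has to run before the relocated SnappyLoader initializes):

// Force the relocated snappy-java onto its pure-Java fallback (hypothetical usage).
System.setProperty("org.apache.hadoop.shaded.org.xerial.snappy.purejava", "true");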

@viirya viirya closed this Sep 10, 2021
@viirya viirya deleted the SPARK-36669 branch December 27, 2023 18:25