
Conversation

@viirya (Member) commented Sep 4, 2021

What changes were proposed in this pull request?

This patch adds end-to-end (e2e) test cases for the compression codecs used by the main datasources.

Why are the changes needed?

There are currently no e2e test cases for the main datasources such as Parquet and ORC, which makes it harder for developers to catch possible bugs early. We should add such tests to Spark.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

viirya (Member Author):

Found this issue while adding these tests. Created SPARK-36669 for it.

viirya (Member Author):

See #33913 too.



SparkQA commented Sep 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47496/


SparkQA commented Sep 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47496/



SparkQA commented Sep 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47498/


SparkQA commented Sep 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47498/


SparkQA commented Sep 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47499/


SparkQA commented Sep 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47499/


SparkQA commented Sep 5, 2021

Test build #142997 has finished for PR 33912 at commit b364990.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member Author) commented Sep 5, 2021

@dongjoon-hyun (Member): Thank you, @viirya.

override def dataSourceName: String = "orc"
override val codecConfigName = SQLConf.ORC_COMPRESSION.key
override protected def availableCodecs = Seq("none", "uncompressed", "snappy",
"zlib", "zstd", "lz4", "lzo")
Member:

To reviewers: as you can see here, Apache ORC has no issue because it uses the Aircompressor LZ4 codec (AircompressorCodec).

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-36670][SQL][TEST] Add end-to-end codec test cases for main datasources [SPARK-36670][SQL][TEST] Add end-to-end codec test cases for ORC/Parquet datasources Sep 5, 2021
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

class OrcCodecTestSuite extends DataSourceCodecTest with SharedSparkSession {
Member:

nit, OrcCodecTestSuite -> OrcCodecSuite?

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

class ParquetCodecTestSuite extends DataSourceCodecTest with SharedSparkSession {
Member:

nit, ParquetCodecTestSuite -> ParquetCodecSuite

}
}

testWithAllCodecs("write and read - single partition") {
Member:

This test case seems to be included in write and read. Do we need this test case separately?

viirya (Member Author):

Only the partition number is different.

Member:

Yeah, it looks like that. In that case, there is no difference in terms of test coverage.

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SQLTestUtils

abstract class DataSourceCodecTest extends QueryTest with SQLTestUtils {
Member:

If we are going to test only file-based data sources, we can make a single simple suite like FileBasedDataSourceSuite.

Member:

Maybe, FileBasedDataSourceCodecSuite?

  private val allFileBasedDataSources = Seq("orc", "parquet", ...)

  allFileBasedDataSources.foreach { format =>
    test(s"... - $format") {

viirya (Member Author):

OK, sounds good. Let me refactor it.

@cloud-fan (Contributor) commented Sep 6, 2021:

Actually, this code style is a bit hard to extend (e.g., how do we test Avro?), and I was planning to refactor the existing test suites as well.

I think a better solution is

trait FileSourceCodecSuite ... {
  def format: String
  ...
}

class ParquetCodecSuite extends FileSourceCodecSuite

class OrcCodecSuite ...
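Filled out, the proposed layout might look like the sketch below. This is a hedged illustration only: member names follow the diff snippets quoted in this thread (`codecConfigName`, `availableCodecs`, `testWithAllCodecs`), while the shared test body and the Parquet codec list are my assumptions, not the merged file. It needs Spark's test classpath (QueryTest, SharedSparkSession) to compile.

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

// Sketch of the proposed structure: one trait holding the shared tests,
// one small subclass per file source.
trait FileSourceCodecSuite extends QueryTest with SharedSparkSession {
  def format: String
  def codecConfigName: String
  protected def availableCodecs: Seq[String]

  // Registers the same test body once per codec, switching the codec
  // through the datasource's SQLConf key.
  def testWithAllCodecs(name: String)(f: => Unit): Unit = {
    for (codec <- availableCodecs) {
      test(s"$name - file source $format - codec: $codec") {
        withSQLConf(codecConfigName -> codec)(f)
      }
    }
  }

  testWithAllCodecs("write and read") {
    withTempPath { dir =>
      val df = spark.range(10).toDF("id")
      df.write.format(format).save(dir.getCanonicalPath)
      checkAnswer(spark.read.format(format).load(dir.getCanonicalPath), df)
    }
  }
}

class ParquetCodecSuite extends FileSourceCodecSuite {
  override def format: String = "parquet"
  override def codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
  override protected def availableCodecs: Seq[String] =
    Seq("none", "uncompressed", "snappy", "gzip", "zstd")
}

class OrcCodecSuite extends FileSourceCodecSuite {
  override def format: String = "orc"
  override def codecConfigName: String = SQLConf.ORC_COMPRESSION.key
  override protected def availableCodecs: Seq[String] =
    Seq("none", "uncompressed", "snappy", "zlib", "zstd", "lz4", "lzo")
}
```

Adding a new source (e.g. Avro) then only requires one more small subclass, which addresses the extensibility concern raised above.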

viirya (Member Author):

@cloud-fan's idea is closer to what I had in mind at the beginning. @dongjoon-hyun, is that good for you too?

Contributor:

Note: I think these test suites can be in one file if possible.

Member:

Yes, @viirya and @cloud-fan , +1 for the proposed structure in a single file.

@github-actions github-actions bot added the CORE label Sep 6, 2021

SparkQA commented Sep 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47517/


SparkQA commented Sep 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47517/

/**
* A temporary workaround for SPARK-36669. We should remove this after the Hadoop 3.3.2
* release, which fixes the LZ4 relocation in the shaded Hadoop client libraries. This does
* not need to implement the entire net.jpountz.lz4.LZ4Compressor API, only the methods used
* by Hadoop's Lz4Compressor.
Contributor:

Now this is not a test-only PR. Can we update the PR title and description accordingly?

viirya (Member Author):

Yeah, let me update it.

Member:

Thanks!

@viirya viirya changed the title [SPARK-36670][SQL][TEST] Add end-to-end codec test cases for ORC/Parquet datasources [SPARK-36670][SQL] Add end-to-end codec test cases for ORC/Parquet datasources and LZ4 hadoop wrapper Sep 6, 2021

SparkQA commented Sep 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47539/


SparkQA commented Sep 7, 2021

Test build #143033 has finished for PR 33912 at commit 0029f33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcCodecSuite extends FileSourceCodecSuite with SharedSparkSession

@dbtsai (Member) commented Sep 7, 2021

Could we add a test for Hadoop sequence files using sc.sequenceFile(...)? There are still many legacy applications using Hadoop sequence files, and we want to ensure they work.

We might want to exclude snappy from relocation in Hadoop as well. The relocation only relocates the Java classes; the native JNI interfaces are not relocated. So suppose we end up with two different versions of snappy-java: the relocated one provided by Hadoop and the one provided by Spark. If snappy-java decides to change its native C interfaces, then, since those native methods cannot be relocated, loading the native methods will hit an incompatibility. If both copies are non-relocated, dependency resolution will ensure we include only one version of snappy-java, avoiding the potential incompatibility from the native interface, which technically cannot be relocated.

I remember @dongjoon-hyun saw this issue when he worked on the zstd-jni in Spark and Iceberg.
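The reason relocation breaks native methods can be seen from how JNI resolves them: the JVM derives the expected C symbol from the fully qualified class name, so shading the class changes the symbol while the native library keeps exporting the old one. A minimal sketch of the mechanism (the class and method names are snappy-java's real ones; the mangling below omits JNI's escaping rules for `_` and non-ASCII characters, and `org.apache.hadoop.shaded` is assumed as the relocation prefix):

```scala
// Demonstrates why a shaded Java class can no longer bind its native methods:
// JNI maps a native method to a C symbol roughly as
//   "Java_" + fullyQualifiedClassName.replace('.', '_') + "_" + methodName.
// (Real JNI mangling also escapes '_' and unicode; omitted for clarity.)
object JniSymbolDemo {
  def jniSymbol(fqcn: String, method: String): String =
    "Java_" + fqcn.replace('.', '_') + "_" + method

  def main(args: Array[String]): Unit = {
    // Symbol the unrelocated snappy-java native library actually exports:
    println(jniSymbol("org.xerial.snappy.SnappyNative", "rawCompress"))
    // Symbol the JVM looks for after shading relocates the class; the
    // native library does not export it, so binding fails with
    // UnsatisfiedLinkError:
    println(jniSymbol(
      "org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative", "rawCompress"))
  }
}
```

Dependency resolution can pick one version of the Java classes, but there is no equivalent mechanism for the symbols baked into the native library, which is why relocation of JNI-backed libraries is problematic regardless of version compatibility.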


SparkQA commented Sep 7, 2021

Test build #143036 has finished for PR 33912 at commit b6f20cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya (Member Author) commented Sep 7, 2021

Could we add a test for hadoop seq files using sc.sequenceFile(...)? There are still many legacy applications using hadoop seq files, and we want to ensure it works.

We might want to exclude the relocation of snappy in Hadoop as well.

Let me add tests for Hadoop sequence files in a different PR.

For snappy-java, that's a good point. It looks like we also need to exclude it from relocation in Hadoop.

viirya (Member Author) commented Sep 7, 2021

@dbtsai Added Hadoop sequence file tests in #33924.

viirya (Member Author) commented Sep 7, 2021

We might want to exclude the relocation of snappy in Hadoop as well. [...]

It's not only a compatibility issue. A native library actually shouldn't be relocated at all, because of JNI method resolution: the relocated snappy-java cannot resolve the native methods in SnappyCodec. Created another blocker, SPARK-36681, for the issue. We definitely should exclude snappy-java from relocation in Hadoop; I'm excluding it together with lz4-java in the same Hadoop PR.

Unlike the lz4-java issue, where we can add wrappers as a workaround, snappy-java is both relocated and included in the client libraries, so it seems to me that this kind of workaround cannot work.

viirya (Member Author) commented Sep 7, 2021

As the workaround cannot work for snappy-java, it makes less sense to add the wrapper classes here. I will remove them and keep only the codecs that currently work. We will deal with the lz4-java and snappy-java issues in Hadoop, in separate JIRAs.

@dongjoon-hyun (Member): Thank you for the update, @viirya.

@viirya viirya changed the title [SPARK-36670][SPARK-36669][CORE][SQL] Add LZ4 hadoop wrapper and FileSourceCodecSuite [SPARK-36670][SQL][TEST] Add FileSourceCodecSuite Sep 7, 2021
@dbtsai (Member) commented Sep 7, 2021

LGTM. @viirya do we have a separate PR for compressed hadoop seq files? Thanks.

pom.xml Outdated
<enabled>false</enabled>
</snapshots>
</repository>

Member:

nit, remove empty line?

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}

trait FileSourceCodecSuite extends QueryTest with SQLTestUtils {
Member:

Nit: let's make it

trait FileSourceCodecSuite extends QueryTest with SharedSparkSession 

here so that ParquetCodecSuite doesn't need to extend SharedSparkSession

viirya (Member Author) commented Sep 7, 2021

LGTM. @viirya do we have a separate PR for compressed hadoop seq files? Thanks.

Yea, the PR is #33924. Thanks.


SparkQA commented Sep 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47558/


SparkQA commented Sep 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47558/



SparkQA commented Sep 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47562/


SparkQA commented Sep 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47562/


SparkQA commented Sep 7, 2021

Test build #143059 has finished for PR 33912 at commit 16e7db9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait FileSourceCodecSuite extends QueryTest with SQLTestUtils with SharedSparkSession
  • class ParquetCodecSuite extends FileSourceCodecSuite
  • class OrcCodecSuite extends FileSourceCodecSuite

viirya (Member Author) commented Sep 7, 2021

Thanks for reviewing! Merging to master/3.2.

@viirya viirya closed this in 5a0ae69 Sep 7, 2021
viirya added a commit that referenced this pull request Sep 7, 2021

Closes #33912 from viirya/SPARK-36670.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 5a0ae69)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
@viirya viirya deleted the SPARK-36670 branch September 7, 2021 23:53
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021

Closes apache#33912 from viirya/SPARK-36670.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 5a0ae69)
Signed-off-by: Liang-Chi Hsieh <[email protected]>