
Conversation

@viirya (Member) commented Sep 4, 2021

What changes were proposed in this pull request?

This patch adds end-to-end (e2e) test cases for the compression codecs used by the main datasources.

Why are the changes needed?

There are currently no e2e test cases for the main datasources such as Parquet and ORC, which makes it harder for developers to catch possible bugs early. We should add such tests to Spark.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

viirya (Member Author):

Found this issue while adding these tests. Created SPARK-36669 for it.

viirya (Member Author):

See #33913 too.



SparkQA commented Sep 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47496/


SparkQA commented Sep 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47496/



SparkQA commented Sep 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47498/


SparkQA commented Sep 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47498/


SparkQA commented Sep 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47499/


SparkQA commented Sep 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47499/


SparkQA commented Sep 5, 2021

Test build #142997 has finished for PR 33912 at commit b364990.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member Author) commented Sep 5, 2021

@dongjoon-hyun (Member): Thank you, @viirya.

override def dataSourceName: String = "orc"
override val codecConfigName = SQLConf.ORC_COMPRESSION.key
override protected def availableCodecs = Seq("none", "uncompressed", "snappy",
"zlib", "zstd", "lz4", "lzo")
Member:

To reviewers: as you can see here, Apache ORC has no issue because it uses the Aircompressor LZ4 codec (AircompressorCodec).

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-36670][SQL][TEST] Add end-to-end codec test cases for main datasources [SPARK-36670][SQL][TEST] Add end-to-end codec test cases for ORC/Parquet datasources Sep 5, 2021
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

class OrcCodecTestSuite extends DataSourceCodecTest with SharedSparkSession {
Member:

nit, OrcCodecTestSuite -> OrcCodecSuite?

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

class ParquetCodecTestSuite extends DataSourceCodecTest with SharedSparkSession {
Member:

nit, ParquetCodecTestSuite -> ParquetCodecSuite

}
}

testWithAllCodecs("write and read - single partition") {
Member:

This test case seems to be included in write and read. Do we need this test case separately?

viirya (Member Author):

Only the partition number is different.

Member:

Yeah, it looks like that. In that case, there is no difference in terms of test coverage.

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SQLTestUtils

abstract class DataSourceCodecTest extends QueryTest with SQLTestUtils {
Member:

If we are going to test only file-based data sources, we can make a single simple suite like FileBasedDataSourceSuite.

Member:

Maybe, FileBasedDataSourceCodecSuite?

  private val allFileBasedDataSources = Seq("orc", "parquet", ...)

  allFileBasedDataSources.foreach { format =>
    test(s"... - $format") {

viirya (Member Author):

OK, sounds good. Let me refactor it.

@cloud-fan (Contributor) commented Sep 6, 2021:

Actually, this code style is a bit hard to extend (e.g., how do we test Avro?), and I was planning to refactor the existing test suites as well.

I think a better solution is

trait FileSourceCodecSuite ... {
  def format: String
  ...
}

class ParquetCodecSuite extends FileSourceCodecSuite

class OrcCodecSuite ...
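Filled out, the proposed layout might look like the sketch below. This is a hedged illustration only: member names follow the diff snippets quoted in this thread (`codecConfigName`, `availableCodecs`, `testWithAllCodecs`), while the shared test body and the Parquet codec list are my assumptions, not the merged file. It needs Spark's test classpath (QueryTest, SharedSparkSession) to compile.

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

// Sketch of the proposed structure: one trait holding the shared tests,
// one small subclass per file source.
trait FileSourceCodecSuite extends QueryTest with SharedSparkSession {
  def format: String
  def codecConfigName: String
  protected def availableCodecs: Seq[String]

  // Registers the same test body once per codec, switching the codec
  // through the datasource's SQLConf key.
  def testWithAllCodecs(name: String)(f: => Unit): Unit = {
    for (codec <- availableCodecs) {
      test(s"$name - file source $format - codec: $codec") {
        withSQLConf(codecConfigName -> codec)(f)
      }
    }
  }

  testWithAllCodecs("write and read") {
    withTempPath { dir =>
      val df = spark.range(10).toDF("id")
      df.write.format(format).save(dir.getCanonicalPath)
      checkAnswer(spark.read.format(format).load(dir.getCanonicalPath), df)
    }
  }
}

class ParquetCodecSuite extends FileSourceCodecSuite {
  override def format: String = "parquet"
  override def codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
  override protected def availableCodecs: Seq[String] =
    Seq("none", "uncompressed", "snappy", "gzip", "zstd")
}

class OrcCodecSuite extends FileSourceCodecSuite {
  override def format: String = "orc"
  override def codecConfigName: String = SQLConf.ORC_COMPRESSION.key
  override protected def availableCodecs: Seq[String] =
    Seq("none", "uncompressed", "snappy", "zlib", "zstd", "lz4", "lzo")
}
```

Adding a new source (e.g. Avro) then only requires one more small subclass, which addresses the extensibility concern raised above.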

viirya (Member Author):

@cloud-fan's idea is closer to what I had in mind at the beginning. @dongjoon-hyun, is that good for you too?

Contributor:

Note: I think these test suites can be in one file if possible.

Member:

Yes, @viirya and @cloud-fan , +1 for the proposed structure in a single file.

@github-actions github-actions bot added the CORE label Sep 6, 2021

SparkQA commented Sep 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47517/


SparkQA commented Sep 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47517/

/**
* A temporary workaround for SPARK-36669. We should remove this after the Hadoop 3.3.2
* release, which fixes the LZ4 relocation in the shaded Hadoop client libraries. This does
* not need to implement the entire net.jpountz.lz4.LZ4Compressor API, only the methods used
* by Hadoop's Lz4Compressor.
Contributor:

Now this is not a test-only PR. Can we update the PR title and description accordingly?

viirya (Member Author):

Yeah, let me update it.

Member:

Thanks!

@viirya viirya changed the title [SPARK-36670][SQL][TEST] Add end-to-end codec test cases for ORC/Parquet datasources [SPARK-36670][SQL] Add end-to-end codec test cases for ORC/Parquet datasources and LZ4 hadoop wrapper Sep 6, 2021

SparkQA commented Sep 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47539/


SparkQA commented Sep 7, 2021

Test build #143033 has finished for PR 33912 at commit 0029f33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcCodecSuite extends FileSourceCodecSuite with SharedSparkSession

@dbtsai (Member) commented Sep 7, 2021

Could we add a test for Hadoop sequence files using sc.sequenceFile(...)? There are still many legacy applications using Hadoop sequence files, and we want to ensure they work.

We might want to exclude snappy from relocation in Hadoop as well. The relocation only relocates the Java classes; the native JNI interfaces are not relocated. So suppose we end up with two different versions of snappy-java: the relocated one provided by Hadoop and the one provided by Spark. If snappy-java decides to change its native C interfaces, then, since those native methods cannot be relocated, loading the native methods will hit an incompatibility. If both copies are non-relocated, dependency resolution will ensure we include only one version of snappy-java, avoiding the potential incompatibility from the native interface, which technically cannot be relocated.

I remember @dongjoon-hyun saw this issue when he worked on the zstd-jni in Spark and Iceberg.
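The reason relocation breaks native methods can be seen from how JNI resolves them: the JVM derives the expected C symbol from the fully qualified class name, so shading the class changes the symbol while the native library keeps exporting the old one. A minimal sketch of the mechanism (the class and method names are snappy-java's real ones; the mangling below omits JNI's escaping rules for `_` and non-ASCII characters, and `org.apache.hadoop.shaded` is assumed as the relocation prefix):

```scala
// Demonstrates why a shaded Java class can no longer bind its native methods:
// JNI maps a native method to a C symbol roughly as
//   "Java_" + fullyQualifiedClassName.replace('.', '_') + "_" + methodName.
// (Real JNI mangling also escapes '_' and unicode; omitted for clarity.)
object JniSymbolDemo {
  def jniSymbol(fqcn: String, method: String): String =
    "Java_" + fqcn.replace('.', '_') + "_" + method

  def main(args: Array[String]): Unit = {
    // Symbol the unrelocated snappy-java native library actually exports:
    println(jniSymbol("org.xerial.snappy.SnappyNative", "rawCompress"))
    // Symbol the JVM looks for after shading relocates the class; the
    // native library does not export it, so binding fails with
    // UnsatisfiedLinkError:
    println(jniSymbol(
      "org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative", "rawCompress"))
  }
}
```

Dependency resolution can pick one version of the Java classes, but there is no equivalent mechanism for the symbols baked into the native library, which is why relocation of JNI-backed libraries is problematic regardless of version compatibility.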


SparkQA commented Sep 7, 2021

Test build #143036 has finished for PR 33912 at commit b6f20cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya (Member Author) commented Sep 7, 2021

Could we add a test for hadoop seq files using sc.sequenceFile(...)? There are still many legacy applications using hadoop seq files, and we want to ensure it works.

We might want to exclude the relocation of snappy in Hadoop as well.

Let me add tests for Hadoop sequence files in a different PR.

For snappy-java, that's a good point. It looks like we also need to exclude it from relocation in Hadoop.

viirya (Member Author) commented Sep 7, 2021

@dbtsai Added Hadoop sequence file tests in #33924.

viirya (Member Author) commented Sep 7, 2021

We might want to exclude the relocation of snappy in Hadoop as well. [...]

It's not only a compatibility issue. A native library actually shouldn't be relocated at all, because of JNI method resolution: the relocated snappy-java cannot resolve the native methods in SnappyCodec. Created another blocker, SPARK-36681, for the issue. We definitely should exclude snappy-java from relocation in Hadoop; I'm excluding it together with lz4-java in the same Hadoop PR.

Unlike the lz4-java issue, where we can add wrappers as a workaround, snappy-java is both relocated and included in the client libraries, so it seems to me that this kind of workaround cannot work.

viirya (Member Author) commented Sep 7, 2021

As the workaround cannot work for snappy-java, it makes less sense to add the wrapper classes here. I will remove them and keep only the codecs that currently work. We will deal with the lz4-java and snappy-java issues in Hadoop, in separate JIRAs.

@dongjoon-hyun (Member): Thank you for the update, @viirya.

@viirya viirya changed the title [SPARK-36670][SPARK-36669][CORE][SQL] Add LZ4 hadoop wrapper and FileSourceCodecSuite [SPARK-36670][SQL][TEST] Add FileSourceCodecSuite Sep 7, 2021
@dbtsai (Member) commented Sep 7, 2021

LGTM. @viirya do we have a separate PR for compressed hadoop seq files? Thanks.

pom.xml Outdated
<enabled>false</enabled>
</snapshots>
</repository>

Member:

nit, remove empty line?

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}

trait FileSourceCodecSuite extends QueryTest with SQLTestUtils {
Member:

Nit: let's make it

trait FileSourceCodecSuite extends QueryTest with SharedSparkSession 

here so that ParquetCodecSuite doesn't need to extend SharedSparkSession

viirya (Member Author) commented Sep 7, 2021

LGTM. @viirya do we have a separate PR for compressed hadoop seq files? Thanks.

Yea, the PR is #33924. Thanks.


SparkQA commented Sep 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47558/


SparkQA commented Sep 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47558/



SparkQA commented Sep 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47562/


SparkQA commented Sep 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47562/


SparkQA commented Sep 7, 2021

Test build #143059 has finished for PR 33912 at commit 16e7db9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait FileSourceCodecSuite extends QueryTest with SQLTestUtils with SharedSparkSession
  • class ParquetCodecSuite extends FileSourceCodecSuite
  • class OrcCodecSuite extends FileSourceCodecSuite

viirya (Member Author) commented Sep 7, 2021

Thanks for reviewing! Merging to master/3.2.

@viirya viirya closed this in 5a0ae69 Sep 7, 2021
viirya added a commit that referenced this pull request Sep 7, 2021

Closes #33912 from viirya/SPARK-36670.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 5a0ae69)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
@viirya viirya deleted the SPARK-36670 branch September 7, 2021 23:53
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021

Closes apache#33912 from viirya/SPARK-36670.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 5a0ae69)
Signed-off-by: Liang-Chi Hsieh <[email protected]>