[SPARK-36670][SQL][TEST] Add FileSourceCodecSuite #33912
Conversation
Found this issue while adding these tests. Created SPARK-36669 for it.
See #33913 too.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #142997 has finished for PR 33912 at commit
Thank you, @viirya.
```scala
override def dataSourceName: String = "orc"
override val codecConfigName = SQLConf.ORC_COMPRESSION.key
override protected def availableCodecs = Seq("none", "uncompressed", "snappy",
  "zlib", "zstd", "lz4", "lzo")
```
To reviewers: as you can see here, Apache ORC has no issue because it uses its AircompressorCodec for LZ4.
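As a side note, the per-format settings quoted above can be modeled as a small value class. This is a hypothetical sketch, not the suite's actual code; the `CodecSpec` name is illustrative, while the codec list comes from the diff and the conf key is the value of `SQLConf.ORC_COMPRESSION.key`:

```scala
// Hypothetical model of one format's codec configuration: the format name,
// the SQLConf key that selects the codec, and the codec names to test.
case class CodecSpec(format: String, confKey: String, codecs: Seq[String])

val orcSpec = CodecSpec(
  format = "orc",
  confKey = "spark.sql.orc.compression.codec", // SQLConf.ORC_COMPRESSION.key
  codecs = Seq("none", "uncompressed", "snappy", "zlib", "zstd", "lz4", "lzo"))
```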
```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

class OrcCodecTestSuite extends DataSourceCodecTest with SharedSparkSession {
```
nit, OrcCodecTestSuite -> OrcCodecSuite?
```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

class ParquetCodecTestSuite extends DataSourceCodecTest with SharedSparkSession {
```
nit, ParquetCodecTestSuite -> ParquetCodecSuite
```scala
testWithAllCodecs("write and read - single partition") {
```
This test case seems to be covered by the "write and read" test. Do we need this test case separately?
Only the partition number is different.
Ya, it looks like that. In that case, there is no difference in terms of the test coverage.
```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SQLTestUtils

abstract class DataSourceCodecTest extends QueryTest with SQLTestUtils {
```
If we are going to test only file-based data sources, we can make a single simple suite like FileBasedDataSourceSuite.
Maybe, FileBasedDataSourceCodecSuite?

```scala
private val allFileBasedDataSources = Seq("orc", "parquet", ...)
allFileBasedDataSources.foreach { format =>
  test(s"... - $format") {
```
ok, sounds good. let me refactor it.
Actually, this code style is a bit hard to extend (e.g. how do we test Avro?) and I was planning to refactor the existing test suites as well. I think a better solution is:

```scala
trait FileSourceCodecSuite ... {
  def format: String
  ...
}

class ParquetCodecSuite extends FileSourceCodecSuite
class OrcCodecSuite ...
```
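A self-contained sketch of this layout, with Spark's test mixins (QueryTest, SharedSparkSession) replaced by plain Scala so the shape stands alone. Only `format` and the suite names come from the thread; the codec lists and the `testNames` helper are illustrative:

```scala
// Plain-Scala analogue of the proposed structure. In the real suites the
// trait would extend Spark's test base classes and register tests via a
// helper like testWithAllCodecs; here we just generate test names.
trait FileSourceCodecSuiteLike {
  def format: String
  def availableCodecs: Seq[String]
  // One generated test name per codec, mirroring a testWithAllCodecs helper.
  def testNames: Seq[String] =
    availableCodecs.map(codec => s"write and read - $format with codec $codec")
}

class ParquetCodecSuiteLike extends FileSourceCodecSuiteLike {
  def format: String = "parquet"
  def availableCodecs: Seq[String] = Seq("snappy", "gzip")
}

class OrcCodecSuiteLike extends FileSourceCodecSuiteLike {
  def format: String = "orc"
  def availableCodecs: Seq[String] = Seq("snappy", "zlib")
}
```

The design point is that adding a new format (e.g. Avro) becomes one small subclass rather than another entry in a shared loop.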
@cloud-fan's idea is closer to what I had in mind at the beginning. @dongjoon-hyun, is it good for you too?
Note: I think these test suites can be in one file if possible.
Yes, @viirya and @cloud-fan, +1 for the proposed structure in a single file.
```scala
/**
 * A temporary workaround for SPARK-36669. We should remove this after the Hadoop 3.3.2
 * release, which fixes the LZ4 relocation in shaded Hadoop client libraries. This does
 * not need to implement the full net.jpountz.lz4.LZ4Compressor API, just the methods
 * used by Hadoop's Lz4Compressor.
 */
```
Now this is not a test-only PR. Can we update the PR title and description accordingly?
yea, let me update it.
Thanks!
Kubernetes integration test status failure

Test build #143033 has finished for PR 33912 at commit
Could we add a test for hadoop seq files using ...? We might want to exclude the relocation of snappy in Hadoop as well. The relocation only relocates the Java classes; the native JNI interfaces will not be relocated. So let's say we include two different versions of .... I remember @dongjoon-hyun saw this issue when he worked on the ....
Test build #143036 has finished for PR 33912 at commit
Let me add a test for hadoop seq files in a different PR. For snappy-java, that's a good point. Looks like we also need to exclude it from relocation in Hadoop.
It's not only a compatibility issue. Native libraries shouldn't be relocated at all, due to JNI method resolution: the relocated snappy-java cannot resolve the native methods in SnappyCodec. Created another blocker, SPARK-36681, for this issue. We definitely should exclude snappy-java from relocation in Hadoop; I'm excluding it together with lz4-java in the same Hadoop PR. Unlike the lz4-java issue, where we can add wrappers as a workaround, snappy-java is both relocated and included in the client libraries, so it seems to me that kind of workaround cannot work.
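To illustrate why relocation breaks JNI resolution, here is a simplified model (not Hadoop's actual shading logic, and it ignores JNI's escaping of special characters): the JVM derives the native symbol name from the fully qualified class name, so rewriting the package during shading yields a symbol the prebuilt native library does not export. The relocated package name below is hypothetical.

```scala
// Simplified model of JNI symbol resolution: the JVM looks up a native
// method under "Java_" + fully-qualified class name + "_" + method name,
// with dots replaced by underscores. Shading rewrites the package, so the
// derived symbol no longer matches what the native .so/.dylib exports.
def jniSymbol(fqcn: String, method: String): String =
  "Java_" + fqcn.replace('.', '_') + "_" + method

val exported = jniSymbol("org.xerial.snappy.SnappyNative", "rawCompress")
// Hypothetical relocated package, as a shaded client jar might produce:
val relocated = jniSymbol(
  "org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative", "rawCompress")
```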
Since the workaround cannot work for snappy-java, it makes less sense to add the wrapper classes here. I will remove them and keep only the working codecs for now. We will deal with the lz4-java and snappy-java issues in Hadoop and in separate JIRAs.
Thank you for the update, @viirya.
LGTM. @viirya, do we have a separate PR for compressed hadoop seq files? Thanks.
pom.xml (Outdated)

```xml
    <enabled>false</enabled>
  </snapshots>
</repository>
```
nit, remove empty line?
```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}

trait FileSourceCodecSuite extends QueryTest with SQLTestUtils {
```
Nit: let's make it

```scala
trait FileSourceCodecSuite extends QueryTest with SharedSparkSession
```

here so that ParquetCodecSuite doesn't need to extend SharedSparkSession.
Test build #143059 has finished for PR 33912 at commit
Thanks for reviewing! Merging to master/3.2.
### What changes were proposed in this pull request?

This patch mainly proposes to add some e2e test cases in Spark for the codecs used by the main data sources.

### Why are the changes needed?

We found there are no e2e test cases available for main data sources like Parquet and ORC, which makes it harder for developers to identify possible bugs early. We should add such tests in Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #33912 from viirya/SPARK-36670.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 5a0ae69)
Signed-off-by: Liang-Chi Hsieh <[email protected]>