[SPARK-24881][SQL] New Avro option - compression #21837
Conversation
Test build #93404 has finished for PR 21837 at commit
jenkins, retest this, please
Test build #93405 has finished for PR 21837 at commit
      .avro(snappyDir)

    val uncompressSize = FileUtils.sizeOfDirectory(new File(uncompressDir))
    val deflateSize = FileUtils.sizeOfDirectory(new File(deflateDir))
Is there an easy way to check the metadata for the compression level?
Test build #93408 has finished for PR 21837 at commit
    case "deflate" =>
      val deflateLevel = spark.conf.get(
        AVRO_DEFLATE_LEVEL, Deflater.DEFAULT_COMPRESSION.toString).toInt
Deflater.DEFAULT_COMPRESSION is -1 here. Why change the default value to 6 in SQLConf?
I changed it because I didn't know what -1 means. Is it closer to best compression (9) or to fast compression (1)? The level is probably eventually passed down to zlib, in which -1 means 6. @gengliangwang From your point of view, does -1 mean better or faster compression?
If the compression algorithm changes in the future, the default value may not be 6. But if we use DEFAULT_COMPRESSION, it will still be equivalent to the new default value. So I suggest we keep Deflater.DEFAULT_COMPRESSION instead of any specific number.
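The -1-versus-6 question can be sanity-checked against zlib itself. A quick Python sketch, using Python's stdlib zlib binding purely to illustrate the underlying library's behavior (this is not Spark's Avro code path):

```python
import zlib

# Z_DEFAULT_COMPRESSION is the sentinel -1; current zlib translates it to
# level 6 internally, so the two calls produce identical compressed bytes.
data = b"spark avro compression level example " * 100

default_out = zlib.compress(data, zlib.Z_DEFAULT_COMPRESSION)  # -1
level6_out = zlib.compress(data, 6)

print(default_out == level6_out)
```

So as of today -1 behaves like level 6, but as the comment above notes, the sentinel tracks whatever the library's default becomes.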
    val deflateDir = s"$dir/deflate"
    val snappyDir = s"$dir/snappy"

    val df = spark.read.avro(testAvro)
I am removing all the .avro methods in #21841. Could you change it to `.format("avro").load` or `.format("avro").save`? Thanks!
# Conflicts:
#	external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
@HyukjinKwon I am not sure the level exists in the metadata. At least
Test build #93461 has finished for PR 21837 at commit
      .save(snappyDir)

    val uncompressSize = FileUtils.sizeOfDirectory(new File(uncompressDir))
    val deflateSize = FileUtils.sizeOfDirectory(new File(deflateDir))
Thank you, @MaxGekk. Can we then at least check the type of compression, i.e. that avro.codec is deflate?
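One way to make that check without Spark is to read the codec straight out of the Avro object container header. A minimal Python sketch (the helper names `read_varint` and `avro_codec` are mine, not a Spark or Avro API; it assumes the standard container layout, where the file metadata is a map of length-prefixed keys and values and the codec sits under the `avro.codec` key):

```python
def read_varint(buf: bytes, pos: int):
    """Decode one Avro zig-zag varint at pos; return (value, next_pos)."""
    shift, accum = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        accum |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1), pos  # undo zig-zag encoding

def avro_codec(header: bytes) -> str:
    """Extract the avro.codec metadata value from an Avro file header."""
    key = b"avro.codec"
    idx = header.find(key)
    if idx < 0:
        return "null"  # an absent codec entry means uncompressed
    length, pos = read_varint(header, idx + len(key))
    return header[pos:pos + length].decode("ascii")
```

Reading just a prefix of the file is enough, since the metadata map precedes the data blocks. Note that `find()` is a shortcut: it would misfire if the literal string `avro.codec` happened to appear inside the schema JSON.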
     * config `spark.sql.avro.deflate.level` is used by default. For other compressions, the default
     * value is `6`.
     */
    val compressionLevel: Int = {
Can we not expose this as an option for now? IIUC, this compression level only applies to deflate, right? Also, this option doesn't look like one for matching the options of third-party libraries either.
I added the option keeping in mind that other compression codecs can be added in the future, for example zstandard. For those codecs, the level could be useful too. Another point is that specifying the compression level together with the compression codec in Avro options looks more natural compared to global SQL settings:

    df.write
      .options(Map("compression" -> "deflate", "compressionLevel" -> "9"))
      .format("avro")
      .save(deflateDir)

vs

    spark.conf.set("spark.sql.avro.deflate.level", "9")
    df.write
      .option("compression", "deflate")
      .format("avro")
      .save(deflateDir)
Yea, I know that could be useful in some ways, but I was thinking we'd better not add this just for now. Thing is, it currently sounds too specific to one compression option in Avro. There are many options we could expose this way in, for example, the CSV datasource too.
Also, to be honest, I wonder whether users would want to change the compression level often.
    val AVRO_COMPRESSION_CODEC = buildConf("spark.sql.avro.compression.codec")
      .doc("Compression codec used in writing of AVRO files.")
      .stringConf
      .createWithDefault("snappy")
Can we add a `.checkValues(Set(` of the supported codecs too?
    val AVRO_DEFLATE_LEVEL = buildConf("spark.sql.avro.deflate.level")
      .doc("Compression level for the deflate codec used in writing of AVRO files. " +
        "Valid value must be in the range of from 1 to 9 inclusive. " +
        "The default value is -1 which corresponds to 6 level in the current implementation.")
Can we do a check like `checkValue(_ => -1, ...)` here?
    val AVRO_DEFLATE_LEVEL = buildConf("spark.sql.avro.deflate.level")
      .doc("Compression level for the deflate codec used in writing of AVRO files. " +
        "Valid value must be in the range of from 1 to 9 inclusive. " +
This can also be -1, right (https://www.zlib.net/manual.html)?
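The valid range under discussion (the zlib sentinel -1 plus explicit levels 1 through 9) can be captured by a simple predicate. A Python sketch of the rule (the function name is hypothetical; the real check would live in SQLConf's `checkValue` in Scala):

```python
import zlib

def valid_deflate_level(level: int) -> bool:
    """Accept zlib's 'use the default' sentinel (-1) or an explicit 1..9."""
    return level == zlib.Z_DEFAULT_COMPRESSION or 1 <= level <= 9

# Levels outside [1, 9] other than the -1 sentinel are rejected.
for lvl in (-1, 0, 1, 9, 10):
    print(lvl, valid_deflate_level(lvl))
```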
# Conflicts:
#	external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
      .createWithDefault(20)

    val AVRO_COMPRESSION_CODEC = buildConf("spark.sql.avro.compression.codec")
      .doc("Compression codec used in writing of AVRO files.")
Document the default value?
Test build #93636 has finished for PR 21837 at commit
HyukjinKwon left a comment:
LGTM
Test build #93632 has finished for PR 21837 at commit
       @transient val parameters: CaseInsensitiveMap[String],
    -  @transient val conf: Configuration) extends Logging with Serializable {
    +  @transient val conf: Configuration,
    +  @transient val sqlConf: SQLConf) extends Logging with Serializable {
We may just get SQLConf by calling SQLConf.get without passing it in.
    -  test("write with compression") {
    +  test("write with compression - sql configs") {
         withTempPath { dir =>
           val AVRO_COMPRESSION_CODEC = "spark.sql.avro.compression.codec"
This can use SQLConf.AVRO_COMPRESSION_CODEC.key now.
    test("write with compression - sql configs") {
      withTempPath { dir =>
        val AVRO_COMPRESSION_CODEC = "spark.sql.avro.compression.codec"
        val AVRO_DEFLATE_LEVEL = "spark.sql.avro.deflate.level"
ditto.
Test build #93634 has finished for PR 21837 at commit
Test build #93637 has finished for PR 21837 at commit
Test build #93658 has finished for PR 21837 at commit
jenkins, retest this, please
Test build #93659 has finished for PR 21837 at commit
Merged to master.
In the PR, I added a new option for the Avro datasource - `compression`. The option allows specifying the compression codec for saved Avro files. This option is similar to the `compression` option in other datasources like `JSON` and `CSV`. Also I added the SQL configs `spark.sql.avro.compression.codec` and `spark.sql.avro.deflate.level` and put them into `SQLConf`. If the `compression` option is not specified by a user, the first SQL config is taken into account. I added a new test which reads meta info from written avro files and checks the `avro.codec` property.

Author: Maxim Gekk <[email protected]>

Closes apache#21837 from MaxGekk/avro-compression.

(cherry picked from commit 0a0f68b)
What changes were proposed in this pull request?

In the PR, I added a new option for the Avro datasource - `compression`. The option allows specifying the compression codec for saved Avro files. This option is similar to the `compression` option in other datasources like `JSON` and `CSV`.

Also I added the SQL configs `spark.sql.avro.compression.codec` and `spark.sql.avro.deflate.level`. I put the configs into `SQLConf`. If the `compression` option is not specified by a user, the first SQL config is taken into account.

How was this patch tested?

I added a new test which reads meta info from written avro files and checks the `avro.codec` property.