Member

@MaxGekk MaxGekk commented Jul 14, 2018

What changes were proposed in this pull request?

In this PR, I propose to change the default behaviour of the AVRO datasource, which currently ignores files without the .avro extension on read. This PR sets the default value of avro.mapred.ignore.inputs.without.extension to false when the parameter is not set by the user.
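
As a rough illustration of the default described above, here is a minimal sketch in plain Scala. It is not code from this PR: a `Map[String, String]` stands in for the Hadoop configuration, and the object name and the `selectInputs` helper are hypothetical. Only the property name is taken from the description.

```scala
object IgnoreExtensionDefault {
  // Property name quoted in the PR description.
  val IgnoreKey = "avro.mapred.ignore.inputs.without.extension"

  // Assumption: `conf` is a plain Map standing in for a Hadoop Configuration.
  // When the key is absent, the flag resolves to false (the proposed default).
  def ignoreFilesWithoutExtensions(conf: Map[String, String]): Boolean =
    conf.get(IgnoreKey).exists(_.trim.equalsIgnoreCase("true"))

  // Hypothetical helper: keep every file unless the flag is on,
  // in which case only files ending in ".avro" are kept.
  def selectInputs(files: Seq[String], conf: Map[String, String]): Seq[String] =
    if (ignoreFilesWithoutExtensions(conf)) files.filter(_.endsWith(".avro"))
    else files
}
```

With an empty configuration both `episodes.avro` and `episodes` would be selected; setting the property to `"true"` would bring back the old filtering.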

How was this patch tested?

Added a test file without an extension in AVRO format, and a new test for reading the file with and without a specified schema.

Member Author

MaxGekk commented Jul 14, 2018

@gengliangwang @gatorsmile Please, have a look at the PR.


SparkQA commented Jul 14, 2018

Test build #93000 has finished for PR 21769 at commit 8562a8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title from "[SPARK-24805][SQL] Do not ignore avro files without extensions" to "[SPARK-24805][SQL] Do not ignore avro files without extensions by default" Jul 14, 2018
  // figure out the schema of the whole dataset.
  val sampleFile =
-   if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
+   if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
Member

I tried running queries. The option avro.mapred.ignore.inputs.without.extension is not set in conf. This is a bug in spark-avro.
Please read the value from options. It would be good to have a new test case with avro.mapred.ignore.inputs.without.extension as true.

Member Author

@MaxGekk MaxGekk Jul 14, 2018

avro.mapred.ignore.inputs.without.extension is a Hadoop parameter. This PR aims to change the default behavior only. I would prefer not to convert the Hadoop parameter into an Avro datasource option here.

Member Author

@MaxGekk MaxGekk Jul 14, 2018

Here is how people have used the option so far: databricks/spark-avro#71 (comment). We should probably discuss separately from this PR how we could fix the "bug" without breaking backward compatibility.

Member Author

The Hadoop config can be changed like:

spark
  .sqlContext
  .sparkContext
  .hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "true")

Member

Can we submit a separate PR to add a new option for AVRO? We should not rely on hadoopConf to control the behaviors of AVRO.

Member Author

Here is the PR: #21798. Please have a look at it.

}
}

test("SPARK-24805: reading files without .avro extension") {
Member

Nit: Can we create a temp path and copy the original episodes.avro to that path, so that we don't need to have two duplicated resource files?
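
One way the suggestion above might look, sketched with `java.nio.file` only. The object and method names are hypothetical, and the caller is assumed to supply the path to an existing file such as the `episodes.avro` test resource; none of this is taken from the PR itself.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object CopyWithoutExtension {
  // Hypothetical helper: copies `source` into a fresh temp directory
  // under the same name with a trailing ".avro" stripped, and returns the new path.
  def copyWithoutExtension(source: Path): Path = {
    val dir = Files.createTempDirectory("avro-no-ext")
    val strippedName = source.getFileName.toString.stripSuffix(".avro")
    Files.copy(source, dir.resolve(strippedName), StandardCopyOption.REPLACE_EXISTING)
  }
}
```

A test could then point the reader at the parent directory of the returned path, at the cost of some extra file handling in the test setup.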

Member Author

@MaxGekk MaxGekk Jul 14, 2018

Doesn't that just introduce an unnecessary dependency here and overcomplicate the test? I can create a small (single-row) avro file without the .avro extension specifically for the test, if you don't mind.

intercept[java.io.IOException] {
TestUtils.withTempDir { dir =>
FileUtils.touch(new File(dir, "test"))
spark.read.avro(dir.toString)
Member

We can fix the case as

spark.read.option("avro.mapred.ignore.inputs.without.extension", false).avro(dir.toString)

The behavior will be the same as before. And we don't need to modify the expected FileNotFoundException

Member Author

@MaxGekk MaxGekk Jul 14, 2018

I would actually remove this piece of code from the test. It checked the default behavior, but that is now covered by a dedicated test. Explicit settings for avro.mapred.ignore.inputs.without.extension should be checked in separate tests where the config is set explicitly.


SparkQA commented Jul 14, 2018

Test build #93008 has finished for PR 21769 at commit a7d078e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 14, 2018

Test build #93009 has finished for PR 21769 at commit 3b75c27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 15, 2018

Test build #93023 has finished for PR 21769 at commit bb1098f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.read
.option(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true")
.avro(tempSaveDir)
val count = try {
Member

Nit: consider writing the try...finally like this:

      val hadoopConf = spark.sqlContext.sparkContext.hadoopConfiguration
      try {
        hadoopConf.set(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true")
        val count = spark.read.avro(tempSaveDir).count()
        assert(count == 8)
      } finally {
        hadoopConf.unset(AvroFileFormat.IgnoreFilesWithoutExtensionProperty)
      }
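
A possible variation on the try...finally above, sketched in plain Scala: a loan-pattern helper that restores whatever value was set before instead of always unsetting the key. The `withConfValue` name is hypothetical, and a `mutable.Map` stands in for the Hadoop configuration so that the sketch runs without Spark or Hadoop on the classpath.

```scala
import scala.collection.mutable

object TemporaryConf {
  // Hypothetical helper; `conf` is a mutable Map standing in for a Hadoop Configuration.
  def withConfValue[T](conf: mutable.Map[String, String], key: String, value: String)(body: => T): T = {
    val previous = conf.get(key) // remember any value that was set before
    conf(key) = value
    try {
      body
    } finally {
      previous match {
        case Some(old) => conf(key) = old  // restore the earlier value
        case None      => conf.remove(key) // or clear the key if it was unset
      }
    }
  }
}
```

Restoring the previous value only matters if other tests in the same suite set the property themselves; if nothing else touches it, a plain unset as in the suggestion above is equivalent.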


SparkQA commented Jul 15, 2018

Test build #93034 has finished for PR 21769 at commit 91e40e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 16, 2018

Test build #93086 has finished for PR 21769 at commit a7f3835.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType)
  • sealed trait CatalystDataUpdater
  • final class RowUpdater(row: InternalRow) extends CatalystDataUpdater
  • final class ArrayDataUpdater(array: ArrayData) extends CatalystDataUpdater
  • class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable: Boolean)
  • class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)
  • class SerializableSchema(@transient var value: Schema)


SparkQA commented Jul 16, 2018

Test build #93102 has finished for PR 21769 at commit 134c724.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 16, 2018

Test build #93126 has finished for PR 21769 at commit 85cdf87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

MaxGekk commented Jul 16, 2018

@gengliangwang @gatorsmile Please, have a look at the PR.

@gatorsmile
Member

LGTM since we should still keep the original behavior untouched.

Thanks! Merged to master.

BTW, can we submit a separate PR to add a new option for AVRO? We should not rely on hadoopConf to control the behaviors of AVRO in general. Let us support both.
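
To make "support both" concrete, here is a minimal sketch of one possible resolution order: a per-read option first, then the Hadoop configuration, then the default of false. This ordering is an assumption for illustration, not the behaviour of any merged code. Both inputs are plain Maps standing in for the real option and configuration objects, and the object and method names are hypothetical.

```scala
object IgnoreExtensionPrecedence {
  // Property name used throughout this conversation.
  val IgnoreKey = "avro.mapred.ignore.inputs.without.extension"

  // Assumed order: datasource option, then Hadoop conf, then default false.
  def resolve(options: Map[String, String], hadoopConf: Map[String, String]): Boolean =
    options.get(IgnoreKey)
      .orElse(hadoopConf.get(IgnoreKey))
      .exists(_.trim.equalsIgnoreCase("true"))
}
```

Under this ordering an explicit `"false"` passed as a read option would win over `"true"` in the Hadoop configuration, while existing jobs that only set the Hadoop property would keep working.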

Member Author

MaxGekk commented Jul 16, 2018

can we submit a separate PR to add a new option for AVRO?

Sure, I will do that.

@asfgit asfgit closed this in ba437fc Jul 16, 2018
@MaxGekk MaxGekk deleted the avro-without-extension branch August 17, 2019 13:35
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…ault

In this PR, I propose to change the default behaviour of the AVRO datasource, which currently ignores files without the `.avro` extension on read. This PR sets the default value of `avro.mapred.ignore.inputs.without.extension` to `false` when the parameter is not set by the user.

Added a test file without an extension in AVRO format, and a new test for reading the file with and without a specified schema.

Author: Maxim Gekk <[email protected]>
Author: Maxim Gekk <[email protected]>

Closes apache#21769 from MaxGekk/avro-without-extension.

(cherry picked from commit ba437fc)