-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24805][SQL] Do not ignore avro files without extensions by default #21769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@gengliangwang @gatorsmile Please, have a look at the PR. |
|
Test build #93000 has finished for PR 21769 at commit
|
| // figure out the schema of the whole dataset. | ||
| val sampleFile = | ||
| if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) { | ||
| if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried running queries. The option avro.mapred.ignore.inputs.without.extension is not set in conf. This is a bug in spark-avro.
Please read the value from options. It would be good to have a new test case with avro.mapred.ignore.inputs.without.extension as true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The avro.mapred.ignore.inputs.without.extension is hadoop's parameter. This PR aims to change the default behavior only. I would prefer to do not convert the hadoop parameter to Avro datasource option here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is how people use the option so far: databricks/spark-avro#71 (comment) . Probably we should discuss seperatly from this PR how we could fix the "bug" and could not break backward compatibily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Hadoop config can be changed like:
spark
.sqlContext
.sparkContext
.hadoopConfiguration
.set("avro.mapred.ignore.inputs.without.extension", "true")There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we submit a separate PR to add a new option for AVRO? We should not rely on hadoopConf to control the behaviors of AVRO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the PR: #21798 Please, have a look at it.
| } | ||
| } | ||
|
|
||
| test("SPARK-24805: reading files without .avro extension") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can we create a temp path and copy the original episodes.avro to the path? So that we don't need to have two duplicated resource file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it just introduce unnesseccary dependency here and overcomplicate the test? I can create small (with just one row) avro file without .avro extension especially for the test if you don't mind.
| intercept[java.io.IOException] { | ||
| TestUtils.withTempDir { dir => | ||
| FileUtils.touch(new File(dir, "test")) | ||
| spark.read.avro(dir.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can fix the case as
spark.read.option("avro.mapred.ignore.inputs.without.extension", false).avro(dir.toString)
The behavior will be the same as before. And we don't need to modify the expected FileNotFoundException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually remove this piece of code from the test. It checked the default behavior but it is checked by special test now. Explicit settings for avro.mapred.ignore.inputs.without.extension should be checked in separate tests where the config is set explicitly.
|
Test build #93008 has finished for PR 21769 at commit
|
|
Test build #93009 has finished for PR 21769 at commit
|
|
Test build #93023 has finished for PR 21769 at commit
|
| .read | ||
| .option(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true") | ||
| .avro(tempSaveDir) | ||
| val count = try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: consider writing the try...finally like this:
val hadoopConf = spark.sqlContext.sparkContext.hadoopConfiguration
try {
hadoopConf.set(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true")
val count = spark.read.avro(tempSaveDir).count()
assert(count == 8)
} finally {
hadoopConf.unset(AvroFileFormat.IgnoreFilesWithoutExtensionProperty)
}
|
Test build #93034 has finished for PR 21769 at commit
|
|
Test build #93086 has finished for PR 21769 at commit
|
|
Test build #93102 has finished for PR 21769 at commit
|
|
Test build #93126 has finished for PR 21769 at commit
|
|
@gengliangwang @gatorsmile Please, have a look at the PR. |
|
LGTM since we should still keep the original behavior untouched. Thanks! Merged to master. BTW, can we submit a separate PR to add a new option for AVRO? We should not rely on hadoopConf to control the behaviors of AVRO in general. Let us support both. |
Sure, I will do. |
…ault In the PR, I propose to change default behaviour of AVRO datasource which currently ignores files without `.avro` extension in read by default. This PR sets the default value for `avro.mapred.ignore.inputs.without.extension` to `false` in the case if the parameter is not set by an user. Added a test file without extension in AVRO format, and new test for reading the file with and wihout specified schema. Author: Maxim Gekk <[email protected]> Author: Maxim Gekk <[email protected]> Closes apache#21769 from MaxGekk/avro-without-extension. (cherry picked from commit ba437fc)
What changes were proposed in this pull request?
In the PR, I propose to change default behaviour of AVRO datasource which currently ignores files without
.avroextension in read by default. This PR sets the default value foravro.mapred.ignore.inputs.without.extensiontofalsein the case if the parameter is not set by an user.How was this patch tested?
Added a test file without extension in AVRO format, and new test for reading the file with and wihout specified schema.