[SPARK-24836][SQL] New option for Avro datasource - ignoreExtension #21798
Test build #93196 has finished for PR 21798 at commit
def ignoreFilesWithoutExtensions(conf: Configuration): Boolean = {
  // Files without .avro extensions are not ignored by default
  val defaultValue = false
def ignoreExtension(conf: Configuration, options: Map[String, String]): Boolean = {
Could we have an AvroOptions class, as we do for the other built-in data sources?
Would you like to see it as part of this PR or in a separate one? I would extract some common code, like getBool() from CSVOptions, into a separate trait and have AvroOptions extend it.
conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, defaultValue)
options
  .get("ignoreExtension")
Shall we document this option somewhere?
Sure, I am going to add the AvroOptions class and document all Avro options there.
Let's make sure we describe that in a public API later.
Please look at this PR: #21810. It introduces
Test build #93283 has finished for PR 21798 at commit
I added the new option to
def ignoreFilesWithoutExtensions(conf: Configuration): Boolean = {
  // Files without .avro extensions are not ignored by default
  val defaultValue = false
def ignoreExtension(conf: Configuration, options: AvroOptions): Boolean = {
Can we move this to object AvroOptions?
Also add a comment above this function to describe how we determine it?
LGTM except a comment
Test build #93363 has finished for PR 21798 at commit
Test build #93364 has finished for PR 21798 at commit
LGTM Thanks! Merged to master.
I propose adding a new option for the Avro datasource that controls whether files without the `.avro` extension are ignored on read. The option is named `ignoreExtension` and defaults to `true`: the extension is ignored and all files are read. When it is set to `false`, only files with the `.avro` extension are read. If both `ignoreExtension` and `avro.mapred.ignore.inputs.without.extension` are set, `ignoreExtension` takes precedence. Here is an example of usage:
```
spark
.read
.option("ignoreExtension", false)
.avro("path to avro files")
```
I added a test that checks the option directly and a test that checks that the new option overrides Hadoop's config.
Author: Maxim Gekk <[email protected]>
Closes apache#21798 from MaxGekk/avro-ignore-extension.
(cherry picked from commit 106880e)
What changes were proposed in this pull request?
I propose adding a new option for the Avro datasource that controls whether files without the `.avro` extension are ignored on read. The option is named `ignoreExtension` and defaults to `true`: the extension is ignored and all files are read. If both `ignoreExtension` and `avro.mapred.ignore.inputs.without.extension` are set, `ignoreExtension` takes precedence.
How was this patch tested?
I added a test that checks the option directly and a test that checks that the new option overrides Hadoop's config.
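The precedence between the two settings can be sketched without Spark or Hadoop on the classpath. This is a minimal illustration, not the code merged in this PR: the object `IgnoreExtensionSketch`, the helper `shouldRead`, and the use of plain `Map`s in place of Hadoop's `Configuration` are all hypothetical. It also assumes that, when only the Hadoop property is set, `ignoreExtension` falls back to the negation of that property.

```scala
// Hypothetical, dependency-free sketch; NOT the implementation from this PR.
object IgnoreExtensionSketch {
  // Hadoop property named in the PR description.
  val HadoopKey = "avro.mapred.ignore.inputs.without.extension"

  /** Resolve the effective `ignoreExtension` flag.
    *
    * `hadoopConf` stands in for Hadoop's Configuration and `options`
    * for the datasource options; both are plain maps here (assumption).
    */
  def ignoreExtension(
      hadoopConf: Map[String, String],
      options: Map[String, String]): Boolean = {
    // Files without .avro extensions are not ignored by default.
    val ignoreFilesWithoutExtension =
      hadoopConf.get(HadoopKey).exists(_.toBoolean)

    // The datasource option wins when present; otherwise fall back to
    // the negated Hadoop setting (assumed fallback rule).
    options
      .get("ignoreExtension")
      .map(_.toBoolean)
      .getOrElse(!ignoreFilesWithoutExtension)
  }

  /** Decide whether a file is read, given the resolved flag. */
  def shouldRead(path: String, ignoreExt: Boolean): Boolean =
    ignoreExt || path.endsWith(".avro")
}
```

With both maps empty the sketch resolves to `true`, which matches the stated default. Passing `Map("ignoreExtension" -> "false")` as `options` yields `false` regardless of the Hadoop-style setting.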