Member

@MaxGekk MaxGekk commented Jul 17, 2018

What changes were proposed in this pull request?

I propose adding a new option to the Avro datasource that controls whether files without the .avro extension are ignored on read. The option is named ignoreExtension and defaults to true. If both ignoreExtension and avro.mapred.ignore.inputs.without.extension are set, ignoreExtension overrides the latter. Here is a usage example:

spark
  .read
  .option("ignoreExtension", false)
  .avro("path to avro files")
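To make the effect of the flag concrete, here is a minimal sketch rather than the datasource code itself. `selectFiles` is a hypothetical helper, and the sketch assumes that `ignoreExtension = true` means the extension is disregarded (every file is read), while `false` restricts reading to files ending in `.avro`.

```scala
// Hypothetical helper, for illustration only; not part of Spark.
object IgnoreExtensionSketch {
  // Assumption: with ignoreExtension = true the file extension is disregarded
  // and every file is read; with false only files ending in ".avro" are read.
  def selectFiles(paths: Seq[String], ignoreExtension: Boolean): Seq[String] =
    paths.filter(path => ignoreExtension || path.endsWith(".avro"))

  def main(args: Array[String]): Unit = {
    val paths = Seq("part-00000.avro", "part-00001", "_SUCCESS")
    println(selectFiles(paths, ignoreExtension = true))
    println(selectFiles(paths, ignoreExtension = false))
  }
}
```

Under that assumption, the usage example above (`ignoreExtension` set to `false`) would read only the `.avro` files in the given path.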

How was this patch tested?

I added a test that checks the option directly and a test that checks that the new option overrides Hadoop's config.

SparkQA commented Jul 17, 2018

Test build #93196 has finished for PR 21798 at commit 565e599.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def ignoreFilesWithoutExtensions(conf: Configuration): Boolean = {
// Files without .avro extensions are not ignored by default
val defaultValue = false
def ignoreExtension(conf: Configuration, options: Map[String, String]): Boolean = {
Member

Could we have a class AvroOptions like what we are doing for the other built-in data sources?

Member Author

Would you like to see it as part of this PR or as a separate one? I would extract some common code, like getBool(), from CSVOptions into a separate trait and have AvroOptions extend it.
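A rough sketch of what such a shared trait could look like. The trait name, the method signature and the error message are assumptions made for illustration; this is not the code in CSVOptions, and `SketchAvroOptions` is a hypothetical stand-in for the proposed class.

```scala
import java.util.Locale

// Hypothetical trait; the name and signature are assumptions, not Spark API.
trait BooleanOptionParsing {
  // Raw, string-valued options as passed by the user.
  protected def parameters: Map[String, String]

  // Parses a "true"/"false" option case-insensitively, falls back to
  // `default` when the key is absent, and rejects any other value.
  protected def getBool(name: String, default: Boolean = false): Boolean =
    parameters.get(name) match {
      case None => default
      case Some(raw) =>
        raw.toLowerCase(Locale.ROOT) match {
          case "true" => true
          case "false" => false
          case _ =>
            throw new IllegalArgumentException(s"$name flag can be true or false")
        }
    }
}

// Hypothetical options class reusing the shared parser.
class SketchAvroOptions(protected val parameters: Map[String, String])
    extends BooleanOptionParsing {
  val ignoreExtension: Boolean = getBool("ignoreExtension", default = true)
}
```

Keeping the parser in a trait would let each datasource's options class declare its flags in one line while sharing the validation logic.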


conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, defaultValue)
options
.get("ignoreExtension")
Member

Shall we document this option somewhere?

Member Author

@MaxGekk MaxGekk Jul 17, 2018

Sure, I am going to add the AvroOptions class and document all Avro options there.

Member

Let's make sure we describe that in a public API later.

Member Author

MaxGekk commented Jul 18, 2018

Please look at this PR: #21810. It introduces AvroOptions.

SparkQA commented Jul 19, 2018

Test build #93283 has finished for PR 21798 at commit 3bd3475.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

MaxGekk commented Jul 19, 2018

I added the new option to AvroOptions and documented it. Please look at the PR one more time.

def ignoreFilesWithoutExtensions(conf: Configuration): Boolean = {
// Files without .avro extensions are not ignored by default
val defaultValue = false
def ignoreExtension(conf: Configuration, options: AvroOptions): Boolean = {
Member

Can we move this to object AvroOptions?

Member

Also add a comment above this function to describe how we determine it?
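One way both review comments could be addressed, sketched with plain maps standing in for Hadoop's Configuration and for AvroOptions so that it runs without Spark on the classpath. The object name, the signature and the fallback rule are assumptions; only the property key and the option name come from the PR description, and this is not the code that was merged.

```scala
// Illustrative stand-in; not the code that was merged.
object AvroOptionsSketch {
  // Hadoop property named in the PR description.
  val IgnoreFilesWithoutExtensionProperty =
    "avro.mapred.ignore.inputs.without.extension"

  /**
   * Decides whether file extensions are ignored on read.
   *  1. If the `ignoreExtension` option is set, it wins.
   *  2. Otherwise the Hadoop property is consulted. It states whether files
   *     without an extension are skipped, so its value is negated (assumption).
   *  3. If neither is set, extensions are ignored (the default `true`).
   */
  def ignoreExtension(
      hadoopConf: Map[String, String],
      options: Map[String, String]): Boolean = {
    // Files without .avro extensions are not ignored by default.
    val ignoreFilesWithoutExtension =
      hadoopConf.get(IgnoreFilesWithoutExtensionProperty).exists(_.toBoolean)

    options
      .get("ignoreExtension")
      .map(_.toBoolean)
      .getOrElse(!ignoreFilesWithoutExtension)
  }
}
```

Placing the function in a companion-style object keeps the precedence rule next to the option it describes, and the doc comment spells out how the value is determined.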

@gatorsmile
Member

LGTM except a comment

SparkQA commented Jul 20, 2018

Test build #93363 has finished for PR 21798 at commit 0657508.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroOptions(

SparkQA commented Jul 20, 2018

Test build #93364 has finished for PR 21798 at commit 3206a20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in 106880e Jul 21, 2018
@MaxGekk MaxGekk deleted the avro-ignore-extension branch August 17, 2019 13:35
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023

Author: Maxim Gekk <[email protected]>

Closes apache#21798 from MaxGekk/avro-ignore-extension.

(cherry picked from commit 106880e)

5 participants