[SPARK-17850][Core] Add a flag to ignore corrupt files #15422
Conversation
/cc @marmbrus
Test build #66691 has finished for PR 15422 at commit
retest this please
Test build #66698 has finished for PR 15422 at commit
Why would a corrupt record cause an EOFException to be thrown?
If this happens, it isn't clear that anything that was read is valid. It doesn't seem like something to ignore. Log, at least. I know people differ on this, but I think continuing with a partial and possibly corrupt read, even with a warning, seems more likely to cause tears.
@srowen The tuples already returned would have been valid; it is the subsequent block decompression which has failed. For example, in a 1 GB file, a few missing (or corrupt) bytes at the end will cause the last block to be decompressed incorrectly - but all previously returned tuples would still be fine and valid. Only the 'current' key/value resulted in the exception.
I agree that the data that was already read is probably good. I also think that this is a pretty big behavior change, and there are legitimate cases (i.e. tons of data where it is fine to miss some) in which you'd only want a warning. Can we add a flag for failing on unexpected EOF? (probably set to
@mridulm This fix just makes HadoopRDD consistent with NewHadoopRDD and the current behavior of Spark SQL in 2.0. For 1.6, that's another story, since Spark SQL uses HadoopRDD directly. However, note that this code runs on the executor side, which means a logged warning usually won't be noticed by the user.
@zsxwing You are right, NewHadoopRDD is not handling this case. The context is that, for large jobs/data, it is not unexpected to see some data corruption at times. We don't want to throw out the entire job due to a few bad records.
@marmbrus +1 on logging; that is something which was probably missed here.
@mridulm for the scenario you're imagining, maybe the data is OK, sure - but that doesn't mean it's true in all cases. Yeah, this is really to work around bad input, which you can to some degree do at the user level. Other parts of Spark don't work this way. I'm neutral on whether this is a good idea at all, but would prefer consistency more than anything.

I may be wrong, but in MR, I think bad records just means
@zsxwing The map task is run by https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java, no? You can take a look at skipping bad records here: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Skipping+Bad+Records
@srowen Since this is happening 'below' the user code (in the Hadoop RDD), is there a way for the user to handle this? EOFException being thrown instead of IOException is an implementation detail of the codec, if I am not wrong (I don't think the contract specifies this) - @zsxwing can confirm, though; I am not very familiar with that. Setting finished = true should probably be done not just for EOFException, but also for IOException. I am fine with that as well, though this is a behavior change.
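To make the behavior under discussion concrete, here is a minimal sketch of the read loop semantics, assuming hypothetical stand-in names (it is an illustration of the idea, not the actual HadoopRDD code): records decoded from intact blocks are kept, and an IOException (EOFException is a subclass) during a later read marks the input as finished and is logged instead of failing the task.

```scala
import java.io.IOException

// Hypothetical stand-ins for the HadoopRDD internals being discussed; this is
// an illustration of the behavior, not the actual Spark code.
def readRecordsIgnoringCorruption[T](
    readNext: () => Option[T],          // e.g. wraps reader.next(key, value)
    ignoreCorruptFiles: Boolean): Seq[T] = {
  val results = Seq.newBuilder[T]
  var finished = false
  while (!finished) {
    try {
      readNext() match {
        case Some(record) => results += record   // tuples from intact blocks
        case None         => finished = true     // normal end of input
      }
    } catch {
      case e: IOException if ignoreCorruptFiles =>
        // Corrupt or truncated block: log a warning and stop, keeping the
        // records that were already read. Without the flag, the exception
        // propagates and the task fails as before.
        System.err.println(s"Skipping the rest of a corrupt file: ${e.getMessage}")
        finished = true
    }
  }
  results.result()
}
```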
Test build #66776 has finished for PR 15422 at commit
    finished = true
  } else {
    throw e
  }
nit: `case e: IOException if ignoreCorruptFiles =>` would have been more concise.
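For context, the pattern guard folds the flag check into the case clause, so the else/rethrow branch shown in the diff above becomes unnecessary: exceptions that don't match the guard simply propagate. A small sketch with hypothetical stand-in names:

```scala
import java.io.IOException

// Hypothetical helper; only the shape of the catch clause matters here.
def readOnce(read: () => Unit, ignoreCorruptFiles: Boolean): Boolean = {
  var finished = false
  try {
    read()
  } catch {
    // Equivalent to catching IOException and then doing
    // `if (ignoreCorruptFiles) finished = true else throw e`, in one clause.
    case e: IOException if ignoreCorruptFiles =>
      finished = true
  }
  finished
}
```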
| "encountering corrupt files and contents that have been read will still be returned.") | ||
| .booleanConf | ||
| .createWithDefault(false) | ||
|
|
Curious why we are duplicating the parameter in the sql namespace. Won't spark.files.ignoreCorruptFiles do?
A SQL conf will appear in the output of the following command:
`sql("set -v").filter('key contains "files").show(truncate = false)`
Interesting, thanks for clarifying!
    } else {
      throw e
    }
  }
Thanks for changing this too!
| .doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " + | ||
| "encountering corrupt files and contents that have been read will still be returned.") | ||
| .booleanConf | ||
| .createWithDefault(false) |
So either way - whether in NewHadoopRDD or HadoopRDD - we will have a behavioral change.
IMO that is fine, given that we are standardizing the behavior and this was a corner case anyway.
Setting the default to false makes sense.
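For reference, the core-side entry plausibly looks like the following, reconstructed from the fragments in this diff; the enclosing object and the constant name are assumptions for illustration, not the merged code verbatim.

```scala
package org.apache.spark.internal

import org.apache.spark.internal.config.ConfigBuilder

// Sketch reconstructed from the diff fragments above; only the key, doc string,
// and builder calls are taken from the diff.
object IgnoreCorruptFilesConf {
  val IGNORE_CORRUPT_FILES = ConfigBuilder("spark.files.ignoreCorruptFiles")
    .doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " +
      "encountering corrupt files and contents that have been read will still be returned.")
    .booleanConf
    .createWithDefault(false)
}
```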
Merged - had an issue with pip (new laptop, sigh), so the JIRA and PR did not get closed.
I will work on a patch for 1.6.
Test build #66839 has finished for PR 15422 at commit
PR for 1.6: #15454
@zsxwing shouldn't we at least log the exception?
What changes were proposed in this pull request?
Add a flag to ignore corrupt files. For Spark core, the configuration is `spark.files.ignoreCorruptFiles`. For Spark SQL, it's `spark.sql.files.ignoreCorruptFiles`.

How was this patch tested?

The added unit tests.
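For completeness, a usage sketch of the new flags (the config keys come from the PR description; the app name and input path are placeholders): with both flags enabled, a job reading files with truncated or corrupt compressed blocks keeps the records decoded before the corruption instead of failing.

```scala
import org.apache.spark.sql.SparkSession

// The two config keys come from the PR description; the path is a placeholder.
val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-example")
  .master("local[*]")
  .config("spark.files.ignoreCorruptFiles", "true")      // Spark core (HadoopRDD / NewHadoopRDD)
  .config("spark.sql.files.ignoreCorruptFiles", "true")  // Spark SQL file sources
  .getOrCreate()

// Records decoded before a corrupt or truncated block are still returned;
// without the flags, the same read would fail the job.
val count = spark.sparkContext.textFile("/data/possibly-corrupt/*.gz").count()
println(s"Read $count lines.")
```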