[SPARK-17850][Core] Add a flag to ignore corrupt files #15422
Conversation
/cc @marmbrus
Test build #66691 has finished for PR 15422 at commit
retest this please
Test build #66698 has finished for PR 15422 at commit
Why would a corrupt record cause an EOFException to be thrown?
If this happens, it isn't clear that anything that was read is valid. It doesn't seem like something to ignore. Log, at least. I know people differ on this, but I think continuing with a partial and possibly corrupt read, even with a warning, seems more likely to cause tears.
@srowen The tuples already returned would have been valid; it is the subsequent block decompression which has failed. For example, in a 1 GB file, a few missing (or corrupt) bytes at the end will cause the last block to be decompressed incorrectly - but all previously returned tuples would still be fine and valid. Only the 'current' key/value resulted in the exception.
I agree that the data that was already read is probably good. I also think that this is a pretty big behavior change, and there are legitimate cases (i.e. tons of data where it is fine to miss some) in which you'd only want a warning. Can we add a flag for failing on unexpected EOF? (probably set to
@mridulm This fix just makes HadoopRDD consistent with NewHadoopRDD and the current behavior of Spark SQL in 2.0. For 1.6, that's another story, since Spark SQL uses HadoopRDD directly. However, note that this code runs on the executor side, which means a logged warning usually won't be noticed by the user.
@zsxwing You are right, NewHadoopRDD is not handling this case. The context is that, for large jobs/data, it is not unexpected to see some data corruption at times. We don't want to throw out the entire job due to a few bad records.
@marmbrus +1 on logging; that is something which was probably missed here.
@mridulm for the scenario you're imagining, maybe the data is OK, sure - but that doesn't mean it's true in all cases. Yeah, this is really to work around bad input, which you can to some degree do at the user level. Other parts of Spark don't work this way. I'm neutral on whether this is a good idea at all, but would prefer consistency more than anything.

I may be wrong, but in MR, I think bad records just means
@zsxwing The map task is run by https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java, no? You can take a look at skipping bad records here: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Skipping+Bad+Records
@srowen Since this is happening 'below' the user code (in the Hadoop RDD), is there a way for the user to handle this? EOFException being thrown instead of IOException is an implementation detail of the codec, if I am not wrong (I don't think the contract specifies this) - @zsxwing can confirm, though; I am not very familiar with that. Setting finished = true should probably be done not just for EOFException, but also for IOException. I am fine with that as well, though this is a behavior change.
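To make the behavior under discussion concrete, here is a minimal sketch of the read loop semantics, assuming hypothetical stand-in names (it is an illustration of the idea, not the actual HadoopRDD code): records decoded from intact blocks are kept, and an IOException (EOFException is a subclass) during a later read marks the input as finished and is logged instead of failing the task.

```scala
import java.io.IOException

// Hypothetical stand-ins for the HadoopRDD internals being discussed; this is
// an illustration of the behavior, not the actual Spark code.
def readRecordsIgnoringCorruption[T](
    readNext: () => Option[T],          // e.g. wraps reader.next(key, value)
    ignoreCorruptFiles: Boolean): Seq[T] = {
  val results = Seq.newBuilder[T]
  var finished = false
  while (!finished) {
    try {
      readNext() match {
        case Some(record) => results += record   // tuples from intact blocks
        case None         => finished = true     // normal end of input
      }
    } catch {
      case e: IOException if ignoreCorruptFiles =>
        // Corrupt or truncated block: log a warning and stop, keeping the
        // records that were already read. Without the flag, the exception
        // propagates and the task fails as before.
        System.err.println(s"Skipping the rest of a corrupt file: ${e.getMessage}")
        finished = true
    }
  }
  results.result()
}
```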
Test build #66776 has finished for PR 15422 at commit
    finished = true
  } else {
    throw e
  }
nit: `case e: IOException if ignoreCorruptFiles =>` would have been more concise.
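For context, the pattern guard folds the flag check into the case clause, so the else/rethrow branch shown in the diff above becomes unnecessary: exceptions that don't match the guard simply propagate. A small sketch with hypothetical stand-in names:

```scala
import java.io.IOException

// Hypothetical helper; only the shape of the catch clause matters here.
def readOnce(read: () => Unit, ignoreCorruptFiles: Boolean): Boolean = {
  var finished = false
  try {
    read()
  } catch {
    // Equivalent to catching IOException and then doing
    // `if (ignoreCorruptFiles) finished = true else throw e`, in one clause.
    case e: IOException if ignoreCorruptFiles =>
      finished = true
  }
  finished
}
```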
| "encountering corrupt files and contents that have been read will still be returned.") | ||
| .booleanConf | ||
| .createWithDefault(false) | ||
|
|
Curious why we are duplicating the parameter in the sql namespace. Won't spark.files.ignoreCorruptFiles do?
A SQL conf will appear in the output of the following command:
`sql("set -v").filter('key contains "files").show(truncate = false)`
Interesting, thanks for clarifying!
    } else {
      throw e
    }
  }
Thanks for changing this too!
| .doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " + | ||
| "encountering corrupt files and contents that have been read will still be returned.") | ||
| .booleanConf | ||
| .createWithDefault(false) |
So either way - whether in NewHadoopRDD or HadoopRDD - we will have a behavioral change.
IMO that is fine, given that we are standardizing the behavior and this was a corner case anyway.
Setting the default to false makes sense.
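For reference, the core-side entry plausibly looks like the following, reconstructed from the fragments in this diff; the enclosing object and the constant name are assumptions for illustration, not the merged code verbatim.

```scala
package org.apache.spark.internal

import org.apache.spark.internal.config.ConfigBuilder

// Sketch reconstructed from the diff fragments above; only the key, doc string,
// and builder calls are taken from the diff.
object IgnoreCorruptFilesConf {
  val IGNORE_CORRUPT_FILES = ConfigBuilder("spark.files.ignoreCorruptFiles")
    .doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " +
      "encountering corrupt files and contents that have been read will still be returned.")
    .booleanConf
    .createWithDefault(false)
}
```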
Merged - had an issue with pip (new laptop, sigh), so the JIRA and PR did not get closed.
I will work on a patch for 1.6.
Test build #66839 has finished for PR 15422 at commit
PR for 1.6: #15454
@zsxwing shouldn't we at least log the exception?
What changes were proposed in this pull request?
Add a flag to ignore corrupt files. For Spark core, the configuration is `spark.files.ignoreCorruptFiles`. For Spark SQL, it's `spark.sql.files.ignoreCorruptFiles`.

How was this patch tested?

The added unit tests.
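For completeness, a usage sketch of the new flags (the config keys come from the PR description; the app name and input path are placeholders): with both flags enabled, a job reading files with truncated or corrupt compressed blocks keeps the records decoded before the corruption instead of failing.

```scala
import org.apache.spark.sql.SparkSession

// The two config keys come from the PR description; the path is a placeholder.
val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-example")
  .master("local[*]")
  .config("spark.files.ignoreCorruptFiles", "true")      // Spark core (HadoopRDD / NewHadoopRDD)
  .config("spark.sql.files.ignoreCorruptFiles", "true")  // Spark SQL file sources
  .getOrCreate()

// Records decoded before a corrupt or truncated block are still returned;
// without the flags, the same read would fail the job.
val count = spark.sparkContext.textFile("/data/possibly-corrupt/*.gz").count()
println(s"Read $count lines.")
```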