-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles #24668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #105639 has finished for PR 24668 at commit
|
|
That most recent CI run has some legitimate looking test failures: The failures appear to be related to Hive CTAS and/or partitioned tables. |
|
It looks like this change is breaking the ability to drop a catalog table whose underlying files don't exist / have been deleted. In catalog.refreshTable(tableName)
catalog.dropTable(tableName, ifExists, purge)Here, To avoid this, I think I can more narrowly-scope this patch's changes to only propagate |
|
Test build #105646 has finished for PR 24668 at commit
|
|
Yea, I was thinking about respecting this option but just decided to follow the existent way to push the fix quick. I agree with it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoshRosen, since it's a behaviour change, can we update SQL migration guide as well (https://github.com/apache/spark/tree/master/docs)? At least I know one customer who will faces this issue right away :D.
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
Outdated
Show resolved
Hide resolved
|
@HyukjinKwon, thanks for the pointer to the migration guide (I couldn't remember if this lived in Do you think this is safe to target for a 2.4.x backport release? Or should we make it 3.x only? I'll tentatively write the migration guide as though it's only targeting 3.x and will update it if we think a backport is okay. I don't feel too strongly about the 2.4.x backport (since I can always internally cherry-pick it for an internal build (where I can be less worried about breaking changes)). |
| } | ||
| } | ||
|
|
||
| test("SPARK-27676: InMemoryFileIndex respects ignoreMissingFiles config for non-root paths") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one config parallelPartitionDiscoveryThreshold can control code path of partition discovery. With the default value, this only tests serial listing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. In the case of a parallel listing, this would cause the listing Spark job to fail with a FileNotFoundException (after maxTaskRetries attempts to list the missing file).
In the interests of complete test coverage, I'll update the test case to exercise the parallel listing path, too.
(Combinatorial test coverage is hard!)
|
Oh, I was thinking we should do it 3.x only .. i thought it might be quite breaking change although this PR adds a configuration to keep previous behaviour. I prefer not to backport but I am fine if you or somebody strongly feels. |
|
I took a stab at writing a migration guide entry, but the resulting entry is a bit subtle (like this bug):
Feedback is very welcome here, especially if you can think of a clearer way to describe this change. |
|
Test build #105655 has finished for PR 24668 at commit
|
|
Test build #105648 has finished for PR 24668 at commit
|
|
Test build #105659 has finished for PR 24668 at commit
|
|
Test build #105657 has finished for PR 24668 at commit
|
…inst case discussed during apache#24672 review)
|
Test build #105667 has finished for PR 24668 at commit
|
|
jenkins retest this please |
|
Test build #105670 has finished for PR 24668 at commit
|
|
jenkins retest this please |
|
retest this please |
|
FYI, there's an interesting discussion over at #24672 (comment) which illustrates some of costs of supporting feature-flagged backwards-compatibility w.r.t. the old behavior: if we switch to using more optimized This is tricky because we're trying to define behavior for race conditions where the end user has expressed that they don't care about missing files ( |
|
Test build #105696 has finished for PR 24668 at commit
|
|
On further reflection, it's not necessarily safe to ignore deletions at the root level because that still leaves us vulnerable to certain races (e.g. if we globStatus on the driver to list the first level, then pass those paths to However, if you actually do delete underlying data on purpose then an explicit Given these existing behaviors, it's somewhat tricky to fix the "root path throws FileNotFoundException" case without breaking existing behaviors. However, consider a workload which never does I'm going to give that a try now. |
|
I can't see a clean way to handle the root path case without substantial risk of breaking existing code, so I've amended the comment to explain this. I think this PR probably represents the best current trade-off between preserving existing behavior and detecting a certain subset of race conditions with high precision (at the cost of poorer recall). |
|
Test build #105744 has finished for PR 24668 at commit
|
|
Okie. @JoshRosen Can we resolve conflicts? I Will take a look and get this in |
|
Test build #106818 has finished for PR 24668 at commit
|
|
retest this please |
|
Test build #106825 has finished for PR 24668 at commit
|
|
Merged to master. |
What changes were proposed in this pull request?
Spark's
InMemoryFileIndexcontains two places whereFileNotFoundexceptions are caught and logged as warnings (during directory listing and block location lookup). This logic was added in #15153 and #21408.I think that this is a dangerous default behavior because it can mask bugs caused by race conditions (e.g. overwriting a table while it's being read) or S3 consistency issues (there's more discussion on this in the JIRA ticket). Failing fast when we detect missing files is not sufficient to make concurrent table reads/writes or S3 listing safe (there are other classes of eventual consistency issues to worry about), but I think it's still beneficial to throw exceptions and fail-fast on the subset of inconsistencies / races that we can detect because that increases the likelihood that an end user will notice the problem and investigate further.
There may be some cases where users do want to ignore missing files, but I think that should be an opt-in behavior via the existing
spark.sql.files.ignoreMissingFilesflag (the current behavior is itself race-prone because a file might be be deleted between catalog listing and query execution time, triggering FileNotFoundExceptions on executors (which are handled in a way that does respectignoreMissingFIles)).This PR updates
InMemoryFileIndexto guard the log-and-ignore-FileNotFoundException behind the existingspark.sql.files.ignoreMissingFilesflag.Note: this is a change of default behavior, so I think it needs to be mentioned in release notes.
How was this patch tested?
New unit tests to simulate file-deletion race conditions, tested with both values of the
ignoreMissingFIlesflag.