[SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles #24668

JoshRosen · 2019-05-21T20:40:44Z

What changes were proposed in this pull request?

Spark's InMemoryFileIndex contains two places where FileNotFound exceptions are caught and logged as warnings (during directory listing and block location lookup). This logic was added in #15153 and #21408.

I think that this is a dangerous default behavior because it can mask bugs caused by race conditions (e.g. overwriting a table while it's being read) or S3 consistency issues (there's more discussion on this in the JIRA ticket). Failing fast when we detect missing files is not sufficient to make concurrent table reads/writes or S3 listing safe (there are other classes of eventual consistency issues to worry about), but I think it's still beneficial to throw exceptions and fail-fast on the subset of inconsistencies / races that we can detect because that increases the likelihood that an end user will notice the problem and investigate further.

There may be some cases where users do want to ignore missing files, but I think that should be an opt-in behavior via the existing spark.sql.files.ignoreMissingFiles flag. The current behavior is itself race-prone because a file might be deleted between catalog listing and query execution time, triggering FileNotFoundExceptions on executors (which are handled in a way that does respect ignoreMissingFiles).

This PR updates InMemoryFileIndex to guard the log-and-ignore-FileNotFoundException behind the existing spark.sql.files.ignoreMissingFiles flag.
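As a rough illustration of the guard described above (not the actual InMemoryFileIndex code), the idea can be sketched in plain Scala. The `IgnoreMissingFilesSketch` object, the `Lister` type, and `listChildren` are hypothetical names invented for this sketch; the only assumption carried over from the PR is that a FileNotFoundException is swallowed only when the flag is true.

```scala
import java.io.FileNotFoundException

object IgnoreMissingFilesSketch {
  // Hypothetical lister: returns the children of a directory,
  // or throws FileNotFoundException if the directory has vanished.
  type Lister = String => Seq[String]

  def listChildren(dir: String, lister: Lister, ignoreMissingFiles: Boolean): Seq[String] =
    try lister(dir)
    catch {
      // Log-and-ignore only when the user has opted in; otherwise the
      // guard does not match and the exception propagates to the caller.
      case _: FileNotFoundException if ignoreMissingFiles =>
        Console.err.println(s"The directory $dir was not found. Was it deleted very recently?")
        Seq.empty
    }
}
```

With the flag left at false, the pattern guard fails and the listing fails fast, which is the new default this PR proposes.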

Note: this is a change of default behavior, so I think it needs to be mentioned in release notes.

How was this patch tested?

New unit tests to simulate file-deletion race conditions, tested with both values of the ignoreMissingFiles flag.


SparkQA · 2019-05-21T22:06:05Z

Test build #105639 has finished for PR 24668 at commit a31c08a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

joshrosen-stripe · 2019-05-21T22:47:04Z

That most recent CI run has some legitimate looking test failures:

org.apache.spark.sql.hive.PartitionProviderCompatibilitySuite
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite
org.apache.spark.sql.hive.HiveMetadataCacheSuite

The failures appear to be related to Hive CTAS and/or partitioned tables.

JoshRosen · 2019-05-21T23:26:23Z

It looks like this change is breaking the ability to drop a catalog table whose underlying files don't exist / have been deleted. In DropTableCommand we have

catalog.refreshTable(tableName)
catalog.dropTable(tableName, ifExists, purge)

Here, refreshTable() both clears old caches (file listing, cached tables, views) and repopulates some of them (re-listing), and that re-listing is what triggers the error.

To avoid this, I think I can more narrowly-scope this patch's changes to only propagate FileNotFoundException if it occurs for non-root-path listings.
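A minimal sketch of that narrower scoping, assuming a toy in-memory file system rather than Hadoop's FileSystem API: the `RootPathSketch` object, the `fs` map, and the trailing-slash convention for directories are all hypothetical and exist only for illustration. A missing root path is always tolerated (so dropping a table whose files are gone still works), while a missing nested directory is tolerated only when the flag is set.

```scala
import java.io.FileNotFoundException

object RootPathSketch {
  // Hypothetical in-memory file system: directory -> children.
  // Names ending in "/" are treated as directories.
  def listLeafFiles(
      path: String,
      fs: Map[String, Seq[String]],
      ignoreMissingFiles: Boolean,
      isRootPath: Boolean): Seq[String] = {
    val children =
      try fs.getOrElse(path, throw new FileNotFoundException(path))
      catch {
        // Missing root paths are always ignored; missing nested
        // directories are ignored only when the flag is set.
        case _: FileNotFoundException if isRootPath || ignoreMissingFiles =>
          Seq.empty[String]
      }
    children.flatMap { child =>
      if (child.endsWith("/")) listLeafFiles(child, fs, ignoreMissingFiles, isRootPath = false)
      else Seq(child)
    }
  }
}
```

Here a directory that appears in its parent's listing but has no entry of its own in `fs` stands in for a subdirectory deleted mid-listing.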

SparkQA · 2019-05-22T00:34:19Z

Test build #105646 has finished for PR 24668 at commit 69f8db6.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-05-22T01:52:46Z

Yea, I was thinking about respecting this option but just decided to follow the existing approach to push the fix quickly. I agree with it.

HyukjinKwon

@JoshRosen, since it's a behaviour change, can we update the SQL migration guide as well (https://github.com/apache/spark/tree/master/docs)? At least I know one customer who will face this issue right away :D.

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

JoshRosen · 2019-05-22T02:15:08Z

@HyukjinKwon, thanks for the pointer to the migration guide (I couldn't remember if this lived in spark-website or in this repo).

Do you think this is safe to target for a 2.4.x backport release? Or should we make it 3.x only?

I'll tentatively write the migration guide as though it's only targeting 3.x and will update it if we think a backport is okay. I don't feel too strongly about the 2.4.x backport (since I can always internally cherry-pick it for an internal build (where I can be less worried about breaking changes)).

viirya · 2019-05-22T02:26:02Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

    }
  }

+  test("SPARK-27676: InMemoryFileIndex respects ignoreMissingFiles config for non-root paths") {


There is a config, parallelPartitionDiscoveryThreshold, that controls which code path partition discovery takes. With the default value, does this only test serial listing?

Good point. In the case of a parallel listing, this would cause the listing Spark job to fail with a FileNotFoundException (after maxTaskRetries attempts to list the missing file).

In the interests of complete test coverage, I'll update the test case to exercise the parallel listing path, too.

(Combinatorial test coverage is hard!)

HyukjinKwon · 2019-05-22T02:26:19Z

Oh, I was thinking we should do it in 3.x only. I thought it might be quite a breaking change, although this PR adds a configuration to keep the previous behaviour. I prefer not to backport, but I am fine with it if you or somebody else feels strongly.

JoshRosen · 2019-05-22T03:05:08Z

I took a stab at writing a migration guide entry, but the resulting entry is a bit subtle (like this bug):

Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless spark.sql.files.ignoreMissingFiles is true (default false). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during REFRESH TABLE), not during query execution: the net change is that spark.sql.files.ignoreMissingFiles is now obeyed during table file listing / query planning, not only at query execution time.

Feedback is very welcome here, especially if you can think of a clearer way to describe this change.
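For readers who want the previous lenient behavior, a hedged sketch of opting in might look like the following. It assumes Spark SQL is on the classpath; the object name, `local[1]` master, and app name are placeholders chosen for this example, while spark.sql.files.ignoreMissingFiles is the flag discussed in this PR.

```scala
import org.apache.spark.sql.SparkSession

object RestoreLenientListing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")                 // placeholder master for a local run
      .appName("ignore-missing-files-sketch")
      // Opt in to ignoring files that disappear during listing or execution.
      .config("spark.sql.files.ignoreMissingFiles", "true")
      .getOrCreate()

    // Print the effective setting to confirm the session picked it up.
    println(spark.conf.get("spark.sql.files.ignoreMissingFiles"))
    spark.stop()
  }
}
```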

SparkQA · 2019-05-22T03:55:28Z

Test build #105655 has finished for PR 24668 at commit 0c1eba3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-22T05:00:25Z

Test build #105648 has finished for PR 24668 at commit 24ad834.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-22T06:13:22Z

Test build #105659 has finished for PR 24668 at commit 86c3a9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-22T06:14:01Z

Test build #105657 has finished for PR 24668 at commit 88dc6b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.


SparkQA · 2019-05-22T07:05:01Z

Test build #105667 has finished for PR 24668 at commit 42e8b98.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-05-22T07:21:40Z

jenkins retest this please

SparkQA · 2019-05-22T08:45:50Z

Test build #105670 has finished for PR 24668 at commit 42e8b98.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-05-22T14:31:09Z

jenkins retest this please

HyukjinKwon · 2019-05-22T14:48:44Z

retest this please

JoshRosen · 2019-05-22T14:57:51Z

FYI, there's an interesting discussion over at #24672 (comment) which illustrates some of the costs of supporting feature-flagged backwards-compatibility w.r.t. the old behavior: if we switch to using more optimized FileSystem listing APIs (e.g. listLocatedStatus()) then, in some implementations only, we become vulnerable to a problem where a single leaf file going missing (e.g. a single file in a directory is deleted after it appears in a directory listing but before its block locations are fetched) can lead to all of that file's siblings being dropped / ignored as missing (even though only one file was deleted / failed its block location lookup call).

This is tricky because we're trying to define behavior for race conditions where the end user has expressed that they don't care about missing files (ignoreMissingFiles = true).
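The sibling-loss problem can be illustrated with a small simulation. This is a sketch under stated assumptions, not Hadoop or Spark code: `SiblingLossSketch`, `locate`, and the fake `@host1` location string are hypothetical. It contrasts catching the exception per file with catching it around a whole-directory operation, which is roughly what happens when a batched listing API surfaces a single failure for the entire directory.

```scala
import java.io.FileNotFoundException

object SiblingLossSketch {
  // Hypothetical block-location lookup; fails for files that were
  // deleted after they appeared in the directory listing.
  def locate(file: String, deleted: Set[String]): String =
    if (deleted(file)) throw new FileNotFoundException(file) else s"$file@host1"

  // Per-file handling: only the missing file is skipped.
  def perFile(files: Seq[String], deleted: Set[String]): Seq[String] =
    files.flatMap { f =>
      try Some(locate(f, deleted))
      catch { case _: FileNotFoundException => None }
    }

  // Whole-directory handling: one failure discards every sibling too.
  def wholeDirectory(files: Seq[String], deleted: Set[String]): Seq[String] =
    try files.map(locate(_, deleted))
    catch { case _: FileNotFoundException => Seq.empty }
}
```

With files a, b, and c and only b deleted, the per-file variant keeps a and c, whereas the whole-directory variant returns nothing, which is the data-loss risk described in the comment above.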

SparkQA · 2019-05-22T17:51:15Z

Test build #105696 has finished for PR 24668 at commit 42e8b98.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-05-24T01:31:28Z

On further reflection, it's not necessarily safe to ignore deletions at the root level because that still leaves us vulnerable to certain races (e.g. if we globStatus on the driver to list the first level, then pass those paths to InMemoryFileIndex, then delete one of the paths before InMemoryFileIndex begins its listing then we might miss data).

However, if you actually do delete underlying data on purpose then an explicit REFRESH TABLE is supposed to allow you to query the remaining data. This can create some interesting behavior inconsistencies: for example, attempting to initially create a table from a non-existent root path fails loudly with a "path not found" exception, but if you create a table from an existent root path, delete the path, and REFRESH TABLE then you'll have an empty table.

Given these existing behaviors, it's somewhat tricky to fix the "root path throws FileNotFoundException" case without breaking existing behaviors.

However, consider a workload which never does REFRESH TABLE: presumably every one of the rootPaths existed when the InMemoryFileIndex was initially constructed, so it should be fine to fail-fast during initial construction for non-existent paths but then ignore non-existence at the root during refresh!

I'm going to give that a try now.


JoshRosen · 2019-05-24T02:35:55Z

I can't see a clean way to handle the root path case without substantial risk of breaking existing code, so I've amended the comment to explain this.

I think this PR probably represents the best current trade-off between preserving existing behavior and detecting a certain subset of race conditions with high precision (at the cost of poorer recall).

SparkQA · 2019-05-24T05:51:51Z

Test build #105744 has finished for PR 24668 at commit 58e9544.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

docs/sql-migration-guide-upgrade.md

HyukjinKwon · 2019-06-19T23:55:15Z

Okie. @JoshRosen, can we resolve conflicts? I will take a look and get this in.

SparkQA · 2019-06-24T07:05:01Z

Test build #106818 has finished for PR 24668 at commit d9c5903.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-06-24T07:32:04Z

retest this please

SparkQA · 2019-06-24T10:35:47Z

Test build #106825 has finished for PR 24668 at commit d9c5903.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-06-26T00:11:10Z

Merged to master.

joshrosen-stripe and others added 2 commits May 20, 2019 19:31

Only ignore FileNotFoundException when spark.sql.files.ignoreMissingFiles=true (default false)

05f9228

Update test cases to reflect behavior change

a31c08a

Only non-root deletions should respect flag.

69f8db6

Remove debug code

24ad834

HyukjinKwon reviewed May 22, 2019


HyukjinKwon changed the title ~~[SPARK-27676][SQL] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles~~ [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles May 22, 2019

HyukjinKwon reviewed May 22, 2019


sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

Fix indentation

400a02b

viirya reviewed May 22, 2019


JoshRosen added 3 commits May 21, 2019 19:30

Add note to migration guide

0c1eba3

Test with parallel partition discovery

88dc6b6

Clarify migration guide comment

86c3a9d

JoshRosen mentioned this pull request May 22, 2019

[SPARK-27801][SQL] Improve performance of InMemoryFileIndex.listLeafFiles for HDFS directories with many files #24672

Closed

Strengthen test assertions further (to fix bug in tests and guard against case discussed during apache#24672 review)

42e8b98

JoshRosen added 3 commits May 23, 2019 19:17

Work in progress towards fixing races for root file deletion

2a6240b

Revert "Work in progress towards fixing races for root file deletion"

97bac91

This reverts commit 2a6240b.

Update comment to clarify exception for root paths

58e9544

HyukjinKwon reviewed Jun 14, 2019


docs/sql-migration-guide-upgrade.md

dongjoon-hyun added the SQL label Jun 14, 2019

Merge remote-tracking branch 'origin/master' into SPARK-27676

d9c5903

HyukjinKwon approved these changes Jun 24, 2019


HyukjinKwon closed this in d83f84a Jun 26, 2019

Deegue mentioned this pull request Mar 26, 2020

[SPARK-29786][SQL] Fix MetaException when dropping a partition not exists on HDFS #26422

Closed
