
@JoshRosen
Contributor

@JoshRosen JoshRosen commented May 21, 2019

What changes were proposed in this pull request?

Spark's InMemoryFileIndex contains two places where FileNotFoundExceptions are caught and logged as warnings (during directory listing and block location lookup). This logic was added in #15153 and #21408.

I think that this is a dangerous default behavior because it can mask bugs caused by race conditions (e.g. overwriting a table while it's being read) or S3 consistency issues (there's more discussion on this in the JIRA ticket). Failing fast when we detect missing files is not sufficient to make concurrent table reads/writes or S3 listing safe (there are other classes of eventual consistency issues to worry about), but I think it's still beneficial to throw exceptions and fail fast on the subset of inconsistencies / races that we can detect, because that increases the likelihood that an end user will notice the problem and investigate further.

There may be some cases where users do want to ignore missing files, but I think that should be an opt-in behavior via the existing spark.sql.files.ignoreMissingFiles flag. The current behavior is itself race-prone: a file might be deleted between catalog listing and query execution time, triggering FileNotFoundExceptions on executors, which are handled in a way that does respect ignoreMissingFiles.

This PR updates InMemoryFileIndex to guard the log-and-ignore-FileNotFoundException behind the existing spark.sql.files.ignoreMissingFiles flag.
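
A minimal sketch of the guard described above, written in plain Scala so that it runs without Spark. The `Lister` type, `listOrSkip`, and `racyLister` are hypothetical names used only for illustration; this is not the actual InMemoryFileIndex code, just the general shape of catching FileNotFoundException only when the flag is enabled.

```scala
import java.io.FileNotFoundException

object GuardedListingSketch {
  // Hypothetical stand-in for a file system listing call; not a Spark or Hadoop API.
  type Lister = String => Seq[String]

  /** Lists `path`; a missing path is skipped when the flag is on and rethrown when it is off. */
  def listOrSkip(path: String, list: Lister, ignoreMissingFiles: Boolean): Seq[String] =
    try list(path)
    catch {
      case e: FileNotFoundException if ignoreMissingFiles =>
        // Opt-in behavior: warn and treat the path as empty.
        Console.err.println(s"WARN: $path was not found, skipping: ${e.getMessage}")
        Seq.empty
      // With the flag off, the guard above does not match and the exception propagates.
    }

  // Simulates a directory that was deleted between discovery and listing.
  val racyLister: Lister = {
    case "/table/part=1" => Seq("/table/part=1/a.parquet")
    case other => throw new FileNotFoundException(other)
  }
}
```

With `ignoreMissingFiles = false`, calling `listOrSkip("/table/part=2", racyLister, ...)` throws instead of silently returning an empty listing, which is the fail-fast default this PR argues for.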

Note: this is a change of default behavior, so I think it needs to be mentioned in release notes.

How was this patch tested?

New unit tests to simulate file-deletion race conditions, tested with both values of the ignoreMissingFiles flag.

@SparkQA

SparkQA commented May 21, 2019

Test build #105639 has finished for PR 24668 at commit a31c08a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@joshrosen-stripe
Contributor

That most recent CI run has some legitimate-looking test failures:

org.apache.spark.sql.hive.PartitionProviderCompatibilitySuite
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite
org.apache.spark.sql.hive.HiveMetadataCacheSuite

The failures appear to be related to Hive CTAS and/or partitioned tables.

@JoshRosen
Contributor Author

It looks like this change is breaking the ability to drop a catalog table whose underlying files don't exist / have been deleted. In DropTableCommand we have

catalog.refreshTable(tableName)
catalog.dropTable(tableName, ifExists, purge)

Here, refreshTable() both clears old caches (file listing, cached tables, views) and repopulates some of them (re-listing), and that re-listing is what triggers the error.

To avoid this, I think I can more narrowly-scope this patch's changes to only propagate FileNotFoundException if it occurs for non-root-path listings.
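
A rough sketch of that narrower scoping, assuming a toy in-memory file system in place of Hadoop's FileSystem. `ToyFs`, `listLeafFiles`, and the `isRootPath` parameter are hypothetical names for illustration only: a missing root path is always tolerated (so dropping or refreshing a table whose files are gone still works), while a missing non-root path is tolerated only when the flag is set.

```scala
import java.io.FileNotFoundException

object RootAwareListingSketch {
  // Toy file system: directory -> children. Names ending in "/" are directories.
  final class ToyFs(dirs: Map[String, Seq[String]]) {
    def list(dir: String): Seq[String] =
      dirs.getOrElse(dir, throw new FileNotFoundException(dir))
  }

  def listLeafFiles(
      fs: ToyFs,
      dir: String,
      isRootPath: Boolean,
      ignoreMissingFiles: Boolean): Seq[String] = {
    val children =
      try fs.list(dir)
      catch {
        // Missing roots are always treated as empty; missing subdirectories
        // are treated as empty only when the user opted in via the flag.
        case _: FileNotFoundException if isRootPath || ignoreMissingFiles =>
          Seq.empty[String]
      }
    children.flatMap { child =>
      if (child.endsWith("/")) listLeafFiles(fs, child, isRootPath = false, ignoreMissingFiles)
      else Seq(child)
    }
  }
}
```

In this sketch a subdirectory that shows up in its parent's listing but is gone by the time it is listed makes the whole listing fail unless `ignoreMissingFiles` is true.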

@SparkQA

SparkQA commented May 22, 2019

Test build #105646 has finished for PR 24668 at commit 69f8db6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Yea, I was thinking about respecting this option but just decided to follow the existing approach to push the fix quickly. I agree with it.

Member

@HyukjinKwon HyukjinKwon left a comment

@JoshRosen, since it's a behaviour change, can we update the SQL migration guide as well (https://github.com/apache/spark/tree/master/docs)? At least I know one customer who will face this issue right away :D.

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-27676][SQL] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles" to "[SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles" on May 22, 2019
@JoshRosen
Contributor Author

@HyukjinKwon, thanks for the pointer to the migration guide (I couldn't remember if this lived in spark-website or in this repo).

Do you think this is safe to target for a 2.4.x backport release? Or should we make it 3.x only?

I'll tentatively write the migration guide as though it's only targeting 3.x and will update it if we think a backport is okay. I don't feel too strongly about the 2.4.x backport, since I can always cherry-pick it into an internal build, where I can be less worried about breaking changes.

}
}

test("SPARK-27676: InMemoryFileIndex respects ignoreMissingFiles config for non-root paths") {
Member

There is one config, parallelPartitionDiscoveryThreshold, that controls which code path partition discovery takes. With the default value, does this only test serial listing?

Contributor Author

Good point. In the case of a parallel listing, this would cause the listing Spark job to fail with a FileNotFoundException (after maxTaskRetries attempts to list the missing file).

In the interests of complete test coverage, I'll update the test case to exercise the parallel listing path, too.

(Combinatorial test coverage is hard!)
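
One way to picture the resulting test matrix is the plain-Scala sketch below. `Scenario` and its members are hypothetical names, and the threshold values are an assumption (a threshold of 0 is taken to force the parallel path and a very large one the serial path); the real test may set things up differently.

```scala
object ListingTestMatrixSketch {
  final case class Scenario(ignoreMissingFiles: Boolean, parallelListing: Boolean) {
    // Assumption: 0 forces parallel listing, a very large value forces serial listing.
    def discoveryThreshold: Int = if (parallelListing) 0 else 10000

    // With the flag off, a file that disappears mid-listing should fail the
    // listing on either code path; with the flag on, it should be skipped.
    def expectListingFailure: Boolean = !ignoreMissingFiles
  }

  // Cross product of both flags: four scenarios in total.
  val scenarios: Seq[Scenario] =
    for {
      ignore <- Seq(true, false)
      parallel <- Seq(true, false)
    } yield Scenario(ignore, parallel)
}
```

Each scenario would then run the same simulated file-deletion race and check whether the listing fails or succeeds as expected.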

@HyukjinKwon
Member

Oh, I was thinking we should do it for 3.x only. I thought it might be quite a breaking change, although this PR adds a configuration to keep the previous behaviour. I prefer not to backport, but I am fine with it if you or somebody else feels strongly.

@JoshRosen
Contributor Author

I took a stab at writing a migration guide entry, but the resulting entry is a bit subtle (like this bug):

  • Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless spark.sql.files.ignoreMissingFiles is true (default false). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during REFRESH TABLE), not during query execution: the net change is that spark.sql.files.ignoreMissingFiles is now obeyed during table file listing / query planning, not only at query execution time.

Feedback is very welcome here, especially if you can think of a clearer way to describe this change.
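
For readers who want the previous lenient behavior, the entry above implies opting in through the existing flag. A hedged sketch of what that could look like from user code follows; it requires Spark on the classpath, and the local master and application name are illustrative assumptions rather than anything prescribed by this PR.

```scala
import org.apache.spark.sql.SparkSession

object IgnoreMissingFilesOptIn {
  def main(args: Array[String]): Unit = {
    // Illustrative local session; a real application would use its own settings.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("ignore-missing-files-opt-in")
      .getOrCreate()

    // Opt in to skipping files that disappear during listing or execution.
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
    println(spark.conf.get("spark.sql.files.ignoreMissingFiles"))

    spark.stop()
  }
}
```

Leaving the flag at its default of false keeps the fail-fast behavior described in the migration guide entry.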

@SparkQA

SparkQA commented May 22, 2019

Test build #105655 has finished for PR 24668 at commit 0c1eba3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2019

Test build #105648 has finished for PR 24668 at commit 24ad834.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2019

Test build #105659 has finished for PR 24668 at commit 86c3a9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2019

Test build #105657 has finished for PR 24668 at commit 88dc6b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2019

Test build #105667 has finished for PR 24668 at commit 42e8b98.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented May 22, 2019

Test build #105670 has finished for PR 24668 at commit 42e8b98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

jenkins retest this please

@HyukjinKwon
Member

retest this please

@JoshRosen
Contributor Author

FYI, there's an interesting discussion over at #24672 (comment) which illustrates some of the costs of supporting feature-flagged backwards compatibility with the old behavior. If we switch to more optimized FileSystem listing APIs (e.g. listLocatedStatus()), then in some implementations we become vulnerable to a problem where a single leaf file going missing (e.g. a file is deleted after it appears in a directory listing but before its block locations are fetched) can lead to all of that file's siblings being dropped / ignored as missing, even though only one file was deleted or failed its block location lookup call.

This is tricky because we're trying to define behavior for race conditions where the end user has expressed that they don't care about missing files (ignoreMissingFiles = true).
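
The sibling-dropping problem can be illustrated with a small plain-Scala simulation. `locate`, `perFile`, and `perDirectory` are hypothetical helpers, not Hadoop or Spark APIs: the first variant handles the exception per file, while the second models a batched API where one failure surfaces for the whole directory.

```scala
import java.io.FileNotFoundException

object SiblingDropSketch {
  // Hypothetical block-location lookup; fails for files deleted after listing.
  def locate(file: String, deleted: Set[String]): String =
    if (deleted(file)) throw new FileNotFoundException(file) else s"$file@host1"

  // Per-file handling: only the missing file is skipped.
  def perFile(files: Seq[String], deleted: Set[String]): Seq[String] =
    files.flatMap { f =>
      try Some(locate(f, deleted))
      catch { case _: FileNotFoundException => None }
    }

  // Per-directory handling: one failure discards every sibling's result too.
  def perDirectory(files: Seq[String], deleted: Set[String]): Seq[String] =
    try files.map(locate(_, deleted))
    catch { case _: FileNotFoundException => Seq.empty }
}
```

With three files and one of them deleted, `perFile` still returns the two surviving files, whereas `perDirectory` returns nothing, which is the data-loss risk described above when ignoreMissingFiles is true.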

@SparkQA

SparkQA commented May 22, 2019

Test build #105696 has finished for PR 24668 at commit 42e8b98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

JoshRosen commented May 24, 2019

On further reflection, it's not necessarily safe to ignore deletions at the root level because that still leaves us vulnerable to certain races (e.g. if we globStatus on the driver to list the first level, then pass those paths to InMemoryFileIndex, then delete one of the paths before InMemoryFileIndex begins its listing then we might miss data).

However, if you actually do delete underlying data on purpose then an explicit REFRESH TABLE is supposed to allow you to query the remaining data. This can create some interesting behavior inconsistencies: for example, attempting to initially create a table from a non-existent root path fails loudly with a "path not found" exception, but if you create a table from an existent root path, delete the path, and REFRESH TABLE then you'll have an empty table.

Given these existing behaviors, it's somewhat tricky to fix the "root path throws FileNotFoundException" case without breaking existing behaviors.

However, consider a workload which never does REFRESH TABLE: presumably every one of the rootPaths existed when the InMemoryFileIndex was initially constructed, so it should be fine to fail-fast during initial construction for non-existent paths but then ignore non-existence at the root during refresh!

I'm going to give that a try now.

@JoshRosen
Contributor Author

I can't see a clean way to handle the root path case without substantial risk of breaking existing code, so I've amended the comment to explain this.

I think this PR probably represents the best current trade-off between preserving existing behavior and detecting a certain subset of race conditions with high precision (at the cost of poorer recall).

@SparkQA

SparkQA commented May 24, 2019

Test build #105744 has finished for PR 24668 at commit 58e9544.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Okie. @JoshRosen, can we resolve the conflicts? I will take a look and get this in.

@SparkQA

SparkQA commented Jun 24, 2019

Test build #106818 has finished for PR 24668 at commit d9c5903.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jun 24, 2019

Test build #106825 has finished for PR 24668 at commit d9c5903.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.
