
Conversation

@JoshRosen
Contributor

What changes were proposed in this pull request?

In order to respond to task cancellation, Spark tasks must periodically check `TaskContext.isInterrupted()`, but this check is missing on a few critical read paths used in Spark SQL, including `FileScanRDD`, `JDBCRDD`, and UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to continue running and become zombies (as also described in #16189).

This patch fixes the problem by adding `TaskContext.isInterrupted()` checks to these paths. Note that I could have used `InterruptibleIterator` to simply wrap a bunch of iterators, but in some cases this would impose a performance penalty or might not be effective due to certain special uses of Iterators in Spark SQL. Instead, I inlined `InterruptibleIterator`-style logic into the existing iterator subclasses.
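A minimal sketch of that inlined pattern (the class name is hypothetical; `TaskContext`, `TaskContext.get()`, and `TaskKilledException` are real Spark APIs, the rest is illustrative, not the exact diff):

import org.apache.spark.{TaskContext, TaskKilledException}

// Illustrative iterator subclass with the cancellation check inlined into
// hasNext(), rather than added by wrapping in an InterruptibleIterator.
class CancellationAwareIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  // TaskContext.get() is thread-local and returns null on the driver.
  private val taskContext: TaskContext = TaskContext.get()

  override def hasNext: Boolean = {
    // Fail fast once the task has been marked as killed, instead of letting
    // the scan run to completion as a zombie.
    if (taskContext != null && taskContext.isInterrupted()) {
      throw new TaskKilledException
    }
    underlying.hasNext
  }

  override def next(): T = underlying.next()
}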

How was this patch tested?

Tested manually in `spark-shell` with two different reproductions of non-cancellable tasks: one involving scans of huge files and another involving sort-merge joins that spill to disk. Both causes of zombie tasks are fixed by the changes added here.
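The exact reproductions aren't included in the PR; a hypothetical `spark-shell` session along these lines exercises the file-scan case (the path and job-group name are made up):

// Hypothetical repro: run a long scan under a job group in another thread,
// then cancel the group. Without the fix the task can keep scanning as a
// zombie; with it, the task hits the isInterrupted() check and dies promptly.
val scanThread = new Thread(new Runnable {
  override def run(): Unit = {
    // Job groups are thread-local, so set the group in the job's own thread.
    sc.setJobGroup("cancel-test", "zombie task repro", interruptOnCancel = true)
    spark.read.text("/data/very-large-file.txt").count()  // made-up path
  }
})
scanThread.start()
Thread.sleep(5000)                 // let the scan get going
sc.cancelJobGroup("cancel-test")   // tasks should now observe the kill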


- CompletionIterator[InternalRow, Iterator[InternalRow]](rowsIterator, close())
+ CompletionIterator[InternalRow, Iterator[InternalRow]](
+   new InterruptibleIterator(context, rowsIterator), close())
Contributor Author

I suppose I could also have added the check into `resultSetToSparkInternalRows`, but that function is exposed for use outside of Spark internals. Also, I think that `JDBCRDD` is going to be slow enough that the performance impact here shouldn't be noticeable.
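For reference, the wrapper added in this hunk is roughly this thin (a paraphrase of Spark core's `InterruptibleIterator`, not the verbatim source); the per-element cost is one volatile read plus a virtual call, which is noise next to a JDBC fetch:

import org.apache.spark.{TaskContext, TaskKilledException}

// Paraphrased sketch of InterruptibleIterator's behavior.
class InterruptibleIterator[+T](context: TaskContext, delegate: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean = {
    // Checks a volatile kill flag before delegating.
    if (context.isInterrupted()) throw new TaskKilledException
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}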

 private byte[] arr = new byte[1024 * 1024];
 private Object baseObject = arr;
 private final long baseOffset = Platform.BYTE_ARRAY_OFFSET;
+private final TaskContext taskContext = TaskContext.get();
Contributor Author

I didn't want to change the constructor signature, hence this `TaskContext.get()` field-initializer pattern.

// Kill the task in case it has been marked as killed (inlined from
// InterruptibleIterator to avoid wrapping overhead). This check is added here in
// `loadNext()` instead of in `hasNext()` because it's technically possible for the
// caller to be relying on `getNumRecords()` instead of `hasNext()` to know when to stop.
if (taskContext != null && taskContext.isInterrupted()) {
  throw new TaskKilledException();
}
Contributor Author

TaskContext can be null when this code is used on the driver, outside of the context of a specific task.

@JoshRosen
Contributor Author

For a bit of context on the UnsafeSorter changes, note that there are currently five implementations of `UnsafeSorterIterator`, but three of those are just chaining / merging wrappers around the two implementations modified here.
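Schematically (hypothetical names, not the real class hierarchy), the wrappers inherit the behavior for free because their calls bottom out in one of the two patched base iterators:

import org.apache.spark.{TaskContext, TaskKilledException}

abstract class SorterIter { def loadNext(): Unit }

// Stands in for one of the two patched base implementations.
class PatchedBaseIter(taskContext: TaskContext) extends SorterIter {
  override def loadNext(): Unit = {
    if (taskContext != null && taskContext.isInterrupted()) {
      throw new TaskKilledException
    }
    // ... advance to the next record ...
  }
}

// Stands in for a chaining/merging wrapper: it has no check of its own, but
// every loadNext() call reaches a patched base iterator, so a killed task
// still fails fast.
class MergingWrapper(inputs: Seq[SorterIter]) extends SorterIter {
  override def loadNext(): Unit = inputs.head.loadNext()
}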

@hvanhovell
Contributor

LGTM - pending Jenkins.

@SparkQA

SparkQA commented Dec 20, 2016

Test build #70378 has finished for PR 16340 at commit 236efe5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

hvanhovell commented Dec 20, 2016

Merging to master/2.1. Thanks!

asfgit pushed a commit that referenced this pull request Dec 20, 2016
…DD & UnsafeSorter

Author: Josh Rosen <[email protected]>

Closes #16340 from JoshRosen/sql-task-interruption.

(cherry picked from commit 5857b9a)
Signed-off-by: Herman van Hovell <[email protected]>
@asfgit closed this in 5857b9a Dec 20, 2016
@JoshRosen deleted the sql-task-interruption branch December 20, 2016 19:47
JoshRosen added a commit to JoshRosen/spark that referenced this pull request Dec 20, 2016
…DD & UnsafeSorter

Author: Josh Rosen <[email protected]>

Closes apache#16340 from JoshRosen/sql-task-interruption.
asfgit pushed a commit that referenced this pull request Dec 21, 2016
…anRDD, JDBCRDD & UnsafeSorter

This is a branch-2.0 backport of #16340.

Author: Josh Rosen <[email protected]>

Closes #16357 from JoshRosen/sql-task-interruption-branch-2.0.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…DD & UnsafeSorter

Author: Josh Rosen <[email protected]>

Closes apache#16340 from JoshRosen/sql-task-interruption.