[SPARK-14131][SQL]Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite #11940

zsxwing · 2016-03-24T17:51:24Z

What changes were proposed in this pull request?

There is a potential deadlock in Hadoop's Shell.runCommand before 2.5.0 (HADOOP-10622, https://issues.apache.org/jira/browse/HADOOP-10622). If we interrupt a thread that is running Shell.runCommand, we may hit this issue.

This PR adds some protection to prevent the microBatchThread from being interrupted while it may be running Shell.runCommand. There are currently two places that call Shell.runCommand:

offsetLog.add
FileStreamSource.getOffset

Both create a file using the HDFS API and call Shell.runCommand to set the file permission.
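
The protection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the class name `DeferredInterruptRunner` and its `interrupt()` method are hypothetical, while `runUninterruptibly` and `shouldInterruptThread` borrow names that come up later in the review. The idea is that an interrupt arriving during the protected section is only recorded, and is delivered to the thread after the section has finished.

```scala
import java.util.concurrent.CountDownLatch

/** Hypothetical sketch: owns a thread and defers interrupts during protected sections. */
class DeferredInterruptRunner(body: DeferredInterruptRunner => Unit) {
  /** Monitor guarding `uninterruptible` and `shouldInterruptThread`. */
  private val lock = new Object
  private var uninterruptible = false
  private var shouldInterruptThread = false

  val thread: Thread = new Thread(new Runnable {
    override def run(): Unit = body(DeferredInterruptRunner.this)
  })

  /** Runs `f` on the owned thread; interrupts are held back until `f` returns. */
  def runUninterruptibly[T](f: => T): T = {
    require(Thread.currentThread() eq thread, "must be called from the owned thread")
    lock.synchronized {
      uninterruptible = true
      // Clear a pending interrupt, if any, and remember to re-deliver it later.
      if (Thread.interrupted()) shouldInterruptThread = true
    }
    try f finally {
      lock.synchronized {
        uninterruptible = false
        if (shouldInterruptThread) {
          // Deliver the deferred interrupt now that `f` is done.
          thread.interrupt()
          shouldInterruptThread = false
        }
      }
    }
  }

  /** Used by other threads in place of `thread.interrupt()`. */
  def interrupt(): Unit = lock.synchronized {
    if (uninterruptible) shouldInterruptThread = true
    else thread.interrupt()
  }
}
```

In this sketch, code that wants to stop the thread has to go through the wrapper's `interrupt()`; calling `Thread.interrupt()` on the thread directly would bypass the deferral. Nested calls to `runUninterruptibly` are not handled.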

How was this patch tested?

Existing unit tests.


zsxwing · 2016-03-24T17:54:40Z

dev/run-tests


-exec python -u ./dev/run-tests.py "$@"
+set -e
+for i in `seq 1 300`; do


Try running this multiple times to see whether it still hangs in DataFrameReaderWriterSuite.

I will remove these changes later.

This is from #11922, which can reproduce this issue.

SparkQA · 2016-03-24T18:20:09Z

Test build #54064 has finished for PR 11940 at commit 7680aa0.

This patch fails executing the dev/run-tests script.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-03-24T18:21:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

    new HDFSMetadataLog[CompositeOffset](sqlContext, checkpointFile("offsets"))

+  /** A monitor to protect "uninterruptible" and "interrupted" */
+  private val uninterruptibleLock = new Object


(new Object()) ... but you're not actually using this?

Good catch. I wanted to use this separate lock but forgot.

SparkQA · 2016-03-24T19:27:21Z

Test build #54063 has finished for PR 11940 at commit 8c0f5d1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-03-24T20:26:55Z

Test build #54065 has started for PR 11940 at commit 69124c1.

It looks like this patch did fix the issue. DataFrameReaderWriterSuite has run 12 times so far and passed every time.

This reverts commit 7680aa0.

zsxwing · 2016-03-24T20:30:18Z

ping @tdas @marmbrus to take a final look.

SparkQA · 2016-03-24T20:32:22Z

Test build #54065 has finished for PR 11940 at commit 69124c1.

This patch fails executing the dev/run-tests script.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-03-24T20:34:58Z

Test build #54065 has finished for PR 11940 at commit 69124c1.

This patch fails executing the dev/run-tests script.
This patch merges cleanly.
This patch adds no public classes.

This failure is caused by the known flaky ContinuousQueryListenerSuite.

SparkQA · 2016-03-24T22:10:32Z

Test build #54081 has finished for PR 11940 at commit d8dcd04.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-03-25T17:56:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+    uninterruptibleLock.synchronized {
+      uninterruptible = true
+      // Clear the interrupted status if it's set.
+      if (Thread.interrupted()) {


interrupted = Thread.interrupted()? Or are you trying to not unset the interrupted status if it was true?

~~interrupted = Thread.interrupted() is simpler.~~ We should not set interrupted to false when Thread.interrupted() returns false.

vanzin · 2016-03-25T18:02:58Z

LGTM; looking at the code after my comment I think the shorter code is safe, but that's a super minor thing anyway.

zsxwing · 2016-03-25T18:11:56Z

@vanzin Oh, no... interrupted = Thread.interrupted() is wrong. It should not change interrupted to false if Thread.interrupted() returns false. Reverting it.
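
To make the difference between the two forms concrete, here is a small sketch in isolation. The object and method names are hypothetical, and the flag is passed as a parameter rather than kept as a field. Whether the overwrite can actually be observed in StreamExecution depends on where the flag is reset, which the rest of this thread discusses.

```scala
object InterruptFlagDemo {
  /** Shorter form: the previously recorded value is discarded. */
  def assignDirectly(recorded: Boolean): Boolean =
    Thread.interrupted()

  /** Form kept in the PR: the flag only ever moves from false to true here. */
  def setOnlyIfInterrupted(recorded: Boolean): Boolean =
    if (Thread.interrupted()) true else recorded
}
```

With no interrupt pending on the current thread, `assignDirectly(true)` returns false and so loses an interrupt that was recorded earlier, while `setOnlyIfInterrupted(true)` keeps it. Both forms clear the thread's interrupt status, because `Thread.interrupted()` does that as a side effect.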

tdas · 2016-03-25T18:13:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

-    val newData = uniqueSources.flatMap(s => s.getOffset.map(o => s -> o))
+    val newData = runUninterruptiblyInMicroBatchThread {
+      uniqueSources.flatMap(s => s.getOffset.map(o => s -> o))
+    }


It will be hard for future developers to understand why these calls should be uninterruptible. I think this needs more documentation.

tdas · 2016-03-25T18:16:33Z

This logic is a little complex. Could you add some unit tests to make sure this is correct?

vanzin · 2016-03-25T18:17:41Z

It should not change interrupted to false if Thread.interrupted() returns false

Is that because there are multiple calls to runUninterruptiblyInMicroBatchThread, and you want to keep the interrupted status across those calls? If that's the case, you probably should remove the code in L158 (where interrupted is set back to false).

tdas · 2016-03-25T18:23:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+   * Run `f` uninterruptibly in "microBatchThread". "microBatchThread" won't be interrupted before
+   * returning from `f`.
+   */
+  private def runUninterruptiblyInMicroBatchThread[T](f: => T): T = {


This name confused me a lot. Why isn't this just runUninterruptibly? ...InMicroBatchThread sounds like it will be called from some other thread, and the function will be offloaded to run in the microBatchThread.

zsxwing · 2016-03-25T18:24:14Z

Is that because there are multiple calls to runUninterruptiblyInMicroBatchThread, and you want to keep the interrupted status across those calls? If that's the case, you probably should remove the code in L158 (where interrupted is set back to false).

Yes, just for safety. Since L157 calls microBatchThread.interrupt(), it's safe to set interrupted back to false in L158.

vanzin · 2016-03-25T18:30:50Z

Since L157 calls microBatchThread.interrupt(), it's safe

I see. Can I suggest that you change the name of the variable to shouldInterruptThread or something that more explicitly says that it's used to control whether the runUninterruptibly method should interrupt the thread after running the code?

zsxwing · 2016-03-25T18:42:23Z

I see; so how about simplifying things a bit. Can I suggest that you change the name of the variable to shouldInterruptThread or something that more explicitly says that it's used to control whether the runUninterruptibly method should interrupt the thread after running the code?

Renamed.

zsxwing · 2016-03-25T18:44:11Z

I'm going to merge this one after tests pass to unblock other PRs. Will address further comments in another PR.

SparkQA · 2016-03-25T19:39:35Z

Test build #54197 has finished for PR 11940 at commit 37343ca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-25T19:56:18Z

Test build #54200 has finished for PR 11940 at commit 9809acf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-25T20:10:04Z

Test build #54198 has finished for PR 11940 at commit d8dcd04.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-25T20:25:17Z

Test build #54203 has finished for PR 11940 at commit 45f1452.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-03-25T20:26:06Z

Merging to master

## What changes were proposed in this pull request?

Extract the workaround for HADOOP-10622 introduced by apache#11940 into UninterruptibleThread so that we can test and reuse it.

## How was this patch tested?

Unit tests

Author: Shixiong Zhu

Closes apache#11971 from zsxwing/uninterrupt.

zsxwing added 2 commits March 24, 2016 10:42

Reproduce DataFrameReaderWriterSuite failure

7680aa0

zsxwing reviewed Mar 24, 2016

srowen reviewed Mar 24, 2016

Address

69124c1

Revert "Reproduce DataFrameReaderWriterSuite failure"

d8dcd04

This reverts commit 7680aa0.

zsxwing mentioned this pull request Mar 25, 2016

[SPARK-14134] [core] [test-maven] Change the package name used for shading classes. #11941

Closed

vanzin reviewed Mar 25, 2016

tdas reviewed Mar 25, 2016

Add document

9809acf

Rename

45f1452

asfgit closed this in b554b3c Mar 25, 2016

zsxwing deleted the workaround-for-HADOOP-10622 branch March 25, 2016 20:30

zsxwing mentioned this pull request Mar 25, 2016

[SPARK-14169][Core]Add UninterruptibleThread #11971

Closed
