Skip to content

Conversation

@zsxwing
Copy link
Member

@zsxwing zsxwing commented Mar 24, 2016

What changes were proposed in this pull request?

There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 (HADOOP-10622). If we interrupt some thread running Shell.runCommand, we may hit this issue.

This PR adds some protecion to prevent from interrupting the microBatchThread when we may run into Shell.runCommand. There are two places will call Shell.runCommand now:

  • offsetLog.add
  • FileStreamSource.getOffset

They will create a file using HDFS API and call Shell.runCommand to set the file permission.

How was this patch tested?

Existing unit tests.

zsxwing added 2 commits March 24, 2016 10:42
There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we interrupt some thread running Shell.runCommand, we may hit this issue.

This PR adds some protecion to prevent from interrupting the microBatchThread when we may run into Shell.runCommand. There are two places will call Shell.runCommand now:

- offsetLog.add
- FileStreamSource.getOffset

They will create a file using HDFS API and call Shell.runCommand to set the file permission.
dev/run-tests Outdated

exec python -u ./dev/run-tests.py "$@"
set -e
for i in `seq 1 300`; do
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to run this multiple times and see if it won't hang in DataFrameReaderWriterSuite

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove them later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from #11922 which can reproduce this issue

@SparkQA
Copy link

SparkQA commented Mar 24, 2016

Test build #54064 has finished for PR 11940 at commit 7680aa0.

  • This patch fails executing the dev/run-tests script.
  • This patch merges cleanly.
  • This patch adds no public classes.

new HDFSMetadataLog[CompositeOffset](sqlContext, checkpointFile("offsets"))

/** A monitor to protect "uninterruptible" and "interrupted" */
private val uninterruptibleLock = new Object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(new Object()) ... but you're not actually using this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I wanted to use this separate lock but forgot

@SparkQA
Copy link

SparkQA commented Mar 24, 2016

Test build #54063 has finished for PR 11940 at commit 8c0f5d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Mar 24, 2016

Test build #54065 has started for PR 11940 at commit 69124c1.

Looks this patch did fix the issue. DataFrameReaderWriterSuite has run 12 times till now and all passed.

@zsxwing
Copy link
Member Author

zsxwing commented Mar 24, 2016

ping @tdas @marmbrus to take a final look.

@SparkQA
Copy link

SparkQA commented Mar 24, 2016

Test build #54065 has finished for PR 11940 at commit 69124c1.

  • This patch fails executing the dev/run-tests script.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Mar 24, 2016

Test build #54065 has finished for PR 11940 at commit 69124c1.

This patch fails executing the dev/run-tests script.
This patch merges cleanly.
This patch adds no public classes.

This is because of the known flaky ContinuousQueryListenerSuite.

@SparkQA
Copy link

SparkQA commented Mar 24, 2016

Test build #54081 has finished for PR 11940 at commit d8dcd04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

uninterruptibleLock.synchronized {
uninterruptible = true
// Clear the interrupted status if it's set.
if (Thread.interrupted()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interrupted = Thread.interrupted()? Or are you trying to not unset the interrupted status if it was true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interrupted = Thread.interrupted() is simpler. Should not set interrupted to false when Thread.interrupted() return false.

@vanzin
Copy link
Contributor

vanzin commented Mar 25, 2016

LGTM; looking at the code after my comment I think the shorter code is safe, but that's a super minor thing anyway.

@zsxwing
Copy link
Member Author

zsxwing commented Mar 25, 2016

@vanzin Oh, no... interrupted = Thread.interrupted() is wrong. It should not change interrupted to false if Thread.interrupted() returns false. Reverting it.

val newData = uniqueSources.flatMap(s => s.getOffset.map(o => s -> o))
val newData = runUninterruptiblyInMicroBatchThread {
uniqueSources.flatMap(s => s.getOffset.map(o => s -> o))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to understand (for future devs), why these should be uninterruptible. I think this needs more documentation.

@tdas
Copy link
Contributor

tdas commented Mar 25, 2016

This logic is a little complex. Could you add some unit tests to make sure this is correct?

@vanzin
Copy link
Contributor

vanzin commented Mar 25, 2016

It should not change interrupted to false if Thread.interrupted() returns false

Is that because there are multiple calls to runUninterruptiblyInMicroBatchThread, and you want to keep the interrupted status across those calls? If that's the case, you probably should remove the code in L158 (where interrupted is set back to false).

* Run `f` uninterruptibly in "microBatchThread". "microBatchThread" won't be interrupted before
* returning from `f`.
*/
private def runUninterruptiblyInMicroBatchThread[T](f: => T): T = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name confused me a lot. Why isnt this just runUninterruptibly? ...InMicroBatchThread sounds like it will be called from some other thread, and the func will be offloaded to run in the microbatchthread.

@zsxwing
Copy link
Member Author

zsxwing commented Mar 25, 2016

Is that because there are multiple calls to runUninterruptiblyInMicroBatchThread, and you want to keep the interrupted status across those calls? If that's the case, you probably should remove the code in L158 (where interrupted is set back to false).

Yes just for safety. Since L157 calls microBatchThread.interrupt(), it's safe to set interrupted back to false in L158.

@vanzin
Copy link
Contributor

vanzin commented Mar 25, 2016

Since L157 calls microBatchThread.interrupt(), it's safe

I see. Can I suggest that you change the name of the variable to shouldInterruptThread or something that more explicitly says that it's used to control whether the runUninterruptibly method should interrupt the thread after running the code?

@zsxwing
Copy link
Member Author

zsxwing commented Mar 25, 2016

I see; so how about simplifying things a bit. Can I suggest that you change the name of the variable to shouldInterruptThread or something that more explicitly says that it's used to control whether the runUninterruptibly method should interrupt the thread after running the code?

Renamed.

@zsxwing
Copy link
Member Author

zsxwing commented Mar 25, 2016

I'm going to merge this one after tests pass to unblock other PRs. Will address further comments in another PR.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54197 has finished for PR 11940 at commit 37343ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54200 has finished for PR 11940 at commit 9809acf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54198 has finished for PR 11940 at commit d8dcd04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54203 has finished for PR 11940 at commit 45f1452.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Mar 25, 2016

Merging to master

@asfgit asfgit closed this in b554b3c Mar 25, 2016
@zsxwing zsxwing deleted the workaround-for-HADOOP-10622 branch March 25, 2016 20:30
ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 28, 2016
## What changes were proposed in this pull request?

Extract the workaround for HADOOP-10622 introduced by apache#11940 into UninterruptibleThread so that we can test and reuse it.

## How was this patch tested?

Unit tests

Author: Shixiong Zhu <[email protected]>

Closes apache#11971 from zsxwing/uninterrupt.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants