Skip to content

Conversation

@andrewor14
Copy link
Contributor

Before we would get the following (benign) error if we called sc.stop() twice. This is because the listener bus would try to post the end event again even after it has already stopped. This happens occasionally when flaky tests fail, usually as a result of other sources of error. Either way we shouldn't be logging this error when it is not the cause of the failure.

ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerApplicationEnd(1425348445682)

@SparkQA
Copy link

SparkQA commented Mar 3, 2015

Test build #28228 has started for PR 4871 at commit 915db16.

  • This patch merges cleanly.

@srowen
Copy link
Member

srowen commented Mar 3, 2015

Just looking at the function it seems like this is the right thing to do. Is it right-er to move these two lines after stopped = true? This is read by non-synchronized methods so it's still possible the first two events happen before the SparkContext reports that it is stopped.

@SparkQA
Copy link

SparkQA commented Mar 3, 2015

Test build #28228 has finished for PR 4871 at commit 915db16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28228/
Test PASSed.

@o-mdr
Copy link

o-mdr commented Mar 3, 2015

def stop() {
     SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {

This line should have prevented 2 threads from going into the inner {...} and thefore changing the order of the lines wouldn't help.

@srowen if this is read by non-syncronised methods, only one of them will be able to go inside this method and all the others will wait, correct? Thus it could be the problem with the others methods, that run some non-synchronised logic before calling into this syncronised method.

@andrewor14 Having a test would help.

@srowen
Copy link
Member

srowen commented Mar 3, 2015

@o-mdr I'm referring to read of the stopped member elsewhere in this class. Yes of course synchronized excludes two threads from executing this block simultaneously. This is not true of reads of this field elsewhere in the class.

@JoshRosen
Copy link
Contributor

This line should have prevented 2 threads from going into the inner {...} and thefore changing the order of the lines wouldn't help.

I think the original issue here was that two sequential calls to stop() could cause spurious errors / warnings to occur, e.g.

sc.stop()
// ---- other stuff -----
sc.stop()

This can happen when sc.stop() is called in finally blocks; there are many cases that can lead to stop() being called twice on the same SparkContext, so it makes sense to have it be idempotent.

@srowen I'm not quite sure why stopped is set after posting the application end event or stopping the UI. Before I introduced the stopped variable, the old code would set dagScheduler = null and check that field's value to determine whether the SparkContext had stopped. That old code had the same property, so maybe there's a good reason why things were ordered this way (or maybe not; this could just be some vestigial artifact of how the code evolved over time).

@andrewor14
Copy link
Contributor Author

@o-mdr This is not a race condition, so there are no concurrency issues here. The real issue is that Spark calls sc.stop() internally sometimes when something bad happens, and then the application may decide to call sc.stop() again after that because it's what we recommend. This results in a benign error being logged that does not convey the root cause of the problem.

Also, note that any test we add here is not particularly meaningful because it would have passed even before this patch. For this reason I decided not to add one.

@SparkQA
Copy link

SparkQA commented Mar 3, 2015

Test build #28244 has started for PR 4871 at commit a14afc5.

  • This patch merges cleanly.

@andrewor14
Copy link
Contributor Author

Ah, @srowen @o-mdr I misunderstood what you guys were saying. I think what @o-mdr meant was that moving the two lines after stopped = true wouldn't change anything because we're in a synchronized block anyway. Yes, that is true, but I also agree with @srowen that it's a slightly more readable reorganization with no change in behavior. This is a super minor point, but I made the change anyway.

@srowen
Copy link
Member

srowen commented Mar 3, 2015

It makes some tiny difference; my observation was that the stopped flag's state can be observed outside this synchronized method. So its interleaving with other operations could matter. I have no idea if it does, but yeah, seems more logical to not put some operations before the flag state change. +1

@andrewor14
Copy link
Contributor Author

I see, since we don't synchronize around the uses of stopped. E.g. if postApplicationEnd reads stopped in another thread (somehow) then this would matter.

@SparkQA
Copy link

SparkQA commented Mar 3, 2015

Test build #28244 has finished for PR 4871 at commit a14afc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28244/
Test PASSed.

asfgit pushed a commit that referenced this pull request Mar 3, 2015
Before we would get the following (benign) error if we called `sc.stop()` twice. This is because the listener bus would try to post the end event again even after it has already stopped. This happens occasionally when flaky tests fail, usually as a result of other sources of error. Either way we shouldn't be logging this error when it is not the cause of the failure.
```
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerApplicationEnd(1425348445682)
```

Author: Andrew Or <[email protected]>

Closes #4871 from andrewor14/sc-stop and squashes the following commits:

a14afc5 [Andrew Or] Move code after code
915db16 [Andrew Or] Move code into code

(cherry picked from commit 6c20f35)
Signed-off-by: Andrew Or <[email protected]>
@andrewor14
Copy link
Contributor Author

Ok thanks everyone for your comments. I'm merging this into master and 1.2. I will back port this into 1.3 later after the release.

@asfgit asfgit closed this in 6c20f35 Mar 3, 2015
@andrewor14 andrewor14 deleted the sc-stop branch March 3, 2015 23:11
asfgit pushed a commit that referenced this pull request Mar 13, 2015
Before we would get the following (benign) error if we called `sc.stop()` twice. This is because the listener bus would try to post the end event again even after it has already stopped. This happens occasionally when flaky tests fail, usually as a result of other sources of error. Either way we shouldn't be logging this error when it is not the cause of the failure.
```
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerApplicationEnd(1425348445682)
```

Author: Andrew Or <[email protected]>

Closes #4871 from andrewor14/sc-stop and squashes the following commits:

a14afc5 [Andrew Or] Move code after code
915db16 [Andrew Or] Move code into code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants