Skip to content

Conversation

@mallman
Copy link
Contributor

@mallman mallman commented Jan 11, 2016

[SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.

This contribution is my original work, and I license this work to the Spark project under the project's open source license.

avoid a race condition where the standalone master attempts to build the
app's history UI before the event log is stopped
@mallman
Copy link
Contributor Author

mallman commented Jan 11, 2016

Changed Jira ref from SPARK-6950 to SPARK-12755. SPARK-6950 is an older, defunct ticket. Oops.

@mallman mallman changed the title [SPARK-6950][CORE] Stop the event logger before the DAG scheduler [SPARK-12755][CORE] Stop the event logger before the DAG scheduler Jan 11, 2016
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment here to explain why this needs to be stopped before the DAGScheduler in order to prevent this change from being accidentally lost in the future?

@JoshRosen
Copy link
Contributor

Will this change still be relevant if we remove the Spark Master's embedded HistoryServer? In other words, does this race condition affect the standalone HistoryServer or only the Master history server? If it only affects the master then this isn't worth changing in master since we're going to remove the master's embedded history server for Spark 2.0. It may still be worthwhile for Spark 1.6, though, although there's a risk/benefit trade-off here.

@mallman
Copy link
Contributor Author

mallman commented Jan 18, 2016

Hi Josh,

Good questions. I may have submitted this PR incorrectly. Perhaps you can guide me in the right direction.

I submitted this PR for merging into master because my understanding is that's how all PR's for the Spark project should be created. And patches against master may be backported to earlier releases. However, I originally created and tested this patch on branch-1.5 because that's what we're currently running. So while this patch may be irrelevant to master (or Spark 2.0), it's relevant to the Spark 1.5 branch and presumably 1.6 as well. Under these circumstances, should I have submitted a PR against master as I have done? The code contribution guidelines state that only in a special case would a PR be opened against another branch. Does a patch with no or lesser relevance to the master branch compared to an earlier release branch qualify as a "special case"? And if so, which branch should I have submitted the PR against?

Thanks.

@mallman
Copy link
Contributor Author

mallman commented Jan 18, 2016

I should also state that my original motivation in submitting this patch was to address the confusing log messages

Application ... is still in progress, it may be terminated abnormally.

which I saw in the Spark master log for apps that actually terminated normally.

Also, it's just come to mind that this bug may explain another behavior I've seen—sometimes an app's event log is corrupted if it was configured to be compressed. If the log is uncompressed then the ability for the history reader to decode an "in progress" event log allows it to be processed as expected. However, if the event log is being written through a compressed output stream and is not properly flushed before it is processed, then the processing may fail because the file compression was incomplete. (As this just occurred to me I haven't tested this hypothesis, but it does sound like a plausible explanation.) If this is the case, then this patch should correct the problem with corrupt compressed event logs.

@srowen
Copy link
Member

srowen commented Jan 18, 2016

Yes, we'd prefer to make changes in master and then back port rather than apply only to a branch. I think the change is at worst a no-op for master? if so I think this change is fine. It sounds like it needs to be back-ported to 1.5.

@SparkQA
Copy link

SparkQA commented Jan 18, 2016

Test build #2395 has finished for PR 10700 at commit 6c32f77.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 18, 2016

Test build #2397 has finished for PR 10700 at commit 6c32f77.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

scheduler, take 3. This was my original intention, bungled twice :/
@mallman
Copy link
Contributor Author

mallman commented Jan 18, 2016

Sorry guys. I bungled the ordering of the stop() calls. That's what I get for doing a manual patch from a manual diff from another branch-1.5... 😞 This patch passes all tests on branch-1.5. Can you please kick off a new test build in jenkins?

@SparkQA
Copy link

SparkQA commented Jan 18, 2016

Test build #2398 has finished for PR 10700 at commit 57cade1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mallman
Copy link
Contributor Author

mallman commented Jan 21, 2016

Here are my current thoughts. Josh says this functionality is going to be removed in Spark 2.0. The bug this PR is designed to address manifests itself in Spark 1.5 in three ways (I'm aware of):

  1. Misleading log messages from the Master (reported above).
  2. Incomplete (aka "in progress") application event logs, which can be further divided into two scenarios:
    2.a. Incomplete uncompressed event log files. The log processor can recover these files.
    2.b. Incomplete compressed event log files. The compression output is truncated and unreadable by normal means. The history server reports a corrupted event log. I cannot definitively tie that symptom to this bug, but it agrees with my experience.

The most problematic of these is unrecoverable event logs. I've been frustrated by this before and turned off event log compression as a workaround. Since deploying a build with this patch to one of our dev clusters I haven't seen this problem again.

I don't see a simple way to write a test to support this PR.

Overall, I feel we should close this PR but keep a reference to it from Jira with a comment that Spark 1.5 and 1.6 users can try this patch—at their own risk—to address the described symptoms if they wish to. It's going into our own Spark 1.x builds.

I'll close this PR and the associated Jira issue within the next few days unless someone objects or wishes to continue discussion.

Thanks.

@srowen
Copy link
Member

srowen commented Jan 21, 2016

Are there downsides to merging this to master, even if the related functionality is about to be removed? it passes tests, and seems to improve an ordering of shutdown, and can be backported to fix an actual minor issue in previous releases. Tests would be cool but you're correct that this one could be really hard to trigger. I see no reason to close this?

asfgit pushed a commit that referenced this pull request Jan 25, 2016
[SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.

This contribution is my original work, and I license this work to the Spark project under the project's open source license.

Author: Michael Allman <[email protected]>

Closes #10700 from mallman/stop_event_logger_first.

(cherry picked from commit 4ee8191)
Signed-off-by: Sean Owen <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 25, 2016
[SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.

This contribution is my original work, and I license this work to the Spark project under the project's open source license.

Author: Michael Allman <[email protected]>

Closes #10700 from mallman/stop_event_logger_first.

(cherry picked from commit 4ee8191)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Jan 25, 2016

Merged to master/1.6/1.5

@asfgit asfgit closed this in 4ee8191 Jan 25, 2016
@mallman
Copy link
Contributor Author

mallman commented Jan 25, 2016

Thanks, @srowen.

@mallman mallman deleted the stop_event_logger_first branch June 29, 2016 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants