Skip to content

Conversation

@da-liii
Copy link
Contributor

@da-liii da-liii commented Apr 26, 2018

What changes were proposed in this pull request?

quote from #11205

Executors may never be idle. Currently we use the executor to tasks mapping relation to identify the status of executors, in maximum task failure situation, some TaskEnd events may never be delivered, which makes the related executor always be busy.

#19580 fix the incorrect number of running tasks.

This PR solves the problem in a similar way on the fact that TaskEnd events may never be delivered but StageCompleted events would be delivered eventually.

How was this patch tested?

Existing tests

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@jerryshao
Copy link
Contributor

Can you please check again with latest master code, I doubt the issue is not valid any more in the latest code.

@jiangxb1987
Copy link
Contributor

+1 seems you didn't test against the latest master code or 2.3?

@da-liii
Copy link
Contributor Author

da-liii commented Apr 27, 2018

I have checked the latest master code.

        // If the executor is no longer running any scheduled tasks, mark it as idle
        if (executorIdToTaskIds.contains(executorId)) {
          executorIdToTaskIds(executorId) -= taskId
          if (executorIdToTaskIds(executorId).isEmpty) {
            executorIdToTaskIds -= executorId
            allocationManager.onExecutorIdle(executorId)
          }
        }

To remove a executor, we have to mark it idle. To mark a executor idle, we have to invoke allocationManager.onExecutorIdle. In the latest master code, the two places that invokes the method are onExecutorAdded and onTaskEnd. For a running executor, the only way to mark it idle is the code snippet above.

#19580 makes sense on the basis that the event TaskEnd for some reason may be lost.

On the same basis, the Set executorIdToTaskIds(executorId) may never be empty.

the issue is not valid any more in the latest code

Please tell me why, thanks! The TaskEnd event will never be lost or other improves ? Which PR?

We are using Spark 2.1.1, and the issue is valid.

quote from #11205 by @jerryshao

Verified again, looks like the 2nd bullet is not valid anymore, I cannot reproduce it in latest master branch, this might have already been fixed in SPARK-13054.

Spark 2.1.1 is shipped with the fixes for SPARK-13054, but still we encountered the issue.

@da-liii da-liii closed this Apr 27, 2018
@da-liii da-liii reopened this Apr 27, 2018
@jerryshao
Copy link
Contributor

  1. We improve the DAGScheduler to always send TaskEnd message. So the issue I found before may not be valid.
  2. We refactored the LiveListenerQueue to make it more robust for internal listener. We cannot guarantee that event will never be lost, but the chance is quite small (SPARK-18838).

IMHO you (as a PR submitter) should validate this issue with latest code and make sure you can reproduce it with latest code.

@da-liii da-liii closed this May 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants