[SPARK-5681][Streaming] Add tracker status and stop to receive messages when stopping tracker #4467
|
cc @tdas. |
|
Test build #27091 has finished for PR 4467 at commit
|
|
@tdas Please take a look at this when you have time. |
Why is this visible to anything outside the class?
|
I realized that this is a tricky thing to fix while maintaining the stopGracefully semantics. Stopping gracefully must ensure that if there are receivers that have already started, they are stopped and all the received data is processed before stopping completely. But what happens to the receivers that are still starting and have not registered yet? We have to wait for all of them to start, because if we don't, they may have started and pulled data already, which may lead to losing data. This is not good. So to solve this correctly, we probably need a Starting state as well, and stopGracefully must start the stopping process only after the system has reached the Started state. It has to wait for all the receivers to have started; otherwise it is hard to guarantee that all the receivers are correctly stopped. Also, this behavior must be properly unit tested with different state transitions, etc. Even before that, I would like to see what is the ideal state behavior -
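To make the "Starting state" idea in the comment above concrete, here is a minimal sketch, not code from this PR. It assumes a hypothetical `SketchTracker` that knows up front how many receivers to expect; the state names other than Starting, Started and Stopping, and all method names, are made up for illustration.

```scala
import java.util.concurrent.CountDownLatch
import scala.collection.mutable

// Hypothetical lifecycle states; only the idea of a Starting state
// comes from the discussion.
sealed trait TrackerState
case object Initialized extends TrackerState
case object Starting extends TrackerState
case object Started extends TrackerState
case object Stopping extends TrackerState
case object Stopped extends TrackerState

// Illustrative tracker, not Spark's ReceiverTracker.
class SketchTracker(expectedReceivers: Int) {
  private var state: TrackerState = Initialized
  private val registeredIds = mutable.Set.empty[Int]
  // Reaches zero once every expected receiver has registered.
  private val allRegistered = new CountDownLatch(expectedReceivers)

  def currentState: TrackerState = synchronized { state }

  def start(): Unit = synchronized {
    require(state == Initialized, s"cannot start from $state")
    state = if (expectedReceivers == 0) Started else Starting
  }

  def registerReceiver(streamId: Int): Unit = synchronized {
    require(state == Starting, s"cannot register in state $state")
    // Count each receiver only once, even if it registers twice.
    if (registeredIds.add(streamId)) allRegistered.countDown()
    if (allRegistered.getCount == 0L) state = Started
  }

  // Blocks until every expected receiver has registered, then stops.
  def stopGracefully(): Unit = {
    allRegistered.await() // wait outside the lock so registrations can proceed
    synchronized {
      state = Stopping
      // A real tracker would stop each receiver and drain received data here.
      state = Stopped
    }
  }
}
```

Note that `stopGracefully` in this sketch blocks forever if a receiver never registers, which is exactly the difficulty raised later in the thread: the tracker cannot tell how long it should wait.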
|
|
The state behavior should be:
For the receivers that are still starting and have not registered yet, we have two options.
I think both options guarantee that no data would be lost. I was thinking of using option 2 in this PR, because it should be simpler and, semantically, we should not allow receivers to register and then process data after stop is called. I just realized that the current implementation of The important reason I think we should not choose option 1, waiting for all receivers to be started, is that from the tracker's point of view, it has no idea which receivers are started. It just asynchronously waits for them to register and deregister. The receivers are visible to the tracker only once they are registered with it. When it is about to stop, it doesn't know whether there are receivers that have started but not registered yet, so it doesn't know how much longer it should wait for them. Thus it is safer to make sure that receivers must register before they start. |
|
It should accept addblock even while it is stopping, because there might be receivers still processing data. The modified state behavior should be:
|
|
Test build #27273 has finished for PR 4467 at commit
|
|
Test build #27277 has finished for PR 4467 at commit
|
|
I don't think that's correct. If the state becomes Stopping before a receiver That's why I think option 1 is cleanest. Wait for everything to start up,
|
|
Let's analyze it clearly. The following is a simplified status transformation of the problem:
time | tracker | receivers
The above causes potential data loss. We want to avoid that. I agree. If we implement option 1, the status transformation becomes:
time | tracker | receivers
*we are going to wait for receivers that are started but not registered yet.
t = n+2 | stopping | stopped:{A, B}; registered: {C}
As you see, there is still a possible status where we have an unregistered receiver C that processes data.
This PR implements another approach. The receivers register first, then do the starting process:
time | tracker | receivers |
|
@tdas, do you have time to take a look at the analysis and the current implementation? |
|
Sorry, I am a bit tied up with stuff. I will definitely take a look as soon as I
|
|
Hey ... let's continue the discussion. I took a quick look at the logic, sounds good. Let me think a bit more and look at the code. |
|
@tdas Any updated ideas or comments? |
|
/cc @tdas Still busy? |
|
@viirya Sorry for slacking on this, been busy. I think I understand your explanation. But I also spent some more time thinking about this from the ground up. Correct me if I am wrong, but the thing was getting stuck because of this line That can happen only if the receiver (C in your examples) that had not registered by the time "stop gracefully" was called is somehow running indefinitely. That is because it had registered and started running even though the system was stopping gracefully. Ideally, it should never have been allowed to register at all! That is, the ReceiverTracker should prevent any further registration as soon as it gets a stop signal. In this solution, the sequence of events will be: time | tracker | receivers Since C is not allowed to start, it stops itself and the task completes. The job completes, running = false, tracker stops. t = 4 | stopped | stopped:{A, B}; stopped: {C} No data is lost in this case, because C is never allowed to start. Isn't this a viable solution? If so, I think this is simpler than introducing another state. |
|
In We know that the thing was getting stuck because sometimes a receiver will register after the tracker is going to stop, so the receiver will not get the stop message properly. As you said, we can disallow its registration. However, because it already pulls data, it may lose received data. The solution I proposed also disallows the receiver from registering. In addition, it moves registration ahead of starting, so a receiver only pulls data after it is registered with the tracker, and all registered receivers properly receive stop messages. By doing that, we guarantee no data loss for receivers. |
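The "register first, then start" ordering described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `RegisteringTracker`, `SketchReceiver` and their methods are hypothetical names, and "pulling data" is reduced to a flag.

```scala
import scala.collection.mutable

// Illustrative tracker: refuses registrations once a stop has been requested.
class RegisteringTracker {
  private var stopping = false
  private val registered = mutable.Set.empty[Int]

  // Returns true only if the receiver was accepted.
  def register(streamId: Int): Boolean = synchronized {
    if (stopping) false
    else { registered += streamId; true }
  }

  // Marks the tracker as stopping and returns the receivers
  // that must be sent a stop message.
  def beginStop(): Set[Int] = synchronized {
    stopping = true
    registered.toSet
  }
}

// Illustrative receiver: it never pulls data unless registration succeeded.
class SketchReceiver(val streamId: Int, tracker: RegisteringTracker) {
  @volatile private var pulling = false

  def isPulling: Boolean = pulling

  // Register first; only start pulling if the tracker accepted us.
  def start(): Boolean = {
    val accepted = tracker.register(streamId)
    if (accepted) pulling = true
    accepted
  }

  def stop(): Unit = { pulling = false }
}
```

Under these assumptions, a late receiver such as C is rejected before it pulls anything, and every receiver that did start is in the set returned by `beginStop()`, so each of them can be told to stop.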
…meout Conflicts: streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
|
Test build #29785 has finished for PR 4467 at commit
|
|
Test build #29786 has finished for PR 4467 at commit
|
You need to send something back, or trackerActor.ask(msg)(askTimeout) will wait until timeout.
I propose returning false if isTrackerStopping == true. And if onReceiverRegister receives false, it just throws an exception.
I originally intended to let it time out and throw an exception. Returning false and throwing an exception is good too. I will update it.
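The review suggestion above (always reply, return false while stopping, and throw on the caller side) might look like the sketch below. It deliberately avoids the actor API and uses a plain `Promise` as the reply channel; the `RegisterReceiver` message shape, the `Handler` class and the signature of `onReceiverRegister` are assumptions made for illustration, with only the names `isTrackerStopping` and `onReceiverRegister` taken from the thread.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object RegistrationReply {
  // Hypothetical message; the promise stands in for the ask reply channel.
  final case class RegisterReceiver(streamId: Int, reply: Promise[Boolean])

  class Handler {
    @volatile var isTrackerStopping = false

    def receive(msg: RegisterReceiver): Unit = {
      // Complete the reply in every branch so the caller never has to
      // wait for a timeout: false means "rejected, tracker is stopping".
      msg.reply.success(!isTrackerStopping)
      ()
    }
  }

  // Caller side: a false reply is turned into an exception.
  def onReceiverRegister(handler: Handler, streamId: Int): Unit = {
    val reply = Promise[Boolean]()
    handler.receive(RegisterReceiver(streamId, reply))
    val accepted = Await.result(reply.future, 1.second)
    if (!accepted) {
      throw new IllegalStateException(
        s"Registration of receiver $streamId rejected: tracker is stopping")
    }
  }
}
```

The point of the design is that a rejected receiver fails fast with a clear error instead of blocking until the ask timeout expires.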
|
Could you resolve the conflicts? |
…meout Conflicts: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
|
Test build #31869 has finished for PR 4467 at commit
|
The timeout logic will change the stopGracefully semantics. The stopGracefully semantics should be: if ssc.stop(..., stopGracefully = true) returns normally, no data loss will happen. But after your change, if ssc.stop(..., stopGracefully = true) returns normally, the user won't know whether everything went smoothly. There is no signal here to help the user understand what happened internally.
I vote for keeping the original code unchanged.
|
Test build #32342 has finished for PR 4467 at commit
|
…solve the race condition

This is an alternative way to fix `SPARK-5681`. It minimizes the changes.

Closes #4467

Author: zsxwing <[email protected]>
Author: Liang-Chi Hsieh <[email protected]>

Closes #6294 from zsxwing/pr4467 and squashes the following commits:

709ac1f [zsxwing] Fix the comment
e103e8a [zsxwing] Move ReceiverTracker.stop into ReceiverTracker.stop
f637142 [zsxwing] Address minor code style comments
a178d37 [zsxwing] Move 'stopReceivers' to the event looop to resolve the race condition
51fb07e [zsxwing] Fix the code style
3cb19a3 [zsxwing] Merge branch 'master' into pr4467
b4c29e7 [zsxwing] Stop receiver only if we start it
c41ee94 [zsxwing] Make stopReceivers private
7c73c1f [zsxwing] Use trackerStateLock to protect trackerState
a8120c0 [zsxwing] Merge branch 'master' into pr4467
7b1d9af [zsxwing] "case Throwable" => "case NonFatal"
15ed4a1 [zsxwing] Register before starting the receiver
fff63f9 [zsxwing] Use a lock to eliminate the race condition when stopping receivers and registering receivers happen at the same time.
e0ef72a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
19b76d9 [Liang-Chi Hsieh] Remove timeout.
34c18dc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
c419677 [Liang-Chi Hsieh] Fix style.
9e1a760 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
355f9ce [Liang-Chi Hsieh] Separate register and start events for receivers.
3d568e8 [Liang-Chi Hsieh] Let receivers get registered first before going started.
ae0d9fd [Liang-Chi Hsieh] Merge branch 'master' into tracker_status_timeout
77983f3 [Liang-Chi Hsieh] Add tracker status and stop to receive messages when stopping tracker.
…ks.jira.com/browse/BUG-40311)

[SPARK-5681] [STREAMING] Move 'stopReceivers' to the event loop to resolve the race condition

This is an alternative way to fix `SPARK-5681`. It minimizes the changes.

Closes apache#4467

Author: zsxwing <[email protected]>
Author: Liang-Chi Hsieh <[email protected]>

Closes apache#6294 from zsxwing/pr4467 and squashes the following commits:

709ac1f [zsxwing] Fix the comment
e103e8a [zsxwing] Move ReceiverTracker.stop into ReceiverTracker.stop
f637142 [zsxwing] Address minor code style comments
a178d37 [zsxwing] Move 'stopReceivers' to the event looop to resolve the race condition
51fb07e [zsxwing] Fix the code style
3cb19a3 [zsxwing] Merge branch 'master' into pr4467
b4c29e7 [zsxwing] Stop receiver only if we start it
c41ee94 [zsxwing] Make stopReceivers private
7c73c1f [zsxwing] Use trackerStateLock to protect trackerState
a8120c0 [zsxwing] Merge branch 'master' into pr4467
7b1d9af [zsxwing] "case Throwable" => "case NonFatal"
15ed4a1 [zsxwing] Register before starting the receiver
fff63f9 [zsxwing] Use a lock to eliminate the race condition when stopping receivers and registering receivers happen at the same time.
e0ef72a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
19b76d9 [Liang-Chi Hsieh] Remove timeout.
34c18dc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
c419677 [Liang-Chi Hsieh] Fix style.
9e1a760 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
355f9ce [Liang-Chi Hsieh] Separate register and start events for receivers.
3d568e8 [Liang-Chi Hsieh] Let receivers get registered first before going started.
ae0d9fd [Liang-Chi Hsieh] Merge branch 'master' into tracker_status_timeout
77983f3 [Liang-Chi Hsieh] Add tracker status and stop to receive messages when stopping tracker.
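As a rough illustration of why moving `stopReceivers` onto the event loop removes the race, the sketch below runs both registration and stopping on one single-threaded executor, so the two can never interleave. It is a simplified stand-in, not the merged Spark code: `EventLoopTracker`, `post` and the return types are assumptions.

```scala
import java.util.concurrent.{Callable, Executors, Future => JFuture}
import scala.collection.mutable

// Illustrative tracker whose state is touched only on one event-loop thread.
class EventLoopTracker {
  private val loop = Executors.newSingleThreadExecutor()

  // Accessed only from tasks running on `loop`, so no extra lock is needed.
  private var stopping = false
  private val registered = mutable.Set.empty[Int]

  // Hypothetical helper: run `body` on the event loop and expose its result.
  private def post[T](body: => T): JFuture[T] =
    loop.submit(new Callable[T] { override def call(): T = body })

  def register(streamId: Int): JFuture[Boolean] = post {
    if (stopping) false
    else { registered += streamId; true }
  }

  // Runs on the same loop as register, so every registration is ordered
  // either entirely before or entirely after the stop.
  def stopReceivers(): JFuture[Set[Int]] = post {
    stopping = true
    registered.toSet
  }

  def shutdown(): Unit = loop.shutdown()
}
```

With this ordering, a registration handled before the stop appears in the set returned by `stopReceivers()`, and one handled after it is rejected; there is no window in which a receiver is accepted but never told to stop.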
Related to #4364.
Sometimes the receiver will be registered with the tracker after `ssc.stop()` is called, especially when `stop()` is called immediately after `start()`. So the receiver doesn't get the `StopReceiver` message from the tracker. In this case, when you call `stop()` in graceful mode, `stop()` would get stuck indefinitely.

This PR adds a status to `ReceiverTracker` and asks `ReceiverTracker` to stop receiving messages when stopping.

This also adds a timeout check to `ReceiverLauncher.stop`.