[SPARK-1112, 2156] use min akka frame size to decide how to send task results #1124
Conversation
Merged build triggered.

Merged build started.
Is there a use case for making the min frame size configurable with a parameter?
Merged build triggered.

Merged build started.
@ash211 If there is a way to make the configuration delivered consistently to the backend, we can use This is a quick fix for
Merged build triggered.

Merged build started.

Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15887/

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15888/
I'm not sure about the implications of it, but it sounds like it's probably ok to
200 KB may not be enough. Its value should increase as the serialized DirectResult becomes larger.
I tested with a double array of size very close to 10 * 1024 * 1024. The akka message overhead is about 30-60 KB. This PR doesn't fix the issue of receiving new tasks from the driver that are bigger than 10MB; @pwendell is working on that.
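The overhead measurement above translates into a simple threshold check. This is an illustrative sketch only, with hypothetical names (not Spark's actual API): reserve headroom for Akka's own message overhead, observed at roughly 30-60 KB for a ~10 MB payload, with 200 KB as the safe margin discussed here.

```scala
// Hypothetical sketch; names are illustrative, not Spark's actual code.
val minAkkaFrameSize: Long = 10L * 1024 * 1024 // Akka's default max frame size (10 MB)
val reservedOverhead: Long = 200L * 1024       // slack for the Akka message envelope

// A serialized result no larger than this is safe to send directly over Akka.
val directResultThreshold: Long = minAkkaFrameSize - reservedOverhead

def fitsInAkkaFrame(serializedResultSize: Long): Boolean =
  serializedResultSize <= directResultThreshold
```

Anything above the threshold would be returned through the block manager instead.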
Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15891/
This is a slightly disingenuous name/comment: it's really the minimum of the maximum frame size. We can keep the name, but maybe add some clarification in the comment.
LGTM. Note that this fix is "safe" in that, in the most conservative case, we'll just return the result through the block manager rather than Akka. This may be a little slower, but should be guaranteed to work.
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15894/
Merged build triggered.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15899/
Why 200K? And should we change the similar code that does this for sending task closures (which also subtracts 1024)?
We computed the difference between the size of the akka message and the size of the serialized task result (~10M). The difference is smaller than 60K, so I set 200K to be safe. Could you point me to the places where we use 1024?
@kayousterhout I saw the line in CoarseGrainedSchedulerBackend. Should the overhead be bounded by a fixed size or proportional to the message size?
Re: your last question, I have no idea; you definitely know more about this than I do at this point. Using the higher 200K value seems like a safe alternative to what's currently there. It would be great to add this constant to AkkaUtils so we don't need to manually track it down if it changes again in the future.
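Centralizing the slack size as suggested might look something like the sketch below. The object and member names here are hypothetical, not the actual Spark code; the point is that the minimum frame size and the reserved overhead live in one place.

```scala
// Hypothetical sketch of centralizing the slack size in a utility object
// (names are illustrative, not Spark's actual AkkaUtils API).
object AkkaUtilsSketch {
  // Minimum value of the maximum frame size for any actor system we create.
  val minFrameSizeBytes: Int = 10 * 1024 * 1024
  // Headroom reserved on top of the payload for Akka message overhead.
  val reservedSizeBytes: Int = 200 * 1024
  // The largest payload we can safely assume fits in a single frame.
  def safePayloadLimit: Int = minFrameSizeBytes - reservedSizeBytes
}
```

Callers deciding how to route a result would then reference `AkkaUtilsSketch.safePayloadLimit` instead of hard-coding the numbers.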
Jenkins, retest this please.

Merged build triggered.

Merged build started.
So this can make performance worse for folks who set spark.akka.frameSize higher in the SparkConf on the executor, right, because this commit means we effectively ignore that setting? Can we just figure out what the max frame size actually is for the backend actor system? Also, this can lead to hung jobs if a user sets the frame size lower than the default of 10, right? @pwendell said you chose to ignore this case, but it would be nice to add a comment explicitly stating this in the code in case it turns up as a bug later.
For the first scenario, it won't make performance worse, because the system doesn't really work now for serialized task results of size between 10M and spark.akka.frameSize. But yes, the ideal solution is to get the conf and set the right frame size. Maybe we can first request the conf from the driver, and then create a new ActorSystem on the backend with the correct frame size. This saves us from thinking about different deploy modes. Any ActorSystem created using

Now I'm testing whether
@mengxr I think the scenario is where people have set SPARK_JAVA_OPTS in
@mengxr this actually hung up while running tests:
Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15896/
Merged build finished. All automated tests passed.

All automated tests passed.
I don't get it. As long as the actor systems are created from AkkaUtils.createActorSystem, the minimum value of the max frame size is 10M. All unit tests passed on Jenkins.
I see -- you're right about the min frame size. @pwendell was talking about the first case, where I think the config parameter is currently being correctly honored.
Just tested
@pwendell @kayousterhout I put an alternative solution in #1132. Please let me know which one you prefer.
Closing this in favor of #1132.
This is an alternative solution to #1124. Before launching the executor backend, we first fetch the driver's spark properties and use them to overwrite the executor's spark properties. This should be better than #1124. @pwendell Are there spark properties that might be different on the driver and on the executors?

Author: Xiangrui Meng <[email protected]>

Closes #1132 from mengxr/akka-bootstrap and squashes the following commits:

77ff32d [Xiangrui Meng] organize imports
68e1dfb [Xiangrui Meng] use timeout from AkkaUtils; remove props from RegisteredExecutor
46d332d [Xiangrui Meng] fix a test
7947c18 [Xiangrui Meng] increase slack size for akka
4ab696a [Xiangrui Meng] bootstrap to retrieve driver spark conf
Task results are sent either directly via akka or indirectly via the block manager, based on whether the size of the serialized task result is smaller than spark.akka.frameSize. However, the result is actually sent back to the driver via the backend actor, which is initialized before receiving the SparkConf and hence doesn't know spark.akka.frameSize. That is the root cause of SPARK-1112.
A quick fix is using the min frame size to decide which route to go.
This PR also fixes SPARK-2156.
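The routing decision described above can be sketched as follows. Types and names here are illustrative stand-ins, not Spark's actual classes: since the backend actor may not know the driver's spark.akka.frameSize, the check uses only the guaranteed minimum max frame size, minus the reserved overhead.

```scala
// Minimal sketch of the two result routes; names are illustrative,
// not Spark's actual TaskResult classes.
sealed trait SketchTaskResult
case class DirectResult(bytes: Array[Byte]) extends SketchTaskResult // sent over Akka
case class IndirectResult(blockId: String) extends SketchTaskResult  // fetched from the block manager

// Guaranteed minimum of the max frame size, usable without the driver's conf.
val minFrameSize: Int = 10 * 1024 * 1024
val reserved: Int = 200 * 1024 // slack for Akka message overhead

def packageResult(bytes: Array[Byte], blockId: String): SketchTaskResult =
  if (bytes.length <= minFrameSize - reserved) DirectResult(bytes)
  else IndirectResult(blockId)
```

In the conservative case a result just over the threshold goes through the block manager, which is slower but always fits.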