-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines. #10116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This patch adds timing metrics to portions of the execution. Row at a time pipelining means these metrics can only be added at the end of blocking phases. The metrics add in this patch should be interpreted as a event timeline where 0 means the beginning of the execution. The metric is computed as when the last task of the blocking operator computes. This makes sense if it is interpreted as a timeline and gives a measure of wall clock latency of that phase. For example, in a plan with Scan -> Agg -> Exchange -> Agg -> Project. There are two blocking phases: the end of the first agg (when it's done the agg and not yet returned the results) and similar for the second agg. This patch adds a timing to each that is the time since the beginning. For example it might contain Agg1: 10 seconds Agg2: 11 seconds This captures a timeline so it means that Scan + Agg1 took 10 seconds. Agg1 returning results + exchange + agg2 took 1 second. We can post process this timeline to get the time spent entirely in one pipeline. This adds the metrics to Agg and Sort but we should add more in subsequent patches. The patch also does not account of clock skew between the different machines. If this is a problem in practice, we can adjust for that as well if ntp is not used.
|
I've attached an example running a tpcds queries. We can see some useful things. Looking at the chain of 3 aggregations in the top middle, the first took 36.8 seconds to complete. This includes all the time in the operators before. The next aggregation finished at 36.8 (meaning it is itself simple) and the third aggregation finished at 1 min. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice feature for performance analysis!
I am wondering if we can detect whether the machine clocks are synced when starting Spark?
|
Test build #47104 has finished for PR 10116 at commit
|
|
@nongli ISTM this is good performance metrics for users. You do not keep working on it? |
|
@nongli ping |
|
I don't have a lot of time to work on this. Do you have any interest in taking it over? with regard to your question, I think there are many ways we can present this. I generally find the "user wall clock timeline" to be the easiest to think about. |
|
@nongli okay and I'll do it. |
|
@maropu can you take this over? |

This patch adds timing metrics to portions of the execution. Row at a time pipelining means these
metrics can only be added at the end of blocking phases. The metrics add in this patch should be
interpreted as a event timeline where 0 means the beginning of the execution.
The metric is computed as when the last task of the blocking operator computes. This makes sense
if it is interpreted as a timeline and gives a measure of wall clock latency of that phase.
For example, in a plan with
Scan -> Agg -> Exchange -> Agg -> Project. There are two blocking phases: the end of the first
agg (when it's done the agg and not yet returned the results) and similar for the second agg.
This patch adds a timing to each that is the time since the beginning. For example it might
contain
Agg1: 10 seconds
Agg2: 11 seconds
This captures a timeline so it means that Scan + Agg1 took 10 seconds. Agg1 returning results +
exchange + agg2 took 1 second. We can post process this timeline to get the time spent entirely
in one pipeline.
This adds the metrics to Agg and Sort but we should add more in subsequent patches. The patch
also does not account of clock skew between the different machines. If this is a problem in
practice, we can adjust for that as well if ntp is not used.