
Conversation

@nongli (Contributor) commented Dec 3, 2015

This patch adds timing metrics to portions of the execution. Row-at-a-time pipelining means these
metrics can only be added at the end of blocking phases. The metrics added in this patch should be
interpreted as an event timeline where 0 marks the beginning of the execution.

The metric is recorded when the last task of the blocking operator completes. This makes sense
when interpreted as a timeline and gives a measure of the wall-clock latency of that phase.

For example, consider a plan with
Scan -> Agg -> Exchange -> Agg -> Project. There are two blocking phases: the end of the first
agg (when it has finished aggregating but not yet returned results), and likewise for the second agg.
This patch adds a timing to each, measured as the time since the beginning. For example it might
contain
  Agg1: 10 seconds
  Agg2: 11 seconds
This captures a timeline, so it means Scan + Agg1 took 10 seconds, and Agg1 returning results +
exchange + Agg2 took 1 second. We can post-process this timeline to get the time spent entirely
in one pipeline.

This adds the metrics to Agg and Sort, but we should add more in subsequent patches. The patch
also does not account for clock skew between the different machines. If this is a problem in
practice, we can adjust for that as well when NTP is not used.
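The post-processing mentioned above can be sketched as follows. This is a minimal illustration, not part of the patch; `pipeline_durations` is a hypothetical helper name, and the input is assumed to be the ordered list of blocking-phase finish times (seconds since query start):

```python
def pipeline_durations(finish_times):
    """finish_times: list of (operator_name, seconds_since_query_start),
    ordered by position in the plan. Returns the elapsed time attributable
    to each pipeline by differencing consecutive finish times."""
    durations = []
    prev = 0.0  # 0 marks the beginning of the execution
    for name, t in finish_times:
        durations.append((name, t - prev))
        prev = t
    return durations

# The example from the description: Agg1 finishes at 10 s, Agg2 at 11 s.
timeline = [("Agg1", 10.0), ("Agg2", 11.0)]
print(pipeline_durations(timeline))  # [('Agg1', 10.0), ('Agg2', 1.0)]
```

The first entry covers Scan + Agg1; the second covers Agg1 returning results + exchange + Agg2, matching the reading given above.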

@nongli (Contributor, Author) commented Dec 3, 2015

q65

I've attached an example running a TPC-DS query (q65). We can see some useful things. Looking at the chain of 3 aggregations in the top middle, the first took 36.8 seconds to complete. This includes all the time in the operators before it. The next aggregation also finished at 36.8 seconds (meaning it is itself simple), and the third aggregation finished at 1 min.

A Member commented:

This is a nice feature for performance analysis!

I am wondering if we can detect whether the machine clocks are synced when starting Spark?

@SparkQA commented Dec 3, 2015

Test build #47104 has finished for PR 10116 at commit c386e56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jan 29, 2016

@nongli ISTM these are good performance metrics for users. Are you still working on this?
One question: why do you compute the finish time from the plan start instead of an accumulated duration for each blocking stage?

@maropu (Member) commented Feb 2, 2016

@nongli ping

@nongli (Contributor, Author) commented Feb 2, 2016

I don't have a lot of time to work on this. Do you have any interest in taking it over?

With regard to your question, I think there are many ways we can present this. I generally find the "user wall-clock timeline" to be the easiest to think about.

@maropu (Member) commented Feb 3, 2016

@nongli Okay, I'll do it.

@rxin (Contributor) commented Jun 15, 2016

@maropu can you take this over?

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
maropu added a commit to maropu/spark that referenced this pull request Nov 18, 2016