[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines. #10116

nongli · 2015-12-03T01:00:20Z

This patch adds timing metrics to portions of the execution. Row at a time pipelining means these
metrics can only be added at the end of blocking phases. The metrics add in this patch should be
interpreted as a event timeline where 0 means the beginning of the execution.

The metric is computed as when the last task of the blocking operator computes. This makes sense
if it is interpreted as a timeline and gives a measure of wall clock latency of that phase.

For example, in a plan with
Scan -> Agg -> Exchange -> Agg -> Project. There are two blocking phases: the end of the first
agg (when it's done the agg and not yet returned the results) and similar for the second agg.
This patch adds a timing to each that is the time since the beginning. For example it might
contain
Agg1: 10 seconds
Agg2: 11 seconds
This captures a timeline so it means that Scan + Agg1 took 10 seconds. Agg1 returning results +
exchange + agg2 took 1 second. We can post process this timeline to get the time spent entirely
in one pipeline.

This adds the metrics to Agg and Sort but we should add more in subsequent patches. The patch
also does not account of clock skew between the different machines. If this is a problem in
practice, we can adjust for that as well if ntp is not used.

This patch adds timing metrics to portions of the execution. Row at a time pipelining means these metrics can only be added at the end of blocking phases. The metrics add in this patch should be interpreted as a event timeline where 0 means the beginning of the execution. The metric is computed as when the last task of the blocking operator computes. This makes sense if it is interpreted as a timeline and gives a measure of wall clock latency of that phase. For example, in a plan with Scan -> Agg -> Exchange -> Agg -> Project. There are two blocking phases: the end of the first agg (when it's done the agg and not yet returned the results) and similar for the second agg. This patch adds a timing to each that is the time since the beginning. For example it might contain Agg1: 10 seconds Agg2: 11 seconds This captures a timeline so it means that Scan + Agg1 took 10 seconds. Agg1 returning results + exchange + agg2 took 1 second. We can post process this timeline to get the time spent entirely in one pipeline. This adds the metrics to Agg and Sort but we should add more in subsequent patches. The patch also does not account of clock skew between the different machines. If this is a problem in practice, we can adjust for that as well if ntp is not used.

nongli · 2015-12-03T01:03:08Z

I've attached an example running a tpcds queries. We can see some useful things. Looking at the chain of 3 aggregations in the top middle, the first took 36.8 seconds to complete. This includes all the time in the operators before. The next aggregation finished at 36.8 (meaning it is itself simple) and the third aggregation finished at 1 min.

gatorsmile · 2015-12-03T01:11:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala

This is a nice feature for performance analysis!

I am wondering if we can detect whether the machine clocks are synced when starting Spark?

SparkQA · 2015-12-03T01:30:59Z

Test build #47104 has finished for PR 10116 at commit c386e56.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2016-01-29T02:43:37Z

@nongli ISTM this is good performance metrics for users. You do not keep working on it?
One question; why do you compute finish time from a plan start instead of accumulated duration time for each blocking stage?

maropu · 2016-02-02T04:45:47Z

@nongli ping

nongli · 2016-02-02T21:33:26Z

I don't have a lot of time to work on this. Do you have any interest in taking it over?

with regard to your question, I think there are many ways we can present this. I generally find the "user wall clock timeline" to be the easiest to think about.

maropu · 2016-02-03T09:16:37Z

@nongli okay and I'll do it.

rxin · 2016-06-15T22:21:54Z

@maropu can you take this over?

gatorsmile reviewed Dec 3, 2015
View reviewed changes

asfgit closed this in 1a33f2e Jun 15, 2016

maropu mentioned this pull request Jun 17, 2016

[SPARK-12113][SQL] Add some timing metrics for blocking pipelines #13736

Closed

maropu added a commit to maropu/spark that referenced this pull request Nov 18, 2016

Rework based on apache#10116

ab3f0b5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines. #10116

[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines. #10116

Uh oh!

nongli commented Dec 3, 2015

Uh oh!

nongli commented Dec 3, 2015

Uh oh!

gatorsmile Dec 3, 2015

Uh oh!

SparkQA commented Dec 3, 2015

Uh oh!

maropu commented Jan 29, 2016

Uh oh!

maropu commented Feb 2, 2016

Uh oh!

nongli commented Feb 2, 2016

Uh oh!

maropu commented Feb 3, 2016

Uh oh!

rxin commented Jun 15, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines. #10116

[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines. #10116

Uh oh!

Conversation

nongli commented Dec 3, 2015

Uh oh!

nongli commented Dec 3, 2015

Uh oh!

gatorsmile Dec 3, 2015

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Dec 3, 2015

Uh oh!

maropu commented Jan 29, 2016

Uh oh!

maropu commented Feb 2, 2016

Uh oh!

nongli commented Feb 2, 2016

Uh oh!

maropu commented Feb 3, 2016

Uh oh!

rxin commented Jun 15, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants