Skip to content

Conversation

@pengbo
Copy link
Contributor

@pengbo pengbo commented May 21, 2019

What changes were proposed in this pull request?

Currently, the SparkSQL UI page shows only actual metric info in each SparkPlan node. However with statistics info may help us understand how the plan is designed and the reason why it runs slowly. This PR is to show numOutputRows metric's statistic info of BroadcastHashJoinExec node on SparkSQL UI page when it's available.

The main changes:

  1. Add stats field in SQLMetric and passing it to SQLPlanMetric to show on UI page when it's available
  2. Initialize numOutputRows with rowCount stats in logicalPlan of BroadcastHashJoinExec, thanks to [SPARK-27747][SQL] add a logical plan link in the physical plan #24626

SPARK-27482-1

How was this patch tested?

Regarding unit test has been added, manual UI test has been tested

@pengbo
Copy link
Contributor Author

pengbo commented May 21, 2019

@cloud-fan , Can you please have a look?

@cloud-fan
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented May 22, 2019

Test build #105683 has finished for PR 24666 at commit 18a14b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member

wangyum commented May 23, 2019

retest this please

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105719 has finished for PR 24666 at commit 18a14b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pengbo
Copy link
Contributor Author

pengbo commented May 23, 2019

retest this please

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105727 has finished for PR 24666 at commit d05d5f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 24, 2019

Test build #105745 has finished for PR 24666 at commit aae3653.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val accumulatorId: Long,
val metricType: String)
val metricType: String,
val stats: Long = -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not all metric has its corresponding statistics (e.g. peakMemory), and not all statistics are long type. We should think of a better place to carry the statistics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or we can put a val stats: Option[Statistics] = None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Thanks for your comments.

The idea is that each SQL metric can have a statistic value (-1 means not available/initialized). I set the statistic type to Long is because SQL Metric's value is always Long type as well. class SQLMetric(val metricType: String, initValue: Long = 0L)

Put Option[Statistics] in SQLMetricInfo doesn't sound quite right though. It means that all SQL metrics have an attribute including rowCount, size & column stats.

Let me know your feedback, thanks in advance.

}
}

def stringStats(value: Long): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should handle stats in stringValue, different metrics may need to look at different stats and display different things.

Copy link
Contributor

@cloud-fan cloud-fan May 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one example: we can also display the difference between real row count and estimated row count, e.g. 10X, 0.01X, etc. Something like row count: 4, est: 40 (10X)

@gengliangwang
Copy link
Member

gengliangwang commented May 29, 2019

@pengbo The title is a bit confusing...I think we should make it more clear, e.g.
”Show estimated BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page”

SQLMetrics.createMetric(
sparkContext,
"number of output rows",
logicalPlan.map(_.stats.rowCount.map(_.toLong).getOrElse(-1L)).getOrElse(-1L)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, for file sources, usually there is only sizeInBytes stats in logical plan level. So the estimated numOutputRows for logical plan should be empty for file sources.
What is the scenario of this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for file source table, there will be row count stats if CBO is enabled.

@pengbo
Copy link
Contributor Author

pengbo commented Jun 2, 2019

”Show estimated BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page”

okay, thanks

@pengbo pengbo changed the title [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page [SPARK-27482][SQL][WEBUI] Show estimated BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page Jun 2, 2019
@pengbo pengbo closed this Sep 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants