[SPARK-27482][SQL][WEBUI] Show estimated BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page #24666
…on SparkSQL UI page
@cloud-fan, can you please have a look?
ok to test
Test build #105683 has finished for PR 24666 at commit
retest this please
Test build #105719 has finished for PR 24666 at commit
…anned by SparkStrategies
retest this please
Test build #105727 has finished for PR 24666 at commit
Test build #105745 has finished for PR 24666 at commit
  val accumulatorId: Long,
- val metricType: String)
+ val metricType: String,
+ val stats: Long = -1)
Not all metrics have corresponding statistics (e.g. peakMemory), and not all statistics are of Long type. We should think of a better place to carry the statistics.
or we can put a val stats: Option[Statistics] = None
@cloud-fan Thanks for your comments.
The idea is that each SQL metric can have a statistic value (-1 means not available/initialized). I set the statistic type to Long because a SQL metric's value is always of Long type as well: class SQLMetric(val metricType: String, initValue: Long = 0L)
Putting Option[Statistics] in SQLMetricInfo doesn't sound quite right, though. It would mean that every SQL metric has an attribute that includes rowCount, size and column stats.
Let me know your feedback, thanks in advance.
}
}

def stringStats(value: Long): String = {
We should handle stats in stringValue; different metrics may need to look at different stats and display different things.
one example: we can also display the difference between real row count and estimated row count, e.g. 10X, 0.01X, etc. Something like row count: 4, est: 40 (10X)
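The display suggested here could be sketched as follows. This is only an illustration: `RowCountDisplay`, `describe` and `ratioText` are hypothetical names, not part of Spark or of this PR, and the ratio is assumed to be estimated rows divided by actual rows, rounded to two decimals.

```scala
object RowCountDisplay {
  // Hypothetical helper: renders a ratio such as 10.0 as "10" and 0.01 as "0.01".
  private def ratioText(ratio: Double): String =
    BigDecimal(ratio)
      .setScale(2, BigDecimal.RoundingMode.HALF_UP)
      .bigDecimal
      .stripTrailingZeros
      .toPlainString

  // Builds the metric text from the actual row count and an optional estimate.
  // The ratio is only shown when the actual count is positive, to avoid dividing by zero.
  def describe(actual: Long, estimated: Option[Long]): String = estimated match {
    case Some(est) if actual > 0 =>
      s"row count: $actual, est: $est (${ratioText(est.toDouble / actual)}X)"
    case Some(est) =>
      s"row count: $actual, est: $est"
    case None =>
      s"row count: $actual"
  }
}
```

With these assumptions, `RowCountDisplay.describe(4L, Some(40L))` yields the string from the comment above, `row count: 4, est: 40 (10X)`, and a metric without an estimate falls back to the plain row count.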
@pengbo The title is a bit confusing... I think we should make it clearer, e.g.
SQLMetrics.createMetric(
  sparkContext,
  "number of output rows",
  logicalPlan.map(_.stats.rowCount.map(_.toLong).getOrElse(-1L)).getOrElse(-1L)))
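The nested `Option` handling in the last argument can be hard to read at a glance. The sketch below reproduces it with stand-in types so it runs without Spark: `Stats` and `Plan` are hypothetical simplifications of Spark's `Statistics` and `LogicalPlan`, and `flattened` is an equivalent form offered only as an illustration, not as what the PR uses.

```scala
object EstimatedRowsSketch {
  // Hypothetical stand-ins: only the fields needed for this example are modeled.
  final case class Stats(rowCount: Option[BigInt])
  final case class Plan(stats: Stats)

  // Same shape as the expression in the diff: -1 when there is no plan
  // or when the plan has no row count estimate.
  def nested(plan: Option[Plan]): Long =
    plan.map(_.stats.rowCount.map(_.toLong).getOrElse(-1L)).getOrElse(-1L)

  // Equivalent form: flatten the two Options first, then apply the sentinel once.
  def flattened(plan: Option[Plan]): Long =
    plan.flatMap(_.stats.rowCount).map(_.toLong).getOrElse(-1L)
}
```

Both functions return -1 for a missing plan and for a plan without a row count, which is the "not available" sentinel discussed in the review thread.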
IIRC, for file sources there is usually only a sizeInBytes stat at the logical plan level, so the estimated numOutputRows for the logical plan should be empty for file sources.
What is the scenario of this PR?
For a file source table, there will be row count stats if CBO is enabled.
okay, thanks
What changes were proposed in this pull request?
Currently, the SparkSQL UI page shows only actual metric info in each SparkPlan node. However, statistics info may help us understand how the plan is designed and why it runs slowly. This PR shows the numOutputRows metric's statistic info for the BroadcastHashJoinExec node on the SparkSQL UI page when it's available.
The main changes:
Add a stats field to SQLMetric and pass it to SQLPlanMetric so that it is shown on the UI page when it's available
Initialize numOutputRows with the rowCount stats from the logicalPlan of BroadcastHashJoinExec, thanks to [SPARK-27747][SQL] add a logical plan link in the physical plan #24626
How was this patch tested?
A related unit test has been added, and the UI has been tested manually.