
Conversation

@clockfly
Contributor

@clockfly clockfly commented May 6, 2016

What changes were proposed in this pull request?

Improve the physical plan visualization by adding meta info such as the table name and file path for data sources.

The meta info fields InputPaths and TableName are newly added. Example:

scala> spark.range(10).write.saveAsTable("tt")
scala> spark.sql("select * from tt").explain()
== Physical Plan ==
WholeStageCodegen
:  +- BatchedScan HadoopFiles[id#13L] Format: ParquetFormat, InputPaths: file:/home/xzhong10/spark-linux/assembly/spark-warehouse/tt, PushedFilters: [], ReadSchema: struct<id:bigint>, TableName: default.tt

How was this patch tested?

manual tests.

Changes for UI:
Before:
[screenshot: ui_before_change]

After:
[screenshot: fix_long_string]

[screenshot: for_load]

@clockfly clockfly changed the title [SPARK-14476][SQL] Improves the output of dataset.explain by adding source table names and file paths. [SPARK-14476][SQL][WIP] Improves the output of dataset.explain by adding source table names and file paths. May 6, 2016
@davies
Contributor

davies commented May 6, 2016

@clockfly Can we show the table name instead of HadoopFiles, or together with it? If there is no table name, we could use the rightmost part of the path.
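
(For illustration, a minimal sketch of this rule with hypothetical names, not the PR's actual code: use the table name when available, otherwise the last segment of the path.)

```scala
// Hypothetical helper (not the PR's code): prefer the table name, otherwise
// fall back to the rightmost segment of the input path.
def scanLabel(tableName: Option[String], inputPath: String): String =
  tableName.getOrElse(inputPath.stripSuffix("/").split("/").last)

scanLabel(Some("default.tt"), "file:/home/xzhong10/spark-linux/assembly/spark-warehouse/tt")  // "default.tt"
scanLabel(None, "file:/home/xzhong10/people.json")                                            // "people.json"
```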

@SparkQA

SparkQA commented May 6, 2016

Test build #57962 has finished for PR 12947 at commit b1d01c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly
Contributor Author

clockfly commented May 6, 2016

@davies

I made some changes to the UI; please check whether it is better now.

scala> spark.sql("select * from tt").explain()
== Physical Plan ==
WholeStageCodegen
:  +- BatchedScan HadoopFiles default.tt[id#0L] Format: ParquetFormat, InputPaths: file:/home/xzhong10/spark-linux/assembly/spark-warehouse/tt, PushedFilters: [], ReadSchema: struct<id:bigint>

[screenshot: change_v2]

@clockfly clockfly changed the title [SPARK-14476][SQL][WIP] Improves the output of dataset.explain by adding source table names and file paths. [SPARK-14476][SQL][WIP] Improve the physical plan visualization by adding meta info like table name and file path for data source. May 6, 2016
@SparkQA

SparkQA commented May 6, 2016

Test build #57995 has finished for PR 12947 at commit 438d70e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented May 6, 2016

LGTM.

@marmbrus Could you take a quick look on this?

@yhuai
Contributor

yhuai commented May 6, 2016

@clockfly This PR does not truncate those long strings caused by long paths, right?

@clockfly
Contributor Author

clockfly commented May 7, 2016

@clockfly clockfly changed the title [SPARK-14476][SQL][WIP] Improve the physical plan visualization by adding meta info like table name and file path for data source. [SPARK-14476][SQL] Improve the physical plan visualization by adding meta info like table name and file path for data source. May 7, 2016
override def simpleString: String = {
-    val metadataEntries = for ((key, value) <- metadata.toSeq.sorted) yield s"$key: $value"
+    val metadataEntries = for ((key, value) <- metadata.toSeq.sorted) yield {
+      key + ": " + StringUtils.abbreviate(value, 100)
+    }

Can you play with some long paths and see if 100 is a good value? (It would also be good to put a screenshot in the PR description.)
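
(To get a feel for where a 100-character cap cuts off a long path, here is a quick check using commons-lang3's StringUtils.abbreviate, which is already on Spark's classpath; the path below is made up.)

```scala
import org.apache.commons.lang3.StringUtils

// A made-up long path, just to see where the cutoff lands.
val longPath = "file:/home/xzhong10/" + ("a" * 80) + "/" + ("b" * 80) + "/part-00000.parquet"

// abbreviate keeps the first 97 characters and appends "...", so the result is exactly 100 chars.
StringUtils.abbreviate(longPath, 100)
StringUtils.abbreviate(longPath, 100).length  // 100
```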

@clockfly
Contributor Author

@yhuai
Thanks for the reminder; the CSS has been updated to handle long tooltips.
[screenshot: fix_long_string]

@SparkQA

SparkQA commented May 10, 2016

Test build #58195 has finished for PR 12947 at commit b6b38a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 10, 2016

"HadoopFiles" isn't very useful, and sometimes the files are not even in Hadoop (e.g. it is just using Hadoop APIs to read S3). Can we say "scan" instead, and say the name of the data source?

e.g.

"parquet scan default.jt4"

@clockfly
Contributor Author

How does the new UI look?
[screenshot: fix_display_name]

And for explain:

scala> spark.sql("select * from jt4").explain()
== Physical Plan ==
WholeStageCodegen
:  +- BatchedScan Scan parquet default.jt4[id#0L] Format: ParquetFormat, InputPaths: file:/home/xzhong10/aaaaaaaaaa/bbbbbbbb/ccccccccccc/ddddddddd/eeeeeeee/ffffffffff/gggggggg/hhhhhh..., PushedFilters: [], ReadSchema: struct<id:bigint>
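
(A minimal sketch, with illustrative identifiers only, of how a label of the form "Scan <format> <table>" like the one above could be composed; it is not the PR's actual code.)

```scala
// Hypothetical helper composing "Scan <format>" plus the table name when one exists.
def scanNodeName(formatShortName: String, tableName: Option[String]): String =
  (Seq("Scan", formatShortName) ++ tableName).mkString(" ")

scanNodeName("parquet", Some("default.jt4"))  // "Scan parquet default.jt4"
scanNodeName("json", None)                    // "Scan json"
```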

@SparkQA

SparkQA commented May 10, 2016

Test build #58229 has finished for PR 12947 at commit f0a0951.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 10, 2016

What does it look like when there is no table, just files?

@clockfly
Contributor Author

Something like "Scan parquet", but without the table name suffix. I will show you an example.

@clockfly
Contributor Author

For load:

scala> spark.read.format("json").load("/home/xzhong10/people.json")
res5: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> res5.explain()
== Physical Plan ==
WholeStageCodegen
:  +- Scan json[age#20L,name#21] Format: JSON, InputPaths: file:/home/xzhong10/people.json, PushedFilters: [], ReadSchema: struct<age:bigint,name:string>

[screenshot: for_load]

@SparkQA

SparkQA commented May 10, 2016

Test build #58250 has finished for PR 12947 at commit b3e9775.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/* Break long strings such as file paths when showing tooltips */
.tooltip-inner {
  word-wrap: break-word;
}

Add a newline here

@davies
Contributor

davies commented May 10, 2016

Could you also update the screenshot in the PR description?

@clockfly
Contributor Author

@davies, Updated.

@SparkQA

SparkQA commented May 11, 2016

Test build #58318 has finished for PR 12947 at commit 59f816f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 11, 2016

Thanks - merging in master/2.0.

asfgit pushed a commit that referenced this pull request May 11, 2016
…meta info like table name and file path for data source.

## What changes were proposed in this pull request?
Improve the physical plan visualization by adding meta info like table name and file path for data source.

Meta info InputPaths and TableName are newly added. Example:
```
scala> spark.range(10).write.saveAsTable("tt")
scala> spark.sql("select * from tt").explain()
== Physical Plan ==
WholeStageCodegen
:  +- BatchedScan HadoopFiles[id#13L] Format: ParquetFormat, InputPaths: file:/home/xzhong10/spark-linux/assembly/spark-warehouse/tt, PushedFilters: [], ReadSchema: struct<id:bigint>, TableName: default.tt
```

## How was this patch tested?

manual tests.

Changes for UI:
Before:
![ui_before_change](https://cloud.githubusercontent.com/assets/2595532/15064559/3d423e3c-1388-11e6-8099-7803ef496c4d.jpg)

After:
![fix_long_string](https://cloud.githubusercontent.com/assets/2595532/15133566/8ad09e26-1696-11e6-939c-99b908249b9d.jpg)

![for_load](https://cloud.githubusercontent.com/assets/2595532/15157224/3ba95c98-171d-11e6-885a-de0ee8dec27c.jpg)

Author: Sean Zhong <[email protected]>

Closes #12947 from clockfly/spark-14476.

(cherry picked from commit 61e0bdc)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 61e0bdc May 11, 2016
@davies
Contributor

davies commented May 13, 2016

@clockfly It seems that this does not work with temporary tables; could you send a PR to fix that?
