Conversation

@carsonwang
Contributor

Resubmit #9297 and #9991
On the live web UI, there is a SQL tab that provides valuable information about SQL queries. But once the workload finishes, the SQL tab is not available on the history server. It would be helpful to support the SQL UI on the history server so we can analyze queries even after they have finished executing.

To support SQL UI on the history server:

  1. Added an onOtherEvent method to the SparkListener trait so that all SQL-related events can be posted to the same event bus.
  2. Defined two SQL events, SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd, in the sql module.
  3. The new SQL events are written to the event log using Jackson.
  4. Added a new trait, SparkHistoryListenerFactory, that lets the history server feed events to the SQL history listener. The SQL implementation is loaded at runtime via java.util.ServiceLoader. A sketch of how these pieces fit together follows this list.
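
A minimal sketch of how these pieces could fit together, based on the description above. The trait and event names come from this PR; the field list of SparkListenerSQLExecutionStart and the factory method signature are abbreviated assumptions, while SparkListenerSQLExecutionEnd's signature matches the test output quoted later in this thread.

    // core module ----------------------------------------------------------
    // Marker trait that every event extends, SQL events included.
    trait SparkListenerEvent

    trait SparkListener {
      // ...existing callbacks (onJobStart, onTaskEnd, ...) elided...

      // New catch-all hook: events that core does not know about arrive here.
      def onOtherEvent(event: SparkListenerEvent): Unit = { }
    }

    // Discovered via java.util.ServiceLoader, so the history server can build
    // the SQL listener and UI tab without core depending on the sql module.
    // (SparkConf/SparkUI stand-ins keep this sketch self-contained.)
    trait SparkConf
    trait SparkUI
    trait SparkHistoryListenerFactory {
      def createListeners(conf: SparkConf, sparkUI: SparkUI): Seq[SparkListener]
    }

    // sql module -------------------------------------------------------------
    case class SparkListenerSQLExecutionStart( // fields abbreviated/assumed
        executionId: Long,
        description: String,
        time: Long)
      extends SparkListenerEvent

    case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
      extends SparkListenerEvent

At replay time, the history server would do roughly ServiceLoader.load(classOf[SparkHistoryListenerFactory]) and route each logged SQL event to the resulting listeners through onOtherEvent.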

Contributor

Just to make sure that it's not overlooked, please see my comment on the other PRs regarding whether this needs to actually hold a SQLListener instance or whether it can simply be an AtomicBoolean.

Contributor Author

Many unit tests use sqlContext.listener. Can you please suggest how to update the unit tests if we change this to an AtomicBoolean?

Contributor

Ah, I see that we need this in order to be able to return a value from createListenerAndUI.

@JoshRosen
Contributor

One question that I had from the other PR: why is it okay to merge the event log streams from different SQLContexts into the same UI tab? Are relevant identifiers used by those SQL contexts (such as execution IDs) guaranteed to be unique as long as the SQLContexts belong to the same SparkContext?

@carsonwang
Contributor Author

Hi @JoshRosen, the execution IDs come from the static SQLExecution object, so I think they are always unique. Yes, previously each SQLContext had its own SQLListener. Since the SQL events are now sent to the same event bus, every SQLListener will receive SQL events from the other SQLContexts as well. If we still want one UI tab per SQLContext, I think we need to keep something like a SQLContext ID in the SQL event so that a SQLListener knows whether it needs to handle that event. Is there a strong requirement to keep one UI tab per SQLContext? I remember some users did not want so many UI tabs because they create many SQLContexts.
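
For context on why these IDs stay unique within a JVM: a rough sketch of that static counter, assuming SQLExecution hands out IDs from a single shared AtomicLong (the member names here are assumptions).

    import java.util.concurrent.atomic.AtomicLong

    // One JVM-wide counter shared by every SQLContext, so execution IDs never
    // collide even when several SQLContexts post to the same event bus.
    object SQLExecution {
      private val _nextExecutionId = new AtomicLong(0)
      def nextExecutionId: Long = _nextExecutionId.getAndIncrement
    }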

@SparkQA

SparkQA commented Dec 1, 2015

Test build #46949 has finished for PR 10061 at commit 8222a0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class SubExprEliminationState(isNull: String, value: String)
      • class SparkPlanInfo(
      • class SQLMetricInfo(
      • case class SparkListenerSQLExecutionStart(
      • case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)

@vanzin
Contributor

vanzin commented Dec 1, 2015

Looks ok to me; I can't really comment on the multiple-UIs issue (beyond confirming that what Carson said is correct, if that functionality is desired).

@JoshRosen
Contributor

Single tab is fine, just wanted to understand and make sure that it would be safe.

@vanzin
Contributor

vanzin commented Dec 3, 2015

alright, if no one else has comments I'll merge this after: retest this please.

@SparkQA

SparkQA commented Dec 3, 2015

Test build #47161 has finished for PR 10061 at commit 8222a0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class SparkPlanInfo(
      • class SQLMetricInfo(
      • case class SparkListenerSQLExecutionStart(
      • case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)

@vanzin
Contributor

vanzin commented Dec 4, 2015

Merging to master.

      executorMetricsUpdateToJson(metricsUpdate)
    case blockUpdated: SparkListenerBlockUpdated =>
      throw new MatchError(blockUpdated) // TODO(ekl) implement this
    // Fall back to Jackson for events defined outside core (e.g. the SQL events).
    case _ => parse(mapper.writeValueAsString(event))
Contributor

What is the reason to add this line?

Contributor

It is very possible that we silently pull in unnecessary information. If we add new event types, we should handle them explicitly instead of relying on this line. I propose reverting this line.

Contributor

As I've said in similar conversations elsewhere, I'm strongly opposed to what you're suggesting. In fact I advocate exactly the opposite, and that's why I filed SPARK-12141.

BTW, just removing that line would break the feature this patch implements, unless you write a whole lot of code to manually serialize all the SQL-related events.

Events are a public API, and they should be carefully crafted, since changing them affects user applications (including event logs). If there is unnecessary information in the event, then it's a bug in the event definition, not here.
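
To make the trade-off concrete, here is a hedged sketch of what that catch-all line buys. Jackson's ObjectMapper with the Scala module can serialize an arbitrary SparkListenerEvent case class reflectively, with no per-event code; the trait and event below are local stand-ins for the real Spark classes, and the demo object name is made up.

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    // Stand-in for org.apache.spark.scheduler.SparkListenerEvent.
    trait SparkListenerEvent

    // An event type that core knows nothing about, e.g. one from the sql module.
    case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
      extends SparkListenerEvent

    object JacksonFallbackDemo extends App {
      val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

      val event: SparkListenerEvent =
        SparkListenerSQLExecutionEnd(executionId = 1L, time = 42L)

      // No hand-written toJson per event type: the mapper reflects over the
      // runtime class of the case class.
      println(mapper.writeValueAsString(event))
      // prints: {"executionId":1,"time":42}
    }

Removing the case _ fallback would mean hand-writing a serializer in JsonProtocol for every new event type, including types defined outside core that JsonProtocol cannot even reference.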

Contributor

> Events are a public API, and they should be carefully crafted, since changing them affects user applications (including event logs). If there is unnecessary information in the event, then it's a bug in the event definition, not here.

Yeah, I totally agree. However, my concern is that having this line here makes it harder for developers to spot issues during development. Since the serialization works automatically, we are not making a self-review of what gets serialized, and which methods get called during serialization, a mandatory step, which makes auditing much harder. Although handling every event explicitly means more work for the developer, when we review a pull request we can clearly see what will be serialized and how each event is serialized. What do you think?

btw, if I am missing any context, please let me know :)

Contributor

I'm perfectly ok with making auditing of these events harder if it means not writing manual serialization and deserialization code like JsonProtocol.scala. The drawbacks of the latter are much worse for code readability and maintainability.

Contributor

BTW, this is really not the right forum for this discussion. If you want to propose big changes like this one, please comment on the bug I opened (referenced above) or start a thread on the mailing list.

Your suggestion of removing that line would simply break the feature, and restoring it would require an enormous amount of code motion and new code. To start with, the SQL events are not even available in core, so you can't reference their classes here.

Contributor

Yes, I think this is a terrible idea. Back when we introduced the magic serialization, I was actually against it.

Contributor

Could you guys please comment on the bug I opened, or on the mailing list? A long-closed GitHub PR is not really the best forum.

I'd really like to understand why you think automatic serialization is a bad idea, since we use it in so many places. I think exactly the opposite: manual serialization is unmaintainable, error-prone, and a waste of developer time.

Parth-Brahmbhatt pushed a commit to Parth-Brahmbhatt/spark that referenced this pull request Jul 25, 2016
Resubmit apache#9297 and apache#9991

Author: Carson Wang <[email protected]>

Closes apache#10061 from carsonwang/SqlHistoryUI.


Added more features:
  • More details to the SparkPlanInfo
  • Added the execution plan to the action
