[SPARK-11206] Support SQL UI on the history server (resubmit) #10061
Conversation
Just to make sure that it's not overlooked, please see my comment on the other PRs regarding whether this needs to actually hold a SQLListener instance or whether it can simply be an AtomicBoolean.
Many unit tests use sqlContext.listener. Could you suggest how we would update those tests if we changed this to an AtomicBoolean?
Ah, I see that we need this in order to be able to return a value from createListenerAndUI.
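To make the trade-off concrete, here is a minimal, hypothetical sketch of why returning a listener instance from createListenerAndUI is more useful than setting a bare AtomicBoolean flag. The names SQLListener and createListenerAndUI follow the discussion above, but their bodies here are illustrative stand-ins, not Spark's actual implementation:

```scala
// Hypothetical sketch only: the real Spark classes have different signatures.
object ListenerSketch {
  class SQLListener {
    @volatile var completedExecutions: Int = 0
    def onExecutionEnd(): Unit = completedExecutions += 1
  }

  // Returning the listener (rather than flipping an AtomicBoolean "created"
  // flag) lets callers such as unit tests inspect its state, which is what
  // sqlContext.listener-based tests rely on.
  def createListenerAndUI(): SQLListener = {
    val listener = new SQLListener
    // ... register the listener on the event bus and attach the SQL tab ...
    listener
  }

  def main(args: Array[String]): Unit = {
    val listener = createListenerAndUI()
    listener.onExecutionEnd()
    assert(listener.completedExecutions == 1)
    println("ok")
  }
}
```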
One question that I had from the other PR: why is it okay to merge the event log streams from different SQLContexts into the same UI tab? Are relevant identifiers used by those SQL contexts (such as execution IDs) guaranteed to be unique as long as the SQLContexts belong to the same SparkContext?
Hi @JoshRosen, the execution IDs are from the static
Test build #46949 has finished for PR 10061 at commit
Looks OK to me; I can't really comment on the multiple-UIs issue (aside from confirming that what Carson said is correct, if that functionality is desired).
Single tab is fine, just wanted to understand and make sure that it would be safe.
Alright, if no one else has comments I'll merge this after: retest this please.
Test build #47161 has finished for PR 10061 at commit
Merging to master.
      executorMetricsUpdateToJson(metricsUpdate)
    case blockUpdated: SparkListenerBlockUpdated =>
      throw new MatchError(blockUpdated) // TODO(ekl) implement this
    case _ => parse(mapper.writeValueAsString(event))
What is the reason to add this line?
It is very possible that we silently pull in unnecessary information. If we have new event types, we should handle those explicitly instead of relying on this line. I am proposing to revert this line.
As I've said in similar conversations in other contexts, I'm strongly opposed to what you're suggesting. In fact I'm an advocate for exactly the opposite, and that's why I filed SPARK-12141.
BTW just removing that line would break the feature this patch is implementing, unless you write a whole lot of code to manually serialize all the SQL-related events.
Events are a public API, and they should be carefully crafted, since changing them affects user applications (including event logs). If there is unnecessary information in the event, then it's a bug in the event definition, not here.
> Events are a public API, and they should be carefully crafted, since changing them affects user applications (including event logs). If there is unnecessary information in the event, then it's a bug in the event definition, not here.
Yeah, I totally agree. However, my concern is that having this line here makes it harder for developers to spot issues during development. Since the serialization works automatically, reviewing what gets serialized and which methods are called during serialization is no longer a mandatory step, which makes the auditing work much harder. Although requiring every event to be handled explicitly means more work for the developer, when we review a pull request we can see clearly what will be serialized and how each event is serialized. What do you think?
btw, if I am missing any context, please let me know :)
I'm perfectly ok with making auditing of these events harder if it means you're not writing manual serialization and de-serialization code like JsonProtocol.scala. The drawbacks of the latter are much worse for code readability and maintainability.
BTW this is really not the right forum to discuss this. If you want to discuss big changes like you're proposing, please discuss on the bug I opened (referenced above) or start a thread on the mailing list.
Your suggestion of removing that line will just break the feature and, to restore it, would require an insane amount of code motion and new code to be written. To start with, the SQL events are not even available in "core", so you can't reference the classes here.
Yes, I think this is a terrible idea. Actually, back when we introduced magic serialization, I was against it.
Could you please comment on the bug I opened or on the mailing list? Commenting on a long-closed GitHub PR is not really the best forum.
I'd really like to understand why you think automatic serialization is a bad idea, since we use it in so many places. I think exactly the opposite - manual serialization is unmaintainable, error-prone, and a waste of developer time.
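For readers following this thread, the dispatch pattern under debate can be sketched in isolation. This is an illustrative stand-in, not Spark's JsonProtocol: the real fallback arm calls parse(mapper.writeValueAsString(event)) via Jackson, which is replaced here by the event's toString so the sketch stays self-contained:

```scala
object JsonProtocolSketch {
  sealed trait SparkListenerEvent
  case class SparkListenerBlockUpdated(blockId: String) extends SparkListenerEvent
  case class SparkListenerSQLExecutionStart(executionId: Long) extends SparkListenerEvent

  def sparkEventToJson(event: SparkListenerEvent): String = event match {
    // Known core events get explicit, hand-written serializers.
    case b: SparkListenerBlockUpdated =>
      throw new MatchError(b) // not yet supported, as in the patch
    // Everything else (e.g. SQL events defined outside "core", which this file
    // cannot reference directly) falls through to automatic serialization;
    // Spark uses Jackson here, this sketch just embeds the event's toString.
    case other =>
      s"""{"Event":"${other.toString}"}"""
  }

  def main(args: Array[String]): Unit = {
    val json = sparkEventToJson(SparkListenerSQLExecutionStart(1L))
    assert(json.contains("SparkListenerSQLExecutionStart"))
    println(json)
  }
}
```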
Resubmit apache#9297 and apache#9991

On the live web UI, there is a SQL tab which provides valuable information for the SQL query. But once the workload is finished, we won't see the SQL tab on the history server. It will be helpful if we support SQL UI on the history server so we can analyze it even after its execution.

To support SQL UI on the history server:
1. I added an onOtherEvent method to the SparkListener trait and post all SQL related events to the same event bus.
2. Two SQL events, SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd, are defined in the sql module.
3. The new SQL events are written to the event log using Jackson.
4. A new trait SparkHistoryListenerFactory is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using java.util.ServiceLoader.

Author: Carson Wang <[email protected]>
Closes apache#10061 from carsonwang/SqlHistoryUI.

Conflicts:
  sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

Added more features:
- More details to the SparkPlanInfo
- Added the execution plan to action

Conflicts:
  sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SparkPlanGraph.scala
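The ServiceLoader mechanism named in point 4 can be sketched as follows. The trait name SparkHistoryListenerFactory comes from the description above, but its shape here is a simplified assumption; the point is only that "core" discovers implementations registered under META-INF/services without referencing any SQL classes:

```scala
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._

object HistoryLoaderSketch {
  // Simplified assumed shape of the factory trait described in the PR.
  trait SparkHistoryListenerFactory {
    def createListeners(): Seq[AnyRef]
  }

  def main(args: Array[String]): Unit = {
    // The history server asks ServiceLoader for any factory registered under
    // META-INF/services on the classpath; the sql module would register its
    // implementation there, so "core" never depends on SQL classes directly.
    val factories =
      ServiceLoader.load(classOf[SparkHistoryListenerFactory]).asScala.toList
    // No provider file is registered on this classpath, so the list is empty.
    assert(factories.isEmpty)
    println(s"discovered ${factories.size} factories")
  }
}
```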