[SPARK-25440][SQL] Dumping query execution info to a file #22429
Conversation
Test build #96093 has finished for PR 22429 at commit
@rednaxelafx Please take a look at the PR.
Test build #96097 has finished for PR 22429 at commit
Test build #96109 has finished for PR 22429 at commit
rednaxelafx left a comment:
Thank you so much for the proposal, Max!
While I really like the feature you're proposing (yes, please get it in!), as it's really useful for debugging / diagnosis, there are a few things in the current implementation that might be deal breakers for me:
- Similar to what @hvanhovell mentioned, if we want to have a clean feature at the end, it would be better to pass the max fields argument down the call chain instead of changing the Spark conf. Setting a Spark conf is a process-wide global change that affects everybody, including code running in other threads. A save-use-restore sequence only works in a single-threaded scenario and is problematic in a multi-threaded one (see the first sketch after this comment). It would definitely be a lot more code change to pass that argument down, but I think it's worth it (the existing code before this change is too rigid to begin with anyway).
- The use of `java.io.StringWriter`. I totally understand the motivation to use `StringWriter` for the regular path so that both the regular and file-dumping paths can share the same implementation code via the `Writer` interface.
But if you take a look at how StringWriter is implemented in OpenJDK8u (http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/2660b127b407/src/share/classes/java/io/StringWriter.java#l43):

```java
public class StringWriter extends Writer {
    private StringBuffer buf;
```

it's actually backed by a `StringBuffer`, which is worse than a `StringBuilder` in terms of the extra synchronization overhead implied.
(Yes, modern JVMs do try to optimize StringBuffers via various techniques, including biased locking, lock elimination, etc., but good performance is just less guaranteed.)
The best-performing way to do this is probably to implement a custom Writer that uses some sort of a "rope" implementation as the backing store, so that premature string concatenation is kept to a minimum. Whether or not we want to go that way can be a subject for further discussion, though.
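For the first point, a minimal sketch of why save-use-restore of a global conf is fragile compared to passing the limit down the call chain. Names like `GlobalConf` and `render` are hypothetical, not actual Spark internals:

```scala
// A process-wide mutable setting: any change is visible to all threads.
object GlobalConf {
  @volatile var maxToStringFields: Int = 25
}

// Save-use-restore: another thread rendering concurrently observes the
// temporary value, and the restore can clobber that thread's own change.
def renderWithGlobalConf(fields: Seq[String], maxFields: Int): String = {
  val saved = GlobalConf.maxToStringFields
  GlobalConf.maxToStringFields = maxFields
  try fields.take(GlobalConf.maxToStringFields).mkString(", ")
  finally GlobalConf.maxToStringFields = saved
}

// Passing the limit as an argument keeps the change local to this call.
def render(fields: Seq[String], maxFields: Int): String =
  fields.take(maxFields).mkString(", ")
```

For the second point, a sketch of an unsynchronized Writer backed by a StringBuilder instead of StringWriter's StringBuffer. A rope-style backing store would go further by deferring concatenation; this is just the simplest step, not a proposed final design:

```scala
import java.io.Writer

class StringBuilderWriter(initialCapacity: Int = 1024) extends Writer {
  // java.lang.StringBuilder is unsynchronized, unlike the StringBuffer
  // inside java.io.StringWriter.
  private val buf = new java.lang.StringBuilder(initialCapacity)

  override def write(cbuf: Array[Char], off: Int, len: Int): Unit =
    buf.append(cbuf, off, len)

  override def write(str: String): Unit = buf.append(str)

  override def flush(): Unit = ()  // nothing to flush; all data is in memory
  override def close(): Unit = ()  // no underlying resource to release

  override def toString: String = buf.toString
}
```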
Test build #96154 has finished for PR 22429 at commit
May I ask you @hvanhovell @zsxwing to review the PR one more time?
@gatorsmile @HyukjinKwon @viirya @rednaxelafx Are you OK with the proposed changes, or is there something that blocks the PR for now?
Thanks @MaxGekk for the contact list! I will ping them to gather more thoughts.
jenkins, retest this, please

Test build #98461 has finished for PR 22429 at commit

Test build #98465 has finished for PR 22429 at commit

Test build #98475 has finished for PR 22429 at commit
@hvanhovell Could you look at the PR, please?
@cloud-fan @gatorsmile May I ask you to look at the PR? It has been stuck for a while for unclear reasons, but I believe the proposed method …
This is hard to review. Do you mean we should add …

Not to all, but only to methods involved in producing the textual representation of different plans. Those are … The main changes are concentrated in …
@MaxGekk, I think the implementation looks complicated here: this looks like it introduces …
```scala
// before:
override def simpleString: String = {
  val orderByString = Utils.truncatedString(sortOrder, "[", ",", "]")
  val outputString = Utils.truncatedString(output, "[", ",", "]")

// after:
override def simpleString(maxFields: Option[Int]): String = {
```
Can we just get rid of the `maxFields`? I think this makes the PR hard to read.
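For context, a hedged sketch of the parameter-threading pattern under discussion: a truncating helper that takes the limit as an argument instead of reading a global conf. This is simplified, not the actual `Utils.truncatedString` code, and the default of 25 is an assumption:

```scala
def truncatedString[T](
    seq: Seq[T],
    start: String,
    sep: String,
    end: String,
    maxFields: Option[Int]): String = {
  val limit = maxFields.getOrElse(25)  // assumed default when no limit is passed
  if (seq.length > limit) {
    // Keep the first `limit` fields and note how many were dropped.
    seq.take(limit).mkString(start, sep, s", ... ${seq.length - limit} more fields$end")
  } else {
    seq.mkString(start, sep, end)
  }
}
```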
I took a super quick pass. The change actually looks quite okay to me in general.
@HyukjinKwon @cloud-fan Thank you for looking at the PR. So, if I split the PR into 2 PRs: …, it would be better for review, right?

Yes, it is.

I just had a concern that there are many more places (~10) besides the …
@boy-uber, for structured streaming, let's do it outside of this PR. I think the actual change of this PR can be small (1.). We can change this API for structured streaming later if needed, since this is just an internal private API. Change (2.) can be shared, I guess, even if we go for the structured streaming stuff.
Hi @HyukjinKwon, thanks for the comment! My original suggestion was not related to structured streaming. It was to support structured logging for the Spark plan, for example, getting the Spark plan as a JSON blob and sending it to a Spark Listener. This will make it easy for Spark users to write a Spark Listener plugin to get the plan and process the JSON in their code. Then people could parse and analyze a large number of Spark applications based on their plans. One use case for this is using machine learning to predict resource usage based on the Spark plans. This may not be related to this PR. If you or other people are interested, we could talk offline.
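A rough sketch of that listener idea, assuming Spark's public `QueryExecutionListener` API and `TreeNode.toJSON`; `shipToAnalyticsStore` is a hypothetical sink, not a Spark API:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

class PlanJsonListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    // TreeNode.toJSON serializes the plan tree as a JSON string.
    shipToAnalyticsStore(qe.optimizedPlan.toJSON)
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()

  // Hypothetical sink; replace with whatever transport the plugin uses.
  private def shipToAnalyticsStore(json: String): Unit = println(json)
}

// Registration (on an existing SparkSession):
// spark.listenerManager.register(new PlanJsonListener)
```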
Oops, I rushed to read. Yeah, it still sounds related but orthogonal. Let's move it to the mailing list. That should be the best place to discuss further.
I am closing the PR since a part of it has been merged in #23018 already, and the rest is coming soon.
Here is a PR that introduces …
The PR puts a limit on the size of the debug string generated for a tree node, which helps to fix out-of-memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. An alternative solution was proposed in apache#23076, but that solution doesn't address all the cases that can cause a large query. This limit applies only to calls to treeString that don't pass a Writer, which makes it play nicely with apache#22429, apache#23018 and apache#23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI.
- A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans.
- When plans are truncated, "..." is printed to indicate that it isn't a full plan.
- A warning is printed the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.
Unit tests were created for the new SizeLimitedWriter. Also, a unit test for TreeNode was created that checks that a long plan is correctly truncated.
Closes apache#23169 from DaveDeCaprio/text-plan-size.
Lead-authored-by: Dave DeCaprio <[email protected]>
Co-authored-by: David DeCaprio <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
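For illustration, a minimal sketch of the size-limiting idea, not the actual SizeLimitedWriter merged in apache#23169: a wrapper that stops forwarding characters past a limit and marks the cut with "...":

```scala
import java.io.Writer

class SizeLimitedWriter(underlying: Writer, maxLength: Long) extends Writer {
  private var written = 0L
  private var truncated = false

  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
    // Forward only as many characters as still fit under the limit.
    val remaining = math.min(len.toLong, maxLength - written).toInt
    if (remaining > 0) {
      underlying.write(cbuf, off, remaining)
      written += remaining
    }
    if (remaining < len && !truncated) {
      underlying.write("...")  // mark that this is not the full plan
      truncated = true
    }
  }

  override def flush(): Unit = underlying.flush()
  override def close(): Unit = underlying.close()
}
```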
What changes were proposed in this pull request?
In the PR, I propose a new method for debugging queries by dumping info about their execution to a file. It saves the logical, optimized, and physical plans, similar to the `explain()` method, plus generated code. One of the advantages of this method over `explain` is that it doesn't truncate the output and doesn't materialize the full output as one string in memory, which can cause OOMs.

How was this patch tested?
Added a test which checks that the new method dumps correct info about a query.
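For illustration, hypothetical usage of such a dump-to-file method. The `debug.toFile` name follows the API that eventually landed via the follow-up PRs; the exact entry point may differ by Spark version:

```scala
// Assumes an existing SparkSession named `spark`.
val df = spark.range(100).selectExpr("id % 10 AS key").groupBy("key").count()

// Writes the parsed/analyzed/optimized/physical plans plus generated code,
// without the truncation that explain() applies.
df.queryExecution.debug.toFile("/tmp/query-execution.txt")
```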