Conversation

@MaxGekk (Member) commented Sep 15, 2018

What changes were proposed in this pull request?

In the PR, I propose a new method for debugging queries by dumping info about their execution to a file. It saves the logical, optimized and physical plans, similar to the explain() method, plus generated code. One advantage of the method over explain is that it doesn't truncate output and doesn't materialize the full output as one string in memory, which can cause OOMs.
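For illustration, a usage sketch of the proposed API (assuming the debug.toFile entry point discussed in this PR; the exact signature may differ):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-dump").master("local[*]").getOrCreate()
val df = spark.range(0, 1000).selectExpr("id % 10 AS key").groupBy("key").count()
// Streams the parsed/analyzed/optimized/physical plans plus generated code to a
// file instead of materializing them as one huge string in memory.
df.queryExecution.debug.toFile("/tmp/spark-query-debug.txt")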

How was this patch tested?

Added a test which checks that the new method dumps correct info about a query.

@SparkQA commented Sep 15, 2018

Test build #96093 has finished for PR 22429 at commit 2ee75bc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk changed the title [SPARK-25440][SQL] Dump query execution info to a file → [SPARK-25440][SQL] Dumping query execution info to a file on Sep 15, 2018
@MaxGekk (Member, Author) commented Sep 15, 2018

@rednaxelafx Please take a look at the PR.

@SparkQA commented Sep 16, 2018

Test build #96097 has finished for PR 22429 at commit 9b2a3e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 16, 2018

Test build #96109 has finished for PR 22429 at commit ce2c086.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx (Contributor) left a comment

Thank you so much for the proposal, Max!

While I really, really like the feature you're proposing (yes, please get it in!), as it's really useful for debugging / diagnosis, there are a few things in the current implementation that might be deal breakers for me:

  1. Similar to what @hvanhovell mentioned, if we want to have a clean feature at the end, it would be better to pass the max-fields argument down the call chain instead of changing the Spark conf.
    Setting a Spark conf is a process-wide global change that affects everybody, including code running in other threads. A save-use-restore sequence would only work in a single-threaded scenario and would be problematic in a multi-threaded one.

It would definitely be a lot more code change to pass that argument down, but I think it's worth it (the existing code before this change is too rigid to begin with anyway). See the sketch of the contrast below.
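For illustration, here is a minimal sketch of the contrast (the conf key and helper names are assumptions, not the PR's code):

import org.apache.spark.sql.SparkSession

// Save-use-restore on a shared conf: the change is visible to every thread
// while the body runs, so it is only safe in a single-threaded scenario.
def withMaxToStringFields[T](spark: SparkSession, n: Int)(body: => T): T = {
  val key = "spark.debug.maxToStringFields"  // assumed key, for illustration
  val saved = spark.conf.getOption(key)
  spark.conf.set(key, n.toString)
  try body
  finally saved.fold(spark.conf.unset(key))(v => spark.conf.set(key, v))
}

// Passing the limit as an explicit argument keeps it local to one call chain:
def truncatedFields(fields: Seq[String], maxFields: Int): String =
  if (fields.length > maxFields) fields.take(maxFields).mkString("[", ", ", ", ...]")
  else fields.mkString("[", ", ", "]")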

  2. The use of java.io.StringWriter. I totally understand the motivation to use StringWriter for the regular path so that both the regular and file-dumping paths can share the same implementation code via the Writer interface.

But if you take a look at how StringWriter is implemented in OpenJDK8u: http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/2660b127b407/src/share/classes/java/io/StringWriter.java#l43

public class StringWriter extends Writer {

    private StringBuffer buf;

It's actually backed by a StringBuffer, which is worse than a StringBuilder in terms of the extra synchronization overhead implied.
(Yes, modern JVMs do try to optimize StringBuffers via various techniques, including biased locking, lock elimination, etc., but good performance is just less guaranteed.)
The best-performing way to do this is probably to implement a custom Writer that uses some sort of "rope" implementation as the backing store, so that premature string concatenation is kept to a minimum. Whether or not we want to go that way can be the subject of further discussion, though.
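As a concrete starting point, a minimal StringBuilder-backed Writer (a sketch of the cheaper drop-in option, not the rope design suggested above) could look like this:

import java.io.Writer

// Backed by an unsynchronized StringBuilder instead of StringWriter's StringBuffer.
class StringBuilderWriter(initialCapacity: Int = 256) extends Writer {
  private val builder = new java.lang.StringBuilder(initialCapacity)
  override def write(cbuf: Array[Char], off: Int, len: Int): Unit =
    builder.append(cbuf, off, len)
  override def write(str: String): Unit = builder.append(str)  // skips a char[] copy
  override def flush(): Unit = ()   // nothing buffered outside the builder
  override def close(): Unit = ()   // no underlying resource to release
  override def toString: String = builder.toString
}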

@SparkQA commented Sep 18, 2018

Test build #96154 has finished for PR 22429 at commit 71ff7d1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 1, 2018

May I ask you, @hvanhovell @zsxwing, to review the PR one more time?

@MaxGekk (Member, Author) commented Nov 2, 2018

@gatorsmile @HyukjinKwon @viirya @rednaxelafx Are you ok with the proposed changes, or is there something which blocks the PR for now?

@hiboyang commented Nov 2, 2018

> @MaxGekk I sent an email to the Spark dev list about structured plan logging, but did not get any response.
>
> @boy-uber I guess it is better to speak about the feature to @bogdanrdc @hvanhovell @larturus

Thanks @MaxGekk for the contact list! I will ping them to gather more thoughts.

@MaxGekk (Member, Author) commented Nov 5, 2018

jenkins, retest this, please

@SparkQA commented Nov 5, 2018

Test build #98461 has finished for PR 22429 at commit 76f4248.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 5, 2018

Test build #98465 has finished for PR 22429 at commit f7de26d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 5, 2018

Test build #98475 has finished for PR 22429 at commit bda6ac2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 9, 2018

@hvanhovell Could you look at the PR, please?

@MaxGekk (Member, Author) commented Nov 11, 2018

@cloud-fan @gatorsmile May I ask you to look at the PR? It has been stuck for a while for unclear reasons, but I believe the proposed method toFile could be pretty useful in troubleshooting different issues, and I just don't want to close it so easily.

@cloud-fan (Contributor) commented

This is hard to review. Do you mean we should add maxFields: Option[Int] to all the string-related methods?

@MaxGekk (Member, Author) commented Nov 12, 2018

@cloud-fan

> This is hard to review. Do you mean we should add maxFields: Option[Int] to all the string-related methods?

Not to all, but only to the methods involved in producing textual representations of different plans. Those are simpleString and verboseString in children of TreeNode, plus simpleString of StructType. The changes are trivial - I just made them by fixing compilation errors. Maybe I missed something, but only because it wasn't touched by the compiler.

The main changes are concentrated in QueryExecution.scala, where everything begins from the toFile method. Other changes are consequences of that. The main idea is to write to an OutputStream instead of materializing strings in memory, which should avoid the limits on string size and the OOMs we sometimes see in corner cases.
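Roughly, the streaming shape is like this (a simplified sketch with assumed helper names; the real changes live in QueryExecution.scala):

import java.io.{BufferedWriter, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Each section (parsed/analyzed/optimized plan, physical plan, codegen) is
// written as soon as it is produced, so the full dump never has to exist as
// a single in-memory string.
def dumpSections(path: String, sections: Iterator[String]): Unit = {
  val out = new BufferedWriter(new OutputStreamWriter(
    Files.newOutputStream(Paths.get(path)), StandardCharsets.UTF_8))
  try sections.foreach { s => out.write(s); out.newLine() }
  finally out.close()
}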

@HyukjinKwon (Member) commented Nov 12, 2018

@MaxGekk, I think the implementation looks complicated here: it introduces a None concept to indicate no limit, which makes this PR hard to read. I think it's okay to expose that number to toFile. This is an internal API, right? People could just set whatever number they want, I guess.
Fix me if I misread. If we get rid of it, I think the PR size will be drastically smaller.

-override def simpleString: String = {
+override def simpleString(maxFields: Option[Int]): String = {
   val orderByString = Utils.truncatedString(sortOrder, "[", ",", "]")
   val outputString = Utils.truncatedString(output, "[", ",", "]")

Can we just get rid of the maxFields? I think this makes the PR hard to read.

@HyukjinKwon (Member) commented

I took a super quick pass - the change actually looks quite okay in general to me.

@MaxGekk (Member, Author) commented Nov 12, 2018

@HyukjinKwon @cloud-fan Thank you for looking at the PR.

So, if I split the PR into 2 PRs:

  1. Writing truncated plans to a file
  2. Controlling the number of fields in truncated strings

it would be better for review, right?

> I think it's okay to expose that number to toFile. I think this is an internal API, right?

Yes, it is.

> People could just set whatever number they want, I guess. Fix me if I misread.

I just had a concern that there are many more places (~10) besides the toFile method where I have to pass some value for maxFields. In all such places I would have to read a SQL config.

@HyukjinKwon (Member) commented

@boy-uber, for structured streaming, let's do it out of this PR. I think the actual change of this PR can be small (1.). We can change this API for structured streaming later if needed since this is just an internal private API. Change (2.) can be shared I guess even if we go for structured streaming stuff.

@hiboyang commented

> @boy-uber, for structured streaming, let's do it out of this PR. I think the actual change of this PR can be small (1.). We can change this API for structured streaming later if needed since this is just an internal private API. Change (2.) can be shared I guess even if we go for structured streaming stuff.

Hi @HyukjinKwon, thanks for the comment! My original suggestion was not related to structured streaming. It was to support structured logging for the Spark plan, for example, getting the Spark plan as a JSON blob and sending it to a Spark Listener. This would make it easy for Spark users to write a Spark Listener plugin that gets the plan and processes the JSON in their code. Then people could parse and analyze a large number of Spark applications based on their plans. One use case for this is using machine learning to predict resource usage based on the Spark plans. This may not be related to this PR. If you or other people are interested, we could talk offline.
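As a rough sketch of that idea (not part of this PR; it relies only on Spark's existing QueryExecutionListener and TreeNode.toJSON, and the sink is a placeholder):

import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Captures the analyzed plan as JSON after each successful action, so an
// external pipeline can collect plans across many applications.
class PlanJsonListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    val planJson = qe.analyzed.toJSON
    println(planJson)  // placeholder sink; ship to your logging system instead
  }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

// Registered via: spark.listenerManager.register(new PlanJsonListener)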

@HyukjinKwon (Member) commented

Oops, I rushed to read. Yeah, it still sounds related but orthogonal. Let's move it to the mailing list. That should be the best place to discuss further.

@MaxGekk (Member, Author) commented Nov 14, 2018

I am closing the PR since a part of it has already been merged in #23018, and the rest is coming soon.

@MaxGekk (Member, Author) commented Nov 27, 2018

Here is a PR that introduces the maxFields parameter to all functions involved in the creation of truncated strings of Spark plans: #23159

@MaxGekk deleted the plan-to-file branch on August 17, 2019
dvallejo pushed a commit to Telefonica/spark that referenced this pull request Aug 31, 2021
The PR puts a limit on the size of a debug string generated for a tree node. This helps to fix out-of-memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. An alternative solution was proposed in apache#23076, but that solution doesn't address all the cases that can cause a large query. This limit applies only to calls to treeString that don't pass a Writer, which makes it play nicely with apache#22429, apache#23018 and apache#23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI.

- A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans.
  • When plans are truncated, "..." is printed to indicate that it isn't a full plan.
- A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.

Unit tests were created for the new SizeLimitedWriter. Also, a unit test for TreeNode was created that checks that a long plan is correctly truncated.
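A rough sketch of what a size-limited Writer can look like (an illustration under assumed behavior; the actual SizeLimitedWriter in apache#23169 may handle the limit differently):

import java.io.Writer

// Forwards characters to an underlying Writer until a budget is exhausted,
// then appends "..." once and drops the rest.
class TruncatingWriter(underlying: Writer, maxChars: Long) extends Writer {
  private var written = 0L
  private var truncated = false
  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
    val toWrite = (maxChars - written).max(0L).min(len.toLong).toInt
    if (toWrite > 0) {
      underlying.write(cbuf, off, toWrite)
      written += toWrite
    }
    if (toWrite < len && !truncated) {
      underlying.write("...")
      truncated = true
    }
  }
  override def flush(): Unit = underlying.flush()
  override def close(): Unit = underlying.close()
}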

Closes apache#23169 from DaveDeCaprio/text-plan-size.

Lead-authored-by: Dave DeCaprio <[email protected]>
Co-authored-by: David DeCaprio <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>