
Conversation

@DaveDeCaprio
Contributor

Nested query plans can get extremely large (hundreds of megabytes).

What changes were proposed in this pull request?

The PR puts a limit on the nesting depth of trees that are printed when writing a plan string.

  • The default limit is 15, which allows for reasonably nested plans.
  • A new configuration parameter called spark.debug.maxToStringTreeDepth was added to control the depth.
  • When plans are truncated, "..." is printed to indicate that tree elements were removed (see the sketch after this list).
  • A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.
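
For illustration, here is a minimal, self-contained sketch of the truncation behavior described above. The Node type and treeString helper are hypothetical stand-ins, not Spark's actual TreeNode API; the default of 15 matches the proposed limit.

case class Node(name: String, children: Seq[Node] = Nil)

// Render the tree recursively, cutting off at maxDepth and emitting "..."
// in place of any pruned subtree.
def treeString(node: Node, depth: Int = 0, maxDepth: Int = 15): String = {
  val indent = "  " * depth
  if (depth >= maxDepth) {
    s"$indent...\n"
  } else {
    val childLines = node.children.map(treeString(_, depth + 1, maxDepth)).mkString
    s"$indent${node.name}\n$childLines"
  }
}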

How was this patch tested?

A new unit test in QueryExecutionSuite creates a highly nested plan and ensures that the printed plan string is not too long.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@AmplabJenkins

Can one of the admins verify this patch?

@DaveDeCaprio
Contributor Author

This contribution is my original work and I license the work to the project under the project’s open source license.


// scalastyle:off
- abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
+ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Logging {
Contributor Author

I'm not sure TreeNode should have the baggage of its own logger. The catalyst.trees package object extends Logging, and there is a comment there saying that tree nodes should just use that logger, but I don't see a way to do that since the logging methods are protected.

I could add a log function to the package object and then use that from TreeNode:

private[trees] def logWarning 
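
For illustration, a fuller sketch of that idea (the forwarder name logTreeWarning is hypothetical; per the comment above, the catalyst trees package object already extends Logging):

package org.apache.spark.sql.catalyst

import org.apache.spark.internal.Logging

package object trees extends Logging {
  // Expose the package object's protected logger through a package-private
  // forwarder that TreeNode could call without mixing in Logging itself.
  private[trees] def logTreeWarning(msg: => String): Unit = logWarning(msg)
}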


/** Whether we have warned about plan string truncation yet. */
private val treeDepthWarningPrinted = new AtomicBoolean(false)
}
Contributor Author

I wasn't sure where to put this code, so I made a TreeNode companion object. If there is a better place, let me know.
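
For context, here is roughly how such a flag gates a one-time warning. The method and the message wording are illustrative, assuming TreeNode mixes in Logging as in the diff above; the config name is taken from the PR description:

// Sketch: emit the truncation warning at most once per JVM.
// compareAndSet returns true only for the first caller to flip the flag.
private def warnAboutTruncation(): Unit = {
  if (TreeNode.treeDepthWarningPrinted.compareAndSet(false, true)) {
    logWarning("Truncated the plan string because it exceeded the maximum " +
      "tree depth; increase spark.debug.maxToStringTreeDepth to print deeper trees.")
  }
}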

// New parameters added to the generateTreeString signature:
    addSuffix: Boolean = false,
    maxDepth: Int = TreeNode.maxTreeToStringDepth): Unit = {

// The call site, before and after:
- child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false)
+ child.generateTreeString(depth, lastChildren, writer, verbose, prefix, addSuffix, maxDepth)
Contributor Author

prefix and addSuffix are ignored in the original code, but I can't tell why. It seems like they should be passed through from parent to child here. I adjusted this, but it might not be a bug, or the fix might belong in a separate PR.

Suggested change:
- child.generateTreeString(depth, lastChildren, writer, verbose, prefix, addSuffix, maxDepth)
+ child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false, maxDepth)

@DaveDeCaprio
Contributor Author

@MaxGekk and @hvanhovell, I think this PR is complementary to the PR for SPARK-26023, which recently allowed dumping query plans to a file.

It might make sense to have the default depth apply only to the toString function and not to writing to a file. We could add a parameter to toFile to control the depth, defaulting to Int.MaxValue, while toString would use the configured parameter.
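
A hedged sketch of that split; QueryExecutionLike, treeString, and maxToStringTreeDepth are illustrative stand-ins, not Spark's actual API:

trait QueryExecutionLike {
  def treeString(maxDepth: Int): String
  def maxToStringTreeDepth: Int // read from spark.debug.maxToStringTreeDepth

  // Writing to a file keeps the full plan unless a depth is passed explicitly.
  def toFile(path: String, maxDepth: Int = Int.MaxValue): Unit = {
    java.nio.file.Files.write(
      java.nio.file.Paths.get(path),
      treeString(maxDepth).getBytes(java.nio.charset.StandardCharsets.UTF_8))
  }

  // toString uses the configured depth limit.
  override def toString: String = treeString(maxToStringTreeDepth)
}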

@MaxGekk
Member

MaxGekk commented Nov 18, 2018

@DaveDeCaprio What is the practical use case for limiting the plans? My goal in the PRs #22429, #23018, and #23039 was to dump whole plans and gather more info for debugging. Your PR seems to have the opposite purpose.

@DaveDeCaprio
Contributor Author

The goal of this is to avoid creating massive strings when toString is called on a QueryExecution. In our use case, we have some large plans that generate plan strings in the hundreds of megabytes. These plans execute efficiently because of caching and partitioning, but we are running into out-of-memory errors because of QueryExecution.toString.

In our case, writing to a file would help with the memory issue, but we'd still be writing 200 MB+ files for each query. (Note that these queries take only seconds to run because 99% of the work is cached, and we run many of them.)

@DaveDeCaprio
Contributor Author

I originally wrote this against 2.3.0, where toString was always called as part of newExecutionId. I'm not sure if that still happens in master.

@DaveDeCaprio
Contributor Author

I also think it makes sense to limit the plan depth in any case where we use truncatedString to limit the output columns.

@HeartSaVioR
Contributor

I'm seeing both sides of the need: while I think dumping the full plan to a file is a good feature for debugging specific issues, retaining full plans to render on the UI page has been a headache, and three related issues (SPARK-23904, SPARK-25380, SPARK-26103) have been filed in three months, which doesn't look like something we can ask end users to work around.

One thing to be aware of is that a huge plan is generated not only by deeply nested joins but also by lots of columns, as in SPARK-23904. For SPARK-25380 we don't know which parts generate the huge plan. So it might be easier and more flexible to just truncate to a specific size rather than applying conditions.

@DaveDeCaprio
Contributor Author

DaveDeCaprio commented Nov 19, 2018 via email

@MaxGekk
Member

MaxGekk commented Nov 19, 2018

> I assume we still want a config parameter for maximum size. Any suggestions on name: spark.debug.maxPlanStringSizeInMB?

I would propose making it more SQL-specific and putting the config in the spark.sql.debug namespace, especially in light of moving truncatedString to sql/catalyst (see #23039).

@DaveDeCaprio
Contributor Author

Closing this pull request in favor of #23169

dvallejo pushed a commit to Telefonica/spark that referenced this pull request Aug 31, 2021
The PR puts a limit on the size of a debug string generated for a tree node, which helps to fix out-of-memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. An alternative solution was proposed in apache#23076, but that solution doesn't address all the cases that can cause a large query. This limit applies only to calls to treeString that don't pass a Writer, which makes it play nicely with apache#22429, apache#23018, and apache#23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI.

- A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans.
- When plans are truncated, "..." is printed to indicate that it isn't a full plan.
- A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.

Unit tests were created for the new SizeLimitedWriter, along with a unit test for TreeNode that checks that a long plan is correctly truncated.

Closes apache#23169 from DaveDeCaprio/text-plan-size.

Lead-authored-by: Dave DeCaprio <[email protected]>
Co-authored-by: David DeCaprio <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
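
For orientation, a minimal sketch of what a size-limited writer in the spirit of the SizeLimitedWriter mentioned above might look like. This is an assumption-laden illustration, not the actual Spark implementation, which may differ (for example, in how it signals that the limit was reached):

import java.io.Writer

class SizeLimitedWriter(underlying: Writer, maxChars: Long) extends Writer {
  private var written = 0L

  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
    // Forward only as many characters as the remaining budget allows;
    // anything past the limit is dropped.
    val remaining = math.min(len.toLong, maxChars - written).toInt
    if (remaining > 0) {
      underlying.write(cbuf, off, remaining)
      written += remaining
    }
  }

  override def flush(): Unit = underlying.flush()
  override def close(): Unit = underlying.close()
}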