
Conversation

@DaveDeCaprio
Contributor

Nested query plans can get extremely large (hundreds of megabytes).

What changes were proposed in this pull request?

The PR puts a limit on the nesting depth of trees that are printed when writing a plan string.

  • The default limit is 15, which allows for reasonably nested plans.
  • A new configuration parameter called spark.debug.maxToStringTreeDepth was added to control the depth.
  • When plans are truncated, "..." is printed to indicate that tree elements were removed (see the sketch after this list).
  • A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.
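
For illustration, here is a minimal, self-contained sketch of the truncation behavior described above. The Node type and treeString helper are hypothetical stand-ins, not Spark's actual TreeNode API; the default of 15 matches the proposed limit.

case class Node(name: String, children: Seq[Node] = Nil)

// Render the tree recursively, cutting off at maxDepth and emitting "..."
// in place of any pruned subtree.
def treeString(node: Node, depth: Int = 0, maxDepth: Int = 15): String = {
  val indent = "  " * depth
  if (depth >= maxDepth) {
    s"$indent...\n"
  } else {
    val childLines = node.children.map(treeString(_, depth + 1, maxDepth)).mkString
    s"$indent${node.name}\n$childLines"
  }
}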

How was this patch tested?

A new unit test in QueryExecutionSuite creates a highly nested plan and ensures that the printed plan string is not too long.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@AmplabJenkins

Can one of the admins verify this patch?

@DaveDeCaprio
Contributor Author

This contribution is my original work and I license the work to the project under the project’s open source license.


// scalastyle:off
- abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
+ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Logging {
Contributor Author

I'm not sure TreeNode should have the baggage of its own logger. The catalyst.trees package object extends Logging, and there is a comment there saying that tree nodes should just use that logger, but I don't see a way to do that since the logging methods are protected.

I could add a log function to the package object and then use that from TreeNode:

private[trees] def logWarning 
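
For illustration, a fuller sketch of that idea (the forwarder name logTreeWarning is hypothetical; per the comment above, the catalyst trees package object already extends Logging):

package org.apache.spark.sql.catalyst

import org.apache.spark.internal.Logging

package object trees extends Logging {
  // Expose the package object's protected logger through a package-private
  // forwarder that TreeNode could call without mixing in Logging itself.
  private[trees] def logTreeWarning(msg: => String): Unit = logWarning(msg)
}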


/** Whether we have warned about plan string truncation yet. */
private val treeDepthWarningPrinted = new AtomicBoolean(false)
}
Contributor Author

I wasn't sure where to put this code, so I made a TreeNode companion object. If there is a better place, let me know.
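
For context, here is roughly how such a flag gates a one-time warning. The method and the message wording are illustrative, assuming TreeNode mixes in Logging as in the diff above; the config name is taken from the PR description:

// Sketch: emit the truncation warning at most once per JVM.
// compareAndSet returns true only for the first caller to flip the flag.
private def warnAboutTruncation(): Unit = {
  if (TreeNode.treeDepthWarningPrinted.compareAndSet(false, true)) {
    logWarning("Truncated the plan string because it exceeded the maximum " +
      "tree depth; increase spark.debug.maxToStringTreeDepth to print deeper trees.")
  }
}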

// New parameters added to the generateTreeString signature:
    addSuffix: Boolean = false,
    maxDepth: Int = TreeNode.maxTreeToStringDepth): Unit = {

// The call site, before and after:
- child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false)
+ child.generateTreeString(depth, lastChildren, writer, verbose, prefix, addSuffix, maxDepth)
Contributor Author

prefix and addSuffix are ignored in the original code, but I can't tell why. It seems like they should be passed through from parent to child here. I adjusted this, but it might not be a bug, or the fix might belong in a separate PR.

Suggested change:
- child.generateTreeString(depth, lastChildren, writer, verbose, prefix, addSuffix, maxDepth)
+ child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false, maxDepth)

@DaveDeCaprio
Contributor Author

@MaxGekk and @hvanhovell, I think this PR is complementary to the PR for SPARK-26023, which recently allowed dumping query plans to a file.

It might make sense to have the default depth apply only to the toString function and not to writing to a file. We could add a parameter to toFile to control the depth, defaulting to Int.MaxValue, while toString would use the configured parameter.
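
A hedged sketch of that split; QueryExecutionLike, treeString, and maxToStringTreeDepth are illustrative stand-ins, not Spark's actual API:

trait QueryExecutionLike {
  def treeString(maxDepth: Int): String
  def maxToStringTreeDepth: Int // read from spark.debug.maxToStringTreeDepth

  // Writing to a file keeps the full plan unless a depth is passed explicitly.
  def toFile(path: String, maxDepth: Int = Int.MaxValue): Unit = {
    java.nio.file.Files.write(
      java.nio.file.Paths.get(path),
      treeString(maxDepth).getBytes(java.nio.charset.StandardCharsets.UTF_8))
  }

  // toString uses the configured depth limit.
  override def toString: String = treeString(maxToStringTreeDepth)
}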

@MaxGekk
Member

MaxGekk commented Nov 18, 2018

@DaveDeCaprio What is the practical use case for limiting the plans? My goal in the PRs #22429, #23018, and #23039 was to dump whole plans and gather more info for debugging. Your PR seems to have the opposite purpose.

@DaveDeCaprio
Contributor Author

The goal of this is to avoid creating massive strings when toString is called on a QueryExecution. In our use case, we have some large plans that generate plan strings in the hundreds of megabytes. These plans execute efficiently because of caching and partitioning, but we are running into out-of-memory errors because of QueryExecution.toString.

In our case, writing to a file would help with the memory issue, but we'd still be writing 200 MB+ files for each query. (Note that these queries take only seconds to run because 99% of the work is cached, and we run many of them.)

@DaveDeCaprio
Contributor Author

I originally wrote this against 2.3.0, where toString was always called as part of newExecutionId. I'm not sure if that still happens in master.

@DaveDeCaprio
Contributor Author

I also think it makes sense to limit the plan depth in any case where we use truncatedString to limit the output columns.

@HeartSaVioR
Contributor

I'm seeing both sides of the need: while I think dumping the full plan to a file is a good feature for debugging specific issues, retaining full plans to render on the UI page has been a headache, and three related issues (SPARK-23904, SPARK-25380, SPARK-26103) have been filed in three months, which doesn't look like something we can ask end users to work around.

One thing to be aware of is that a huge plan is generated not only by deeply nested joins but also by lots of columns, as in SPARK-23904. For SPARK-25380 we don't know which parts generate the huge plan. So it might be easier and more flexible to just truncate to a specific size rather than applying conditions.

@DaveDeCaprio
Contributor Author

DaveDeCaprio commented Nov 19, 2018 via email

@MaxGekk
Member

MaxGekk commented Nov 19, 2018

> I assume we still want a config parameter for maximum size. Any suggestions on name: spark.debug.maxPlanStringSizeInMB?

I would propose making it more SQL-specific and putting the config in the spark.sql.debug namespace, especially in light of moving truncatedString to sql/catalyst (see #23039).

@DaveDeCaprio
Contributor Author

Closing this pull request in favor of #23169

dvallejo pushed a commit to Telefonica/spark that referenced this pull request Aug 31, 2021
The PR puts a limit on the size of a debug string generated for a tree node, which helps to fix out-of-memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. An alternative solution was proposed in apache#23076, but that solution doesn't address all the cases that can cause a large query. This limit applies only to calls to treeString that don't pass a Writer, which makes it play nicely with apache#22429, apache#23018, and apache#23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI.

- A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans.
- When plans are truncated, "..." is printed to indicate that it isn't a full plan.
- A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.

Unit tests were created for the new SizeLimitedWriter, along with a unit test for TreeNode that checks that a long plan is correctly truncated.

Closes apache#23169 from DaveDeCaprio/text-plan-size.

Lead-authored-by: Dave DeCaprio <[email protected]>
Co-authored-by: David DeCaprio <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
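
For orientation, a minimal sketch of what a size-limited writer in the spirit of the SizeLimitedWriter mentioned above might look like. This is an assumption-laden illustration, not the actual Spark implementation, which may differ (for example, in how it signals that the limit was reached):

import java.io.Writer

class SizeLimitedWriter(underlying: Writer, maxChars: Long) extends Writer {
  private var written = 0L

  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
    // Forward only as many characters as the remaining budget allows;
    // anything past the limit is dropped.
    val remaining = math.min(len.toLong, maxChars - written).toInt
    if (remaining > 0) {
      underlying.write(cbuf, off, remaining)
      written += remaining
    }
  }

  override def flush(): Unit = underlying.flush()
  override def close(): Unit = underlying.close()
}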