[SPARK-26103][SQL] Added maxDepth to limit the length of text plans #23076
Conversation
Can one of the admins verify this patch?

This contribution is my original work and I license the work to the project under the project's open source license.
```diff
 // scalastyle:off
-abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
+abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Logging {
```
Not sure TreeNode should have the baggage of its own logger. The catalyst.trees package object extends Logging, and there is a comment that tree nodes should just use that logger, but I don't see a way to do that since the logging methods are protected. I could add a log function to the package object and then use that from TreeNode:

```scala
private[trees] def logWarning
```
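A minimal sketch of that idea, assuming a package-private forwarder (the name `logTreeWarning` is hypothetical; this is not code from the PR):

```scala
package org.apache.spark.sql.catalyst

import org.apache.spark.internal.Logging

// The package object already extends Logging; a package-private forwarder
// lets TreeNode reuse that shared logger without extending Logging itself.
package object trees extends Logging {
  // Hypothetical forwarder: Logging.logWarning is protected, so re-expose
  // it to classes inside catalyst.trees only.
  private[trees] def logTreeWarning(msg: String): Unit = logWarning(msg)
}
```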
```scala
  /** Whether we have warned about plan string truncation yet. */
  private val treeDepthWarningPrinted = new AtomicBoolean(false)
}
```
I wasn't sure where to put this code, so I made a TreeNode companion object. If there is a better place, let me know.
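For context, a sketch of that companion object (the field names come from the diff above; the default depth value is only a placeholder, since the real limit would come from configuration):

```scala
import java.util.concurrent.atomic.AtomicBoolean

object TreeNode {
  // Placeholder default; the PR wires this to a configurable limit.
  val maxTreeToStringDepth: Int = 1000

  /** Whether we have warned about plan string truncation yet. */
  private val treeDepthWarningPrinted = new AtomicBoolean(false)
}
```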
```diff
       prefix: String = "",
-      addSuffix: Boolean = false): Unit = {
+      addSuffix: Boolean = false,
+      maxDepth: Int = TreeNode.maxTreeToStringDepth): Unit = {
 ...
-    child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false)
+    child.generateTreeString(depth, lastChildren, writer, verbose, prefix, addSuffix, maxDepth)
```
Prefix and addSuffix are ignored in the original code, but I can't tell why. It seems like they should be passed through from parent to child here. I adjusted this, but it might not be a bug, or maybe the change should go in a separate PR.
Suggested change:

```diff
-child.generateTreeString(depth, lastChildren, writer, verbose, prefix, addSuffix, maxDepth)
+child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false, maxDepth)
```
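To illustrate the technique under discussion, here is a self-contained toy version of depth-limited tree printing (the `Node` type and the demo are illustrative, not Spark's actual TreeNode code):

```scala
import java.io.{StringWriter, Writer}

// Toy tree type; Spark's TreeNode works analogously with more machinery.
case class Node(label: String, children: Seq[Node] = Nil) {
  def generateTreeString(depth: Int, writer: Writer, maxDepth: Int): Unit = {
    if (depth >= maxDepth) {
      // Stop descending and mark the truncation, as the PR does with plans.
      writer.write("  " * depth + "...\n")
    } else {
      writer.write("  " * depth + label + "\n")
      children.foreach(_.generateTreeString(depth + 1, writer, maxDepth))
    }
  }
}

object DepthLimitDemo extends App {
  // Build a 100-level nested chain, then print only the top 3 levels.
  val deep = (1 to 100).foldLeft(Node("leaf")) { (child, i) => Node(s"n$i", Seq(child)) }
  val out = new StringWriter()
  deep.generateTreeString(depth = 0, out, maxDepth = 3)
  print(out) // prints n100, n99, n98, then "..."
}
```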
@MaxGekk and @hvanhovell, I think this PR is complementary to the PR for SPARK-26023, which recently allowed dumping queries to a file. It might make sense to have the default depth apply only to the toString function and not to writing to a file. We could add a parameter to toFile to control the depth, but it would default to Int.MaxValue; toString would use the configured parameter.

@DaveDeCaprio What is the practical use case for limiting the plans? My initial goal in PRs #22429, #23018 and #23039 was to dump whole plans and gather more info for debugging. Your PR seems to have the opposite purpose.
The goal of this is to avoid creating massive strings when toString is called on a QueryExecution. In our use case we have some large plans that generate plan strings in the hundreds of megabytes. These plans execute efficiently because of caching and partitioning, but we are running into out-of-memory errors because of QueryExecution.toString. In our case, writing to a file would help the memory issue, but we'd still be writing 200MB+ files for each query. (Note that these queries take seconds to run because 99% of the work is cached, and we run a bunch of them.)

I originally wrote this against 2.3.0, where toString was always called as part of newExecutionId. I'm not sure if that still happens in master.
In any case where we use truncatedString to limit the output columns, I think it also makes sense to limit the plan depth.

I'm seeing both sides of the needs here: while I think dumping a full plan to a file is a good feature for debugging specific issues, retaining full plans to show on the UI page has been a headache, and three related issues (SPARK-23904, SPARK-25380, SPARK-26103) have been filed in three months, which doesn't look like something we can ask end users to work around. One thing to be aware of is that a huge plan is generated not only by nested joins but also by plans with lots of columns, as in SPARK-23904. For SPARK-25380 we don't know which parts generate the huge plan. So it might be easier and more flexible to just truncate to a specific size rather than applying conditions.
Limiting on pure size makes sense. I'd be willing to implement that in a different PR. Now that we are using a writer it should be pretty easy to do that. It was harder in the Spark 2.4 branch, which is why I took the approach I did.

A few questions about that:

* I assume we still want a config parameter for the maximum size. Any suggestions on a name: spark.debug.maxPlanStringSizeInMB?
* Is there a way to represent byte sizes in config?
* Should the limit be on each element of the plan (Logical Plan, Physical Plan, Optimized Logical Plan, etc.) or on the overall plan? I kind of like the idea of putting it on the treeString method itself, since then it gets applied wherever that method is called rather than only in queryExecution.toString.

Dave
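On the byte-size question: inside Spark's own source tree, the internal config DSL can already parse byte-size strings, so a sketch along these lines might work (the entry name is just the one floated above, and the exact builder calls are an assumption, not verified against this PR):

```scala
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// Assumed sketch: bytesConf parses values like "200m" into a Long in the
// given unit. This entry is hypothetical and only usable inside Spark
// itself, since ConfigBuilder is private[spark].
private val MAX_PLAN_STRING_SIZE = ConfigBuilder("spark.debug.maxPlanStringSizeInMB")
  .doc("Maximum size of the string generated for a query plan.")
  .bytesConf(ByteUnit.MiB)
  .createWithDefault(Long.MaxValue)
```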
I would propose to make it more SQL specific, and put the config to the …
Closing this pull request in favor of #23169
For reference, the commit message of #23169:

> The PR puts in a limit on the size of a debug string generated for a tree node. This helps fix out-of-memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. An alternative solution was proposed in apache#23076, but that solution doesn't address all the cases that can cause a large query. This limit applies only to calls to treeString that don't pass a Writer, which makes it play nicely with apache#22429, apache#23018 and apache#23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI.
>
> - A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans.
> - When plans are truncated, "..." is printed to indicate that it isn't a full plan.
> - A warning is printed the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.
>
> Unit tests were created for the new SizeLimitedWriter. Also a unit test for TreeNode was created that checks that a long plan is correctly truncated.
>
> Closes apache#23169 from DaveDeCaprio/text-plan-size.
>
> Lead-authored-by: Dave DeCaprio <[email protected]>
> Co-authored-by: David DeCaprio <[email protected]>
> Signed-off-by: Marcelo Vanzin <[email protected]>
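For illustration, a size-limited writer in the spirit of the SizeLimitedWriter mentioned above might look like this (a sketch only; the merged class may behave differently, e.g. by signaling when the limit is hit rather than silently dropping output):

```scala
import java.io.Writer

// Wraps another Writer and stops forwarding characters once a character
// budget is exhausted: size-based rather than depth-based truncation.
class SizeLimitedWriter(underlying: Writer, maxChars: Long) extends Writer {
  private var written = 0L

  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
    val remaining = math.max(0L, maxChars - written)
    val n = math.min(len.toLong, remaining).toInt
    if (n > 0) {
      underlying.write(cbuf, off, n)
      written += n
    }
  }

  override def flush(): Unit = underlying.flush()
  override def close(): Unit = underlying.close()
}
```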
What changes were proposed in this pull request?

Nested query plans can get extremely large (hundreds of megabytes). The PR puts in a limit on the nesting depth of trees to be printed when writing a plan string.

How was this patch tested?

A new unit test in QueryExecutionSuite creates a highly nested plan and then ensures that the printed plan is not too long.
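A sketch of the kind of test described, assuming a left-deep plan built by repeated unions and a SharedSparkSession-style test base (the PR's actual test may differ):

```scala
// Hypothetical test; the way the nested plan is built and the size
// threshold are illustrative.
test("deeply nested plans produce bounded plan strings") {
  // Chain ~100 unions so the logical plan nests deeply.
  val deep = (1 to 100).foldLeft(spark.range(1).toDF()) { (acc, _) =>
    acc.union(spark.range(1).toDF())
  }
  val planString = deep.queryExecution.toString
  // With the depth limit in place, the rendered plan stays small.
  assert(planString.length < 512 * 1024)
}
```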