[SPARK-24983][SQL] limit number of leaf expressions in a single project when collapse project to prevent driver oom #29094
@@ -121,6 +121,17 @@ class CollapseProjectSuite extends PlanTest {
    comparePlans(optimized, correctAnswer)
  }

| test("do not collapse project if number of leave expressions would be too big") { | ||||||||||||||||||||
| var query: LogicalPlan = testRelation | ||||||||||||||||||||
| for( _ <- 1 to 10) { | ||||||||||||||||||||
| // after n iterations the number of leaf expressions will be 2^{n+1} | ||||||||||||||||||||
| // => after 10 iterations we would end up with more than 1000 leaf expressions | ||||||||||||||||||||
| query = query.select(('a + 'b).as('a), ('a - 'b).as('b)) | ||||||||||||||||||||
|
Member
The same issue can happen in spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala, lines 127 to 135 (at commit 4da93b0).
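For illustration, a minimal sketch of the alias inlining that drives the blow-up in both the optimizer rule and the planning pattern referenced above; this is a simplification under assumed names, not the actual patterns.scala code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

// Simplified sketch: inline aliased expressions from a child project into a
// parent expression. Every reference to an alias copies the whole aliased
// subtree, so a chain of projects that each reference every alias twice
// grows to O(2^n) leaf expressions after n levels of inlining.
def substitute(aliases: Map[Attribute, Expression])(expr: Expression): Expression =
  expr.transform {
    case a: Attribute => aliases.getOrElse(a, a)
  }
```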
Author
Thanks for the reminder. My test case is:

var query = spark.range(5).withColumn("new_column", 'id + 5 as "plus5").toDF("a", "b")

It works for both the Optimized Logical Plan and the Physical Plan. I notice the difference is that my data type is bigint (org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]); in that case it seems the project will not collapse. I tested the case above and the problem does exist for the Physical Plan, so should we also add a check for that?
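A rough, self-contained version of that reproduction, assuming a local SparkSession (the loop count and the explain call are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("collapse-repro").getOrCreate()
import spark.implicits._

// Each select references both columns twice, so once adjacent projects are
// collapsed the single project list roughly doubles in size per iteration.
var query = spark.range(5).withColumn("new_column", 'id + 5 as "plus5").toDF("a", "b")
for (_ <- 1 to 10) {
  query = query.select((col("a") + col("b")).as("a"), (col("a") - col("b")).as("b"))
}
// Forcing plan generation; with a large enough iteration count the collapsed
// project list can exhaust driver memory.
query.explain(true)
```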
    }
    val projects = Optimize.execute(query.analyze).collect { case p: Project => p }
    assert(projects.size === 2) // should be collapsed to two projects
  }

| test("preserve top-level alias metadata while collapsing projects") { | ||||||||||||||||||||
| def hasMetadata(logicalPlan: LogicalPlan): Boolean = { | ||||||||||||||||||||
| logicalPlan.asInstanceOf[Project].projectList.exists(_.metadata.contains("key")) | ||||||||||||||||||||
|
|
||||||||||||||||||||
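To make the new test above concrete, here is a minimal sketch of the kind of guard it exercises; the object and method names, the counting logic, and the threshold of 1000 are assumptions, not the exact implementation in CollapseProject:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, LeafExpression, NamedExpression}

object CollapseProjectGuard {
  // Hypothetical hard limit; the review discussion suggests a SQLConf instead.
  private val MaxLeafExpressions = 1000

  // Count leaf expressions (attribute references, literals, ...) in a tree.
  private def numLeaves(e: Expression): Int = e match {
    case _: LeafExpression => 1
    case other => other.children.map(numLeaves).sum
  }

  // Collapse only if the project list produced by substituting the lower
  // project into the upper one stays below the limit.
  def canCollapse(substitutedProjectList: Seq[NamedExpression]): Boolean =
    substitutedProjectList.map(numLeaves).sum <= MaxLeafExpressions
}
```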
Seems like you can simplify it like this?
Sure, I can simplify the condition check logic. Do you suggest adding a new SQLConf instead of the hard limit? For the case statement there is already a condition check called 'haveCommonNonDeterministicOutput', so I put them together. The same applies to 'case p @ Project(_, agg: Aggregate)'.
Yea, if we add this logic, I think we need a conf for that.
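A sketch of what such a conf could look like, following the existing buildConf pattern in org.apache.spark.sql.internal.SQLConf; the key name and default value here are assumptions, not what the PR ultimately uses:

```scala
// Hypothetical entry inside SQLConf; key name and default are illustrative.
val MAX_COLLAPSE_PROJECT_LEAF_EXPRESSIONS =
  buildConf("spark.sql.optimizer.maxCollapseProjectLeafExpressions")
    .internal()
    .doc("Maximum number of leaf expressions allowed in a single Project produced " +
      "by collapsing adjacent Projects. Beyond this threshold the Projects are " +
      "kept separate to avoid exploding the plan and OOMing the driver.")
    .intConf
    .createWithDefault(1000)
```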