[SPARK-26224][SQL] Advise the user when creating many projects on subsequent calls to withColumn #23285
Conversation
```diff
   }.map { case (colName, col) => col.as(colName).named }

-  select(replacedAndExistingColumns ++ newColumns : _*)
+  CollapseProject(Project(replacedAndExistingColumns ++ newColumns, logicalPlan))
```
Can we reduce the scope of this optimization? E.g. if the root node of this query is a Project, update its project list to include the withColumns columns; otherwise add a new Project.
I don't think we can do that. Imagine the case where all the columns depend on the previously added one: if we did that, we would end up with an invalid plan. Or am I missing something?
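An illustrative sketch of the case described above (column names are made up; a spark-shell session where `spark` is in scope is assumed): each new column references the one added just before it, so the stacked Projects cannot simply be merged into the root project list.

```scala
import org.apache.spark.sql.functions._

val df = spark.range(1).toDF("c0")
  .withColumn("c1", col("c0") + 1) // c1 is defined by the first Project
  .withColumn("c2", col("c1") + 1) // c2 references c1, a column produced by
                                   // the previous Project, not by the base plan
```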
Test build #99969 has finished for PR 23285 at commit
```diff
  * the same names.
  */
-private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = withPlan {
```
As stated on the JIRA ticket, the problem is a deep query plan. There are many ways to create such a deep query plan, not only withColumns; for example, you can call select many times to do that too. This change makes withColumns a special case.
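An illustrative sketch of the point above: chained select calls stack one Project node per call just like withColumn does, so the deep-plan problem is not specific to withColumns.

```scala
import org.apache.spark.sql.functions._

var df = spark.range(1).toDF("c0")
(1 to 100).foreach { i =>
  df = df.select(col("*"), lit(i).as(s"c$i")) // each select adds a Project node
}
```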
Yes, but I think this is a special case: I have seen many cases where withColumn is used in for loops, and with this change such a pattern would be better supported.
Test build #99975 has finished for PR 23285 at commit
retest this please
Test build #99981 has finished for PR 23285 at commit
The test failure shows a potential regression due to this patch that I hadn't thought of: collapsing the Projects may prevent the reuse of cached plans. Unfortunately, this cannot even be checked up front, because the plan can be cached later. @cloud-fan @viirya I don't have any idea how to address this potential regression, so if you don't have suggestions, I'll close this PR. What do you think? Thanks.
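A hedged sketch of the concern (an assumed scenario, not a reproduced test): cache lookups replace subtrees that match the cached plan, so if withColumn eagerly collapses Projects, a plan built on top of a cached DataFrame may no longer contain the cached plan as a subtree.

```scala
import org.apache.spark.sql.functions._

val df = spark.range(10).withColumn("a", lit(1))
df.cache()
// With eager collapsing, df2's plan becomes a single Project directly over
// range(10), so it may no longer match df's cached plan and the cached data
// could be skipped.
val df2 = df.withColumn("b", lit(2))
df2.count()
```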
IMHO, given that it's not easy to make chained
@HeartSaVioR I'd rather say that
@mgaido91 Yeah, agreed there's a workaround (select). Sure, no strong opinion, just my 2 cents.
@HeartSaVioR I am just telling you what my experience is: I remember that in one of my very first works with Spark I used exactly this pattern.

As an alternative, I'd propose here to check whether there are several Projects on top (we can define a threshold, e.g. 50) when calling withColumn, and warn the user in that case.
It sounds good to me if we can provide a warning message guiding users to replace them with select. IMHO, the real issue is basically that end users don't know that chaining withColumn harms query performance. Once we can guide them, it should be enough.
I have updated the PR with the WARN approach. I can make the threshold configurable if we agree on this. WDYT @cloud-fan @HeartSaVioR @viirya?
```scala
    sparkSession.sessionState.conf.caseSensitiveAnalysis)
  var numProjects = 0
  var currPlan = logicalPlan
  while (currPlan.isInstanceOf[Project] && numProjects < 50) {
```
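The fragment above is cut off at the loop header. A minimal sketch of how such a check could continue (an assumed continuation for illustration, not necessarily the PR's exact code; a `logWarning` helper is assumed to be in scope):

```scala
// Count consecutive Project nodes from the root, stopping early once the
// threshold is hit, then warn the user.
while (currPlan.isInstanceOf[Project] && numProjects < 50) {
  numProjects += 1
  currPlan = currPlan.asInstanceOf[Project].child
}
if (numProjects >= 50) {
  logWarning("The query plan contains many consecutive Project nodes. " +
    "This usually comes from chained withColumn calls; consider using a " +
    "single select with all the needed columns instead.")
}
```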
Yep. If we need to warn, +1 for adding a new configuration for this value instead of the hard-coded 50 here and at line 2164.
50 looks effective for detecting this pattern, but can we have a higher value that is more practically related to the warning message (performance degradation or OOM)?
Yes, I just wanted to be sure that we agree on the idea. Do you have hints/preferences for the name of the config?
I didn't want to introduce a high value, in order not to have a high perf impact from the loop that checks this. What do you think?
How about checking the count of consecutive projections and keeping/reducing that count? I can't imagine end users creating more than (say) 20 consecutive projections, unless they use withColumn/drop/etc. instead of select.
What do you mean, @HeartSaVioR? I don't think it is a good idea to add a counter to the Dataset class, which, moreover, would have to be carried over when creating a new Dataset, otherwise it would be useless. That'd be overkill for this IMO.
My bad. I just re-read the code (the while loop) and now see that this implementation already considers only consecutive projections. Sorry about the confusion.
np, thanks for your comment
Test build #100253 has finished for PR 23285 at commit
Test build #101176 has finished for PR 23285 at commit
HeartSaVioR left a comment
LGTM
| "disable completely the check.") | ||
| .intConf | ||
| .checkValue(_ >= 0, "The max number of projects cannot be negative.") | ||
| .createWithDefault(50) |
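For context, the fragment above is the tail end of a SQLConf entry. A sketch of what the full declaration could look like inside SQLConf (the config key and doc wording here are assumptions, not the PR's exact values):

```scala
// Assumed shape, for illustration: buildConf is SQLConf's standard entry
// builder; per the checkValue above, a value of 0 disables the check.
val MAX_CONSECUTIVE_PROJECTS = buildConf("spark.sql.maxConsecutiveProjects")
  .doc("When greater than 0, warn if a query plan contains more than this " +
    "many consecutive Project nodes. Set it to 0 to " +
    "disable completely the check.")
  .intConf
  .checkValue(_ >= 0, "The max number of projects cannot be negative.")
  .createWithDefault(50)
```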
50? Before we do anything, could you first show the perf numbers?
Sure, I used this code for running some tests on my local machine:

```scala
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + " ns")
  result
}

def withNCol(nCol: Int): org.apache.spark.sql.DataFrame = {
  var dfOut = spark.range(1).toDF
  (1 to nCol).foreach { i => dfOut = dfOut.withColumn(s"c$i", lit(i)) }
  dfOut
}
```
```
time { withNCol(10).queryExecution.sparkPlan }
Elapsed time: 81422845 ns
time { withNCol(50).queryExecution.sparkPlan }
Elapsed time: 252820355 ns
time { withNCol(100).queryExecution.sparkPlan }
Elapsed time: 568628677 ns
time { withNCol(200).queryExecution.sparkPlan }
Elapsed time: 1150096346 ns
time { withNCol(500).queryExecution.sparkPlan }
Elapsed time: 8255914278 ns
time { withNCol(1000).queryExecution.sparkPlan }
Elapsed time: 33032475637 ns
time { withNCol(2000).queryExecution.sparkPlan }
Elapsed time: 254183356160 ns
```
Maybe a reasonable number is 200?
How about we simply leave a note in the documentation? Changes such as introducing a configuration, benchmarking to find out the most appropriate number, or finding a general fix for the root cause look quite overkill, considering the goal is just to let users know about the limitation and the workaround.
@HyukjinKwon I think a warning is more effective than just a note in the doc. Since this is a very bad pattern and quite a widespread one, I think we should do as much as possible to prevent users from falling into it.
I agree that this is a bad pattern and a rather common mistake that users make from time to time, and a warning can be more effective. However, I was wondering if it's worth making the current change, considering that it's more complicated and adds some overhead to maintain this code. For instance, if we happened to improve the deeply nested plan problem, we would have to find another number to set as the default. Also, a configuration for logging does look like overkill. We could go for documentation first and consider the current fix later if users keep hitting this pattern.
@HyukjinKwon I don't think documentation is very effective. Let me ask for others' opinions on this: @dongjoon-hyun @gatorsmile, what do you think?
This seems to have gotten a bit stale. I think there are two possible approaches:
- warn the user when too many consecutive Projects are detected (the current change);
- document the limitation and the single-select workaround.

Can we find agreement on which of these two possible paths to follow? Thanks.
Let's do documentation first, and then warn later if people still face this issue.
OK, thanks @HyukjinKwon
Test build #104048 has finished for PR 23285 at commit
Merged to master. Let's consider the actual fix later if we see users still complaining.
Ahh... actually we should fix the R and Python sides too. Let me make a quick follow-up.
…s in withColumn at SparkR and PySpark as well

## What changes were proposed in this pull request?

This is a followup of apache#23285. This PR adds the notes into the PySpark and SparkR documentation as well. While I am here, I revised the doc a bit to make it sound a bit more neutral.

## How was this patch tested?

Manually built the doc and verified.

Closes apache#24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

We have seen many cases where users make several subsequent calls to `withColumn` on a Dataset. This currently leads to the generation of a lot of `Project` nodes on top of the plan, a serious problem which can also lead to `StackOverflowException`s.

The PR improves the doc of `withColumn` in order to advise the user to avoid this pattern and do something different, i.e. a single `select` with all the columns he/she needs.

How was this patch tested?

NA
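For reference, the single-select pattern the doc note steers users toward looks roughly like this (an illustrative sketch; `df` stands for any existing DataFrame):

```scala
import org.apache.spark.sql.functions._

// Instead of adding columns one at a time with chained withColumn calls
// (one Project node per call), build the new columns up front and add
// them all in a single select (one Project node total).
val newCols = (1 to 100).map(i => lit(i).as(s"c$i"))
val result = df.select((col("*") +: newCols): _*)
```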