[SPARK-26224][SQL] Advise the user when creating many projects on subsequent calls to withColumn #23285
Conversation
```diff
   }.map { case (colName, col) => col.as(colName).named }

-  select(replacedAndExistingColumns ++ newColumns : _*)
+  CollapseProject(Project(replacedAndExistingColumns ++ newColumns, logicalPlan))
```
Can we reduce the scope of this optimization? E.g. if the root node of this query is a Project, update its project list to include the withColumns columns; otherwise add a new Project.
I don't think we can do that. Imagine the case where all the columns depend on the previously added one: if we did that, we would end up with an invalid plan. Or am I missing something?
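An illustrative sketch of the case described above (column names are made up; a spark-shell session where `spark` is in scope is assumed): each new column references the one added just before it, so the stacked Projects cannot simply be merged into the root project list.

```scala
import org.apache.spark.sql.functions._

val df = spark.range(1).toDF("c0")
  .withColumn("c1", col("c0") + 1) // c1 is defined by the first Project
  .withColumn("c2", col("c1") + 1) // c2 references c1, a column produced by
                                   // the previous Project, not by the base plan
```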
Test build #99969 has finished for PR 23285 at commit
```diff
  * the same names.
  */
-private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = withPlan {
```
As stated on the JIRA ticket, the problem is a deep query plan. There are many ways to create such a deep query plan, not only withColumns; for example, you can call select many times to do that too. This change makes withColumns a special case.
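An illustrative sketch of the point above: chained select calls stack one Project node per call just like withColumn does, so the deep-plan problem is not specific to withColumns.

```scala
import org.apache.spark.sql.functions._

var df = spark.range(1).toDF("c0")
(1 to 100).foreach { i =>
  df = df.select(col("*"), lit(i).as(s"c$i")) // each select adds a Project node
}
```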
Yes, but I think this is a special case: I have seen many cases where withColumn is used in for loops, and with this change such a pattern would be better supported.
Test build #99975 has finished for PR 23285 at commit
retest this please
Test build #99981 has finished for PR 23285 at commit
The test failure shows a potential regression due to this patch that I hadn't thought of: collapsing the Projects may prevent the reuse of cached plans. Unfortunately, this cannot even be checked up front, because the plan can be cached later. @cloud-fan @viirya I don't have any idea how to address this potential regression, so if you don't have suggestions, I'll close this PR. What do you think? Thanks.
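A hedged sketch of the concern (an assumed scenario, not a reproduced test): cache lookups replace subtrees that match the cached plan, so if withColumn eagerly collapses Projects, a plan built on top of a cached DataFrame may no longer contain the cached plan as a subtree.

```scala
import org.apache.spark.sql.functions._

val df = spark.range(10).withColumn("a", lit(1))
df.cache()
// With eager collapsing, df2's plan becomes a single Project directly over
// range(10), so it may no longer match df's cached plan and the cached data
// could be skipped.
val df2 = df.withColumn("b", lit(2))
df2.count()
```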
IMHO, given that it's not easy to make chained
@HeartSaVioR I'd rather say that
@mgaido91 Yeah, agreed there's a workaround (select). Sure, no strong opinion, just my 2 cents.
@HeartSaVioR I am just telling you what my experience is: I remember that in one of my very first works with Spark I used exactly this pattern.

As an alternative, I'd propose here to check whether there are several Projects on top (we can define a threshold, e.g. 50) when calling withColumn, and warn the user in that case.
It sounds good to me if we can provide a warning message guiding users to replace them with select. IMHO, the real issue is basically that end users don't know that chaining withColumn harms query performance. Once we can guide them, it should be enough.
I have updated the PR with the WARN approach. I can make the threshold configurable if we agree on this. WDYT @cloud-fan @HeartSaVioR @viirya?
```scala
    sparkSession.sessionState.conf.caseSensitiveAnalysis)
  var numProjects = 0
  var currPlan = logicalPlan
  while (currPlan.isInstanceOf[Project] && numProjects < 50) {
```
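The fragment above is cut off at the loop header. A minimal sketch of how such a check could continue (an assumed continuation for illustration, not necessarily the PR's exact code; a `logWarning` helper is assumed to be in scope):

```scala
// Count consecutive Project nodes from the root, stopping early once the
// threshold is hit, then warn the user.
while (currPlan.isInstanceOf[Project] && numProjects < 50) {
  numProjects += 1
  currPlan = currPlan.asInstanceOf[Project].child
}
if (numProjects >= 50) {
  logWarning("The query plan contains many consecutive Project nodes. " +
    "This usually comes from chained withColumn calls; consider using a " +
    "single select with all the needed columns instead.")
}
```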
Yep. If we need to warn, +1 for adding a new configuration for this value instead of the hard-coded 50 here and at line 2164.
50 looks effective for detecting this pattern, but can we have a higher value that is more practically related to the warning message (performance degradation or OOM)?
Yes, I just wanted to be sure that we agree on the idea. Do you have hints/preferences for the name of the config?
I didn't want to introduce a high value, in order not to have a high perf impact from the loop that checks this. What do you think?
How about checking the count of consecutive projections and keeping/reducing that count? I can't imagine end users creating more than (say) 20 consecutive projections, unless they use withColumn/drop/etc. instead of select.
What do you mean, @HeartSaVioR? I don't think it is a good idea to add a counter to the Dataset class, which, moreover, would have to be carried over when creating a new Dataset, otherwise it would be useless. That'd be overkill for this IMO.
My bad. I just re-read the code (the while loop) and now see that this implementation already considers only consecutive projections. Sorry about the confusion.
np, thanks for your comment
Test build #100253 has finished for PR 23285 at commit
Test build #101176 has finished for PR 23285 at commit
HeartSaVioR left a comment
LGTM
| "disable completely the check.") | ||
| .intConf | ||
| .checkValue(_ >= 0, "The max number of projects cannot be negative.") | ||
| .createWithDefault(50) |
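For context, the fragment above is the tail end of a SQLConf entry. A sketch of what the full declaration could look like inside SQLConf (the config key and doc wording here are assumptions, not the PR's exact values):

```scala
// Assumed shape, for illustration: buildConf is SQLConf's standard entry
// builder; per the checkValue above, a value of 0 disables the check.
val MAX_CONSECUTIVE_PROJECTS = buildConf("spark.sql.maxConsecutiveProjects")
  .doc("When greater than 0, warn if a query plan contains more than this " +
    "many consecutive Project nodes. Set it to 0 to " +
    "disable completely the check.")
  .intConf
  .checkValue(_ >= 0, "The max number of projects cannot be negative.")
  .createWithDefault(50)
```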
50? Before we do anything, could you first show the perf numbers?
Sure, I used this code for running some tests on my local machine:

```scala
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + " ns")
  result
}

def withNCol(nCol: Int): org.apache.spark.sql.DataFrame = {
  var dfOut = spark.range(1).toDF
  (1 to nCol).foreach { i => dfOut = dfOut.withColumn(s"c$i", lit(i)) }
  dfOut
}
```
```
time { withNCol(10).queryExecution.sparkPlan }
Elapsed time: 81422845 ns
time { withNCol(50).queryExecution.sparkPlan }
Elapsed time: 252820355 ns
time { withNCol(100).queryExecution.sparkPlan }
Elapsed time: 568628677 ns
time { withNCol(200).queryExecution.sparkPlan }
Elapsed time: 1150096346 ns
time { withNCol(500).queryExecution.sparkPlan }
Elapsed time: 8255914278 ns
time { withNCol(1000).queryExecution.sparkPlan }
Elapsed time: 33032475637 ns
time { withNCol(2000).queryExecution.sparkPlan }
Elapsed time: 254183356160 ns
```
Maybe a reasonable number is 200?
How about we simply leave a note in the documentation? Changes such as introducing a configuration, benchmarking to find out the most appropriate number, or finding a general fix for the root cause look quite overkill, considering the goal is just to let users know about the limitation and the workaround.
@HyukjinKwon I think a warning is more effective than just a note in the doc. Since this is a very bad pattern and quite a widespread one, I think we should do as much as possible to prevent users from falling into it.
I agree that this is a bad pattern and a rather common mistake that users make from time to time, and a warning can be more effective. However, I was wondering if it's worth making the current change, considering that it's more complicated and adds some overhead to maintain this code. For instance, if we happened to improve the deeply nested plan problem, we would have to find another number to set as the default. Also, a configuration for logging does look like overkill. We could go for documentation first and consider the current fix later if users keep hitting this pattern.
@HyukjinKwon I don't think documentation is very effective. Let me ask for others' opinions on this: @dongjoon-hyun @gatorsmile, what do you think?
This seems to have gotten a bit stale. I think there are two possible approaches:
- warn the user when too many consecutive Projects are detected (the current change);
- document the limitation and the single-select workaround.

Can we find agreement on which of these two possible paths to follow? Thanks.
Let's do documentation first, and then warn later if people still face this issue.
OK, thanks @HyukjinKwon
Test build #104048 has finished for PR 23285 at commit
Merged to master. Let's consider the actual fix later if we see users still complaining.
Ahh... actually we should fix the R and Python sides too. Let me make a quick follow-up.
…s in withColumn at SparkR and PySpark as well

## What changes were proposed in this pull request?

This is a followup of apache#23285. This PR adds the notes into the PySpark and SparkR documentation as well. While I am here, I revised the doc a bit to make it sound a bit more neutral.

## How was this patch tested?

Manually built the doc and verified.

Closes apache#24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

We have seen many cases where users make several subsequent calls to `withColumn` on a Dataset. This currently leads to the generation of a lot of `Project` nodes on top of the plan, a serious problem which can also lead to `StackOverflowException`s.

The PR improves the doc of `withColumn` in order to advise the user to avoid this pattern and do something different, i.e. a single `select` with all the columns he/she needs.

How was this patch tested?

NA
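For reference, the single-select pattern the doc note steers users toward looks roughly like this (an illustrative sketch; `df` stands for any existing DataFrame):

```scala
import org.apache.spark.sql.functions._

// Instead of adding columns one at a time with chained withColumn calls
// (one Project node per call), build the new columns up front and add
// them all in a single select (one Project node total).
val newCols = (1 to 100).map(i => lit(i).as(s"c$i"))
val result = df.select((col("*") +: newCols): _*)
```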