
Conversation

@mgaido91
Contributor

@mgaido91 mgaido91 commented Dec 11, 2018

What changes were proposed in this pull request?

We have seen many cases where users make several subsequent calls to withColumn on a Dataset. This currently generates a long chain of Project nodes on top of the plan, which causes serious performance problems and can even lead to StackOverflowErrors.

The PR improves the documentation of withColumn to advise users to avoid this pattern and do something different, i.e. a single select with all the columns they need.
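For illustration, a minimal sketch of the two patterns being discussed (assuming a running SparkSession named `spark`; the column names are invented):

```scala
import org.apache.spark.sql.functions.{col, lit}

// Discouraged pattern: every withColumn call stacks a new Project node
// on top of the plan, so the plan depth grows with the number of columns.
var dfSlow = spark.range(1).toDF("id")
(1 to 100).foreach { i =>
  dfSlow = dfSlow.withColumn(s"c$i", lit(i))
}

// Advised pattern: build all the columns first, then add them with a
// single select, which creates only one Project node.
val newCols = (1 to 100).map(i => lit(i).as(s"c$i"))
val dfFast = spark.range(1).toDF("id").select(col("*") +: newCols: _*)
```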

How was this patch tested?

NA

@mgaido91
Contributor Author

cc @cloud-fan @viirya

}.map { case (colName, col) => col.as(colName).named }

select(replacedAndExistingColumns ++ newColumns : _*)
CollapseProject(Project(replacedAndExistingColumns ++ newColumns, logicalPlan))
Contributor

Can we reduce the scope of this optimization? e.g. if the root node of this query is Project, update its project list to include withColumns, otherwise add a new Project.

Contributor Author

I don't think we can do that. Imagine the case where each column depends on the previously added one: if we did that, we would end up with an invalid plan. Or am I missing something?
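To make the concern concrete, a hypothetical example (column names invented) of a chain where simply extending the root Project's list would go wrong:

```scala
// Each withColumn references the column added by the previous call:
val df2 = df
  .withColumn("b", col("a") + 1)   // Project [a, (a + 1) AS b]
  .withColumn("c", col("b") * 2)   // Project [a, b, (b * 2) AS c]

// Naively merging both into one project list would give
//   Project [a, (a + 1) AS b, (b * 2) AS c]
// where c refers to b defined in the same projection -- an invalid plan,
// unless b's expression is first inlined, i.e. ((a + 1) * 2) AS c.
```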

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99969 has finished for PR 23285 at commit da2c82e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* the same names.
*/
private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = withPlan {
Member

As stated on the JIRA ticket, the problem is a deep query plan. There are many ways to create such a deep query plan, not only withColumns; for example, you can call select many times to do that too. This change makes withColumns a special case.

Contributor Author

yes, but I think this is a special case. I have seen many cases where withColumn is used in for loops: with this change such a pattern would be better supported.

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99975 has finished for PR 23285 at commit a40db10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99981 has finished for PR 23285 at commit a40db10.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

the test failure shows a potential regression of this patch that I hadn't thought of: collapsing the projects may prevent the reuse of cached plans. Unfortunately, this cannot even be checked here, because the plan may be cached later.

@cloud-fan @viirya I don't have any idea how to address this potential regression. So if you don't have suggestions, I'll close this PR. What do you think? Thanks.

@HeartSaVioR
Contributor

HeartSaVioR commented Dec 13, 2018

IMHO, given that it's not easy to make chained withColumn calls work like one projection, we might want to consider exposing withColumns publicly. Independent chains could be collapsed into a single withColumns call, and even when some columns depend on others, the depth of the projection chain could be greatly reduced.
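For context, the PR's diff already contains an internal helper with this signature; the usage below is only a sketch of what a public variant might look like, not an existing API:

```scala
// Existing internal helper (from this PR's diff):
//   private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame

// Hypothetical public usage: a 100-iteration withColumn loop collapses
// into a single call, i.e. a single Project node.
val names = (1 to 100).map(i => s"c$i")
val cols  = (1 to 100).map(i => lit(i))
val dfOut = df.withColumns(names, cols)
```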

@mgaido91
Contributor Author

@HeartSaVioR I'd rather say that what withColumns does can just as easily be achieved using select, so I am not sure it is useful to expose it. There is already a "workaround" for this problem; this PR is just meant to avoid that issue, since many people use this pattern.

@HeartSaVioR
Contributor

HeartSaVioR commented Dec 14, 2018

@mgaido91 Yeah, agreed there's a workaround (select), and it would be more ideal if we could collapse a chain of withColumn calls properly. Since that doesn't seem achievable, we need to guide end users toward the workaround. End users seem to find withColumn more convenient than select, so IMHO withColumns might become another convenient method for end users, as an alternative to withColumn.

Sure, no strong opinion, just my 2 cents.

@mgaido91
Contributor Author

@HeartSaVioR I am just telling you my experience: I remember that in one of my very first projects with Spark I also used withColumn in a for loop, because it was easier/more convenient for me to work with one expression at a time. Once you realize it is a bad idea for this reason, having a withColumns method versus using select doesn't make a big difference, as in either case you have to build your columns in advance and then pass them to the method. In this sense, I don't see the withColumns method being very useful.

As an alternative, I'd propose to check, when calling withColumn, whether there are several Project nodes on top (we can define a threshold, e.g. 50), and in that case emit a warning saying something like: "Your plan contains many Project nodes on top of each other. This usually happens if you are using withColumn in a for loop and you are adding many columns. Doing this is highly discouraged and can cause serious issues. Please use a single select and add all your new columns to it instead." What do you think? cc @cloud-fan @viirya too.

@HeartSaVioR
Contributor

It sounds good to me if we can provide a warning message guiding users to replace those calls with select. IMHO, the real issue is basically that end users don't know that chaining withColumn harms query performance. Once we can guide them, that should be enough.

@mgaido91 mgaido91 changed the title [SPARK-26224][SQL] Avoid creating many project on subsequent calls to withColumn [SPARK-26224][SQL] Warn and advice the user when creating many project on subsequent calls to withColumn Dec 17, 2018
@mgaido91
Contributor Author

I have updated the PR with the WARN approach. I can make the threshold configurable if we agree on this. WDYT @cloud-fan @HeartSaVioR @viirya ?

sparkSession.sessionState.conf.caseSensitiveAnalysis)
var numProjects = 0
var currPlan = logicalPlan
while (currPlan.isInstanceOf[Project] && numProjects < 50) {
Member

@dongjoon-hyun dongjoon-hyun Dec 17, 2018

Yep. If we need to warn, +1 for adding a new configuration for this value instead of hardcoding 50 here and at line 2164.

50 looks effective for detecting this pattern, but can we have a higher value that is more practically related to what the warning message describes (performance degradation or OOM)?

Contributor Author

yes, I just wanted to be sure that we agree on the idea. Do you have hints/preferences for the name of the config?
I didn't want to introduce a high value, in order for the check loop itself not to have a significant impact on performance. What do you think?

Contributor

@HeartSaVioR HeartSaVioR Dec 17, 2018

How about counting only consecutive projections, and keeping or even reducing the threshold? I can't imagine end users creating more than, say, 20 consecutive projections via withColumn/drop/etc. instead of select.

Contributor Author

What do you mean @HeartSaVioR? I don't think it is a good idea to add a counter in the Dataset class, which, moreover, would have to be carried over when creating a new Dataset, otherwise it would be useless. That'd be overkill for this IMO.

Contributor

My bad. I just re-read the code (the while loop) and now see that this implementation already considers only consecutive projections. Sorry for the confusion.

Contributor Author

np, thanks for your comment

@SparkQA

SparkQA commented Dec 17, 2018

Test build #100253 has finished for PR 23285 at commit e37c2d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2019

Test build #101176 has finished for PR 23285 at commit 802cb9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@HeartSaVioR HeartSaVioR left a comment

LGTM

"disable completely the check.")
.intConf
.checkValue(_ >= 0, "The max number of projects cannot be negative.")
.createWithDefault(50)
Member

50? Before we do anything, could you first show the perf numbers?

Contributor Author

sure, I used this code for running some tests on my local machine:

    def time[R](block: => R): R = {
      val t0 = System.nanoTime()
      val result = block    // call-by-name
      val t1 = System.nanoTime()
      println("Elapsed time: " + (t1 - t0) + " ns")
      result
    }
    def withNCol(nCol: Int): org.apache.spark.sql.DataFrame = {
      var dfOut = spark.range(1).toDF
      (1 to nCol).foreach{ i => dfOut = dfOut.withColumn(s"c$i", lit(i))}
      dfOut
    }
    time { withNCol(10).queryExecution.sparkPlan }
Elapsed time: 81422845 ns
    time { withNCol(50).queryExecution.sparkPlan }
Elapsed time: 252820355 ns
    time { withNCol(100).queryExecution.sparkPlan }
Elapsed time: 568628677 ns
    time { withNCol(200).queryExecution.sparkPlan }
Elapsed time: 1150096346 ns
    time { withNCol(500).queryExecution.sparkPlan }
Elapsed time: 8255914278 ns
    time { withNCol(1000).queryExecution.sparkPlan }
Elapsed time: 33032475637 ns
    time { withNCol(2000).queryExecution.sparkPlan }
Elapsed time: 254183356160 ns

Maybe a reasonable number is 200?

@HyukjinKwon
Member

How about we simply leave a note in the doc saying, for instance, that multiple calls of this method create a deeply nested plan and that the workaround is, for instance, to select multiple columns at once? I think that would reach the goal here. We can just document and advise.

Changes such as a configuration, benchmarking to find the most appropriate number, or finding a general fix for the root cause look quite like overkill, considering the goal is just to let users know about the limitation and the workaround.

@mgaido91
Contributor Author

@HyukjinKwon I think a warning is more effective than just a note in the doc. Since this is a very bad pattern to use, and quite a widespread one, I think we should do as much as possible to keep users from falling into it.

@HyukjinKwon
Member

I agree that this is a bad pattern and a rather common mistake that users make from time to time. A warning can be more effective.

However, I was wondering whether the current change is worth it, considering that it makes the code more complicated and adds some maintenance overhead. For instance, if we ever improve the deeply nested plan problem, we would have to find another number to set as the default. Also, a configuration just for logging does look like overkill.

We could go for documentation first and consider the current fix later if users keep running into this pattern.

@mgaido91
Contributor Author

@HyukjinKwon I don't think documentation is very effective. Let me ask for others' opinions on this: @dongjoon-hyun @gatorsmile what do you think?

@mgaido91
Contributor Author

This seems to have gotten a bit stale. I think there are 2 possible approaches:

  • using the doc, as suggested by @HyukjinKwon ;
  • emitting a WARN, which I think is more effective.

Can we find agreement on which of these 2 possible paths to follow? Thanks.

@HyukjinKwon
Member

Let's do documentation first, and then warn later if people still face this issue.

@mgaido91
Contributor Author

ok, thanks @HyukjinKwon

@mgaido91 mgaido91 changed the title [SPARK-26224][SQL] Warn and advice the user when creating many project on subsequent calls to withColumn [SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn Mar 28, 2019
@SparkQA

SparkQA commented Mar 28, 2019

Test build #104048 has finished for PR 23285 at commit 5d1fc00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

Let's consider the actual fix later if we see that users still complain.

@HyukjinKwon
Member

Ahh ... actually we should fix the R and Python sides too. Let me make a quick followup.

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Apr 2, 2019
…s in withColumn at SparkR and PySpark as well

## What changes were proposed in this pull request?

This is a followup of apache#23285. This PR adds the notes into PySpark and SparkR documentation as well.

While I am here, I revised the doc a bit to make it sound a bit more neutral

## How was this patch tested?

Manually built the doc and verified.

Closes apache#24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>