[SPARK-18857][SQL] Don't use Iterator.duplicate for incrementalCollect in Thrift Server
#16440
Conversation
…lect` in Thrift Server. Since Scala `Iterator.duplicate` uses a queue to buffer all items between both iterators, it causes GC pressure and hangs. We should not use it, especially for `spark.sql.thriftServer.incrementalCollect`. https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300
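For illustration, a minimal self-contained sketch of that buffering behavior in plain Scala (not Spark code; the small range stands in for a large result set):

```scala
object DuplicateBufferingDemo extends App {
  // Iterator.duplicate backs both iterators with one source plus a shared
  // queue: elements consumed by the leading iterator are buffered until the
  // lagging iterator catches up.
  val (a, b) = Iterator.range(0, 10).duplicate

  a.foreach(_ => ())   // fully consuming `a` queues all 10 elements for `b`
  println(b.length)    // prints 10: `b` replays the buffered elements

  // With millions of rows, that queue pins the entire result set in driver
  // memory -- the GC pressure and hangs described above.
}
```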
Test build #70751 has finished for PR 16440 at commit
private def useIncrementalCollect: Boolean = {
Does this need to be a `def`? Will it ever change?
Yep. When we use beeline, we can control this as follows.
0: jdbc:hive2://localhost:10000> set spark.sql.thriftServer.incrementalCollect=false;
+--------------------------------------------+--------+--+
| key | value |
+--------------------------------------------+--------+--+
| spark.sql.thriftServer.incrementalCollect | false |
+--------------------------------------------+--------+--+
1 row selected (0.015 seconds)
0: jdbc:hive2://localhost:10000> select * from t;
+----+--+
| a |
+----+--+
+----+--+
No rows selected (0.054 seconds)

private var result: DataFrame = _
private var resultList: Option[Array[org.apache.spark.sql.Row]] = _
Write `SparkRow` for consistency? And init to `None` explicitly?
Thank you for the review, @srowen.
Sure!
resultList = None
result.toLocalIterator.asScala
} else {
  if (resultList.isEmpty) {
I agree that this makes the implicit buffering explicit. So, if an iterator is duplicated into A and B, and all of A is consumed, then B will internally buffer everything from A so it can be replayed? And in our case, we know that A will be entirely consumed? Then these are basically the same, yes.
But does that solve the problem? This now always stores the whole result set locally. Is this avoiding a second whole copy of it?
What if you always just return result.collect().iterator here -- is the problem re-collecting the result every time?
Yes. The following happens with `Iterator.duplicate`.

> So, if an iterator is duplicated into A and B, and all of A is consumed, then B will internally buffer everything from A so it can be replayed?

And the whole-result storing happens at line 122 and lines 245-246, for `spark.sql.thriftServer.incrementalCollect=false` only.

resultList = Some(result.collect())
resultList.get.iterator
I suppose I'm asking: why is this an improvement? Because in the new version, you also buffer the whole result into memory locally.
Correct. There are two cases, and this PR mainly targets `incrementalCollect=true`.

> you also buffer the whole result into memory locally.

First of all, before SPARK-16563, `FETCH_FIRST` was not supported correctly because an iterator can be traversed only once.

- Case 1) `incrementalCollect=false`
  Creating the whole result in memory once via `result.collect` was the original Spark way before SPARK-16563. If we can create the whole result once during query processing, I think we can keep it for `FETCH_FIRST` with fewer side effects. So, I keep that behavior. If this is not allowed, we have to go with Case 2.
- Case 2) `incrementalCollect=true`
  In this case, by definition, we cannot create the whole result set in memory at any point during query processing. There is no way to find the first row with the iterator alone, so `result.toLocalIterator.asScala` should be called whenever `FETCH_FIRST` is used.
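Putting the two cases together, the selection logic reads roughly like the following sketch. It paraphrases the diff snippets quoted in this thread (`result`, `resultList`, `useIncrementalCollect`, `SparkRow`) and is not the exact merged source:

```scala
import scala.collection.JavaConverters._

// Sketch only: assumes the surrounding class fields discussed above
// (result: DataFrame, resultList: Option[Array[SparkRow]]).
private def resultIterator(): Iterator[SparkRow] = {
  if (useIncrementalCollect) {
    // Case 2: never materialize the whole result on the driver; every
    // FETCH_FIRST rebuilds the iterator from the query.
    resultList = None
    result.toLocalIterator.asScala
  } else {
    // Case 1: collect once, cache, and replay from memory on FETCH_FIRST.
    if (resultList.isEmpty) {
      resultList = Some(result.collect())
    }
    resultList.get.iterator
  }
}
```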
OK, I believe I get it now. I see your approach and it makes sense.
The only real change here is that you hold on to a reference to the whole data set rather than collect() it into memory. Maybe that's the right thing to do, but that's the only thing I'm wondering about. Previously it seems like it would collect() each time anyway?
Just wondering whether it's actually simpler to avoid keeping a reference to the whole data set, or whether that defeats the purpose.
collect() is still intended to be called only once logically. The following is the reason why there exist two collect() call sites.
When `useIncrementalCollect=false`, collect() is called once at line 244 and `resultList` will not be `None`.
However, if a user executes a query with `useIncrementalCollect=true` and then changes their mind and turns it off with `useIncrementalCollect=false`, the next `getNextRowSet(FetchOrientation.FETCH_FIRST)` should check `resultList` and fill it by calling collect() once at line 123.
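In other words, the `FETCH_FIRST` path behaves roughly like this fragment (`getNextRowSet` and `FetchOrientation` come from the Hive Thrift layer; `iter` and `resultIterator()` refer to the sketch above, and surrounding details are elided):

```scala
// Fragment of getNextRowSet, paraphrased from the discussion above.
if (order.equals(FetchOrientation.FETCH_FIRST)) {
  // Rebuild the iterator: replays the cached resultList when incremental
  // collect is off, or re-executes via toLocalIterator when it is on.
  iter = resultIterator()
}
```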
Test build #70771 has finished for PR 16440 at commit
srowen left a comment
OK, I think that makes sense. @alicegugu @ericl @rxin do you have any comments?
Thank you again, @srowen. Hi, @alicegugu, @ericl, @rxin. Could you give me your opinions on this?
ericl left a comment
Overall looks good, but I think some comments would be helpful.
private def useIncrementalCollect: Boolean = {
  sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", "false").toBoolean
Can we document this configuration flag in SQLConf?
Oh, I see. I'll register this configuration into SQLConf explicitly.
private var result: DataFrame = _
private var resultList: Option[Array[SparkRow]] = _
Can we document these two fields? E.g., we cache the returned rows in `resultList` in case the user wants to use `FETCH_FIRST`. This is only used when incremental collect is set to false; otherwise `FETCH_FIRST` will trigger re-execution.
Sure! Thank you for the review, @ericl.
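A sketch of what that documentation might look like on the fields in question, based on the wording above (not the exact committed comment):

```scala
private var result: DataFrame = _

/**
 * Caches the returned rows so FETCH_FIRST can replay them. Only used when
 * incremental collect is disabled; with incremental collect enabled,
 * FETCH_FIRST triggers re-execution instead.
 */
private var resultList: Option[Array[SparkRow]] = None
```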
The PR is updated as follows.
Test build #70942 has finished for PR 16440 at commit
  .stringConf
  .createOptional

val THRIFTSERVER_INCREMENTAL_COLLECT =
I think this makes this option more "public"; I see some other options here marked as .internal(). I don't know whether this is meant to be further exposed, so it might be more conservative to make it internal for the moment. But yes, it seems sensible to make a config key constant like this.
Oh, I see. I'll make it internal. Given the current usage, internal seems better.
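A plausible shape for the internal registration, combining the diff fragment and the `.internal()` suggestion above (the exact builder API and doc string in Spark's SQLConf may differ):

```scala
val THRIFTSERVER_INCREMENTAL_COLLECT =
  SQLConfigBuilder("spark.sql.thriftServer.incrementalCollect")
    .internal()  // keep it out of the public, documented config surface
    .doc("When true, the Thrift Server returns result rows incrementally " +
      "via toLocalIterator instead of collecting the whole result set " +
      "into driver memory.")
    .booleanConf
    .createWithDefault(false)
```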
Test build #70989 has finished for PR 16440 at commit
Thank you for approving, @srowen.
Hi, @srowen.
Thank you for merging, @srowen!
Merged to 2.1/2.0 as well. I agree, it's a clean fix for a non-trivial problem.
Thank you so much, @srowen!
What changes were proposed in this pull request?
To support `FETCH_FIRST`, SPARK-16563 used Scala `Iterator.duplicate`. However, Scala `Iterator.duplicate` uses a queue to buffer all items between both iterators, which causes GC pressure and hangs for queries with a large number of rows. We should not use this, especially for `spark.sql.thriftServer.incrementalCollect`.
https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300
How was this patch tested?
Pass the existing tests.