Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

To support `FETCH_FIRST`, SPARK-16563 used Scala `Iterator.duplicate`. However,
Scala `Iterator.duplicate` uses a queue to buffer all items between both iterators,
which causes GC pressure and hangs for queries with a large number of rows. We should not use it,
especially for `spark.sql.thriftServer.incrementalCollect`.

https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300
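For reference, a minimal standalone sketch (plain Scala, not Spark code) of the buffering behavior described above:

// Consuming one of the duplicated iterators forces every element into an
// internal queue so the other iterator can replay them, so the whole
// stream ends up held in memory at once.
val (a, b) = Iterator.range(0, 1000000).duplicate
a.foreach(_ => ())   // drains `a`; every element is queued for `b`
println(b.size)      // 1000000 -- `b` replays the buffered elements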

How was this patch tested?

Pass the existing tests.

…lect` in Thrift Server.

Since Scala `Iterator.duplicate` uses a queue to buffer all items between both iterators,
it causes GC pressure and hangs. We should not use it, especially for `spark.sql.thriftServer.incrementalCollect`.

https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300
@SparkQA

SparkQA commented Dec 30, 2016

Test build #70751 has finished for PR 16440 at commit 125e79c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private def useIncrementalCollect: Boolean = {
Member

Does this need to be a def? Will it ever change?

Member Author

Yep. When we use beeline, we can control this as follows.

0: jdbc:hive2://localhost:10000> set spark.sql.thriftServer.incrementalCollect=false;
+--------------------------------------------+--------+--+
|                    key                     | value  |
+--------------------------------------------+--------+--+
| spark.sql.thriftServer.incrementalCollect  | false  |
+--------------------------------------------+--------+--+
1 row selected (0.015 seconds)

0: jdbc:hive2://localhost:10000> select * from t;
+----+--+
| a  |
+----+--+
+----+--+
No rows selected (0.054 seconds)
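For illustration, a minimal sketch of why a def matters here, reusing the conf lookup shown later in this thread: a def re-reads the conf on every call, so a session-level set takes effect on the next fetch.

// Re-evaluated on each call, so `set spark.sql.thriftServer.incrementalCollect=...`
// in beeline is picked up immediately.
private def useIncrementalCollect: Boolean =
  sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", "false").toBoolean

// A `val` would instead freeze the value when the operation is created:
// private val useIncrementalCollect = ...  // would ignore later `set` commands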

with Logging {

private var result: DataFrame = _
private var resultList: Option[Array[org.apache.spark.sql.Row]] = _
Member

Write SparkRow for consistency? And initialize it to None explicitly?

Member Author

Thank you for review, @srowen.
Sure!

resultList = None
result.toLocalIterator.asScala
} else {
if (resultList.isEmpty) {
Member

I agree that this makes the implicit buffering explicit. So, if an iterator is duplicated into A and B, and all of A is consumed, then B will internally buffer everything from A so it can be replayed? And in our case, we know that A will be entirely consumed? Then these are basically the same, yes.

But does that solve the problem? This now always stores the whole result set locally. Is this avoiding a second whole copy of it?

What if you always just return result.collect().iterator here -- is the problem re-collecting the result every time?

Member Author

Yes. The following happens with Iterator.duplicate.

So, if an iterator is duplicated into A and B, and all of A is consumed, then B will internally buffer everything from A so it can be replayed?

And the whole-result storing happens at line 122 and lines 245-246, for spark.sql.thriftServer.incrementalCollect=false only.

resultList = Some(result.collect())
resultList.get.iterator

Member

I suppose I'm asking: why is this an improvement? Because in the new version, you also buffer the whole result into memory locally.

Member Author

@dongjoon-hyun Jan 1, 2017

Correct. There are two cases, and this PR mainly targets incrementalCollect=true.

you also buffer the whole result into memory locally.

First of all, before SPARK-16563, FETCH_FIRST was not supported correctly because an iterator can be traversed only once.

  • Case 1) incrementalCollect=false
    Creating the whole result in memory once via result.collect was the original Spark behavior before SPARK-16563.
    If we can create the whole result once during query processing, I think we can keep it for FETCH_FIRST with fewer side effects.
    So, I keep that behavior. If this is not allowed, we have to go with Case 2.

  • Case 2) incrementalCollect=true
    In this case, by definition, we cannot hold the whole result set in memory at any time during query processing. There is no way to return to the first row with an iterator, so result.toLocalIterator.asScala should be called whenever FETCH_FIRST is used (see the sketch below).
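A sketch of the two paths, stitched together from the snippets quoted in this thread (the iter, result, and resultList names follow the diff; this is illustrative, not the exact patch):

iter = if (useIncrementalCollect) {
  // Case 2: never materialize the whole result set; FETCH_FIRST restarts
  // from a fresh local iterator, which may re-execute the query.
  resultList = None
  result.toLocalIterator.asScala   // needs scala.collection.JavaConverters._
} else {
  // Case 1: collect once and keep the reference, so FETCH_FIRST can replay
  // the cached rows without re-execution.
  resultList = Some(result.collect())
  resultList.get.iterator
}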

Member

OK I believe I get it now. I see your approach and it makes sense.
The only real change here is that you hold on to a reference to the whole data set rather than collect() it into memory. Maybe that's the right thing to do, but that's the only thing I'm wondering about. Previously, it seems like it would collect() each time anyway?

Just wondering if that's actually simpler, to avoid keeping a reference to the whole data set, or whether that defeats a purpose.

Member Author

collect() is still intended to be called once logically. The following is why there exist two collect() calls.

When useIncrementalCollect=false, collect() is called once at line 244, and resultList will not be None.

However, a user may execute a query with useIncrementalCollect=true and then change their mind and turn it off with useIncrementalCollect=false. In that case, the next getNextRowSet(FetchOrientation.FETCH_FIRST) should check resultList and fill it by calling collect() once at line 123.
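A sketch of that FETCH_FIRST handling, again using the names from the diff (illustrative only):

if (order.equals(FetchOrientation.FETCH_FIRST)) {
  iter = if (useIncrementalCollect) {
    resultList = None                  // drop any cached rows
    result.toLocalIterator.asScala     // restart from the first row
  } else {
    if (resultList.isEmpty) {
      // e.g. the query ran with incrementalCollect=true and the user then
      // turned it off: fill the cache by calling collect() once, now.
      resultList = Some(result.collect())
    }
    resultList.get.iterator
  }
}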

@SparkQA

SparkQA commented Jan 1, 2017

Test build #70771 has finished for PR 16440 at commit 00bb52d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen left a comment

OK, I think that makes sense. @alicegugu @ericl @rxin do you have any comments?

@dongjoon-hyun
Member Author

Thank you again, @srowen.

Hi, @alicegugu, @ericl, @rxin. Could you give me your opinions about this?

Contributor

@ericl left a comment

Overall looks good, but I think some comments would be helpful.

}

private def useIncrementalCollect: Boolean = {
sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", "false").toBoolean
Contributor

Can we document this configuration flag in SQLConf?

Member Author

Oh, I see. I'll register this configuration into SQLConf explicitly.

with Logging {

private var result: DataFrame = _
private var resultList: Option[Array[SparkRow]] = _
Contributor

@ericl Jan 5, 2017

Can we document these two fields? E.g., we cache the returned rows in resultList in case the user wants to use FETCH_FIRST; this is only used when incremental collect is set to false, otherwise FETCH_FIRST will trigger re-execution.

Member Author

Sure! Thank you for review, @ericl.
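A sketch of what the documented fields might look like (wording adapted from the suggestion above):

private var result: DataFrame = _

// Caches the returned rows so that FETCH_FIRST can replay them. Only used
// when incremental collect is disabled; otherwise FETCH_FIRST triggers
// re-execution of the query.
private var resultList: Option[Array[SparkRow]] = None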

@dongjoon-hyun
Member Author

The PR is updated as follows.

  • Add SQLConf.THRIFTSERVER_INCREMENTAL_COLLECT.
  • Remove private def useIncrementalCollect.
  • Add description for resultList variable.

@SparkQA

SparkQA commented Jan 5, 2017

Test build #70942 has finished for PR 16440 at commit fb4ee89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @ericl and @srowen.
If there is anything more to do, please let me know.
Thank you always.

.stringConf
.createOptional

val THRIFTSERVER_INCREMENTAL_COLLECT =
Member

@srowen Jan 6, 2017

I think this makes the option more "public"; I see some other options here marked as .internal(). I don't know whether this is meant to be further exposed. It might be more conservative to make it internal for the moment? But yes, it seems sensible to make a config key constant like this.

Member Author

@dongjoon-hyun Jan 6, 2017

Oh, I see. I'll make it internal. According to the current usage, internal seems to be better.
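A sketch of what the internal registration might look like, following the builder style visible in the diff above (the exact builder name and doc string are assumptions):

val THRIFTSERVER_INCREMENTAL_COLLECT =
  SQLConfigBuilder("spark.sql.thriftServer.incrementalCollect")
    .internal()  // not part of the public, documented configuration surface
    .doc("When true, enable incremental collection for execution in Thrift Server.")
    .booleanConf
    .createWithDefault(false)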

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70989 has finished for PR 16440 at commit e66b165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you for approving, @srowen.

@dongjoon-hyun
Member Author

Hi, @srowen.
Could you merge this PR?

@asfgit asfgit closed this in a2c6adc Jan 10, 2017
@dongjoon-hyun
Member Author

Thank you for merging, @srowen!

@dongjoon-hyun
Member Author

Hi, @srowen. May I create a backport for 2.0 and 2.1?
#14218 was merged into branch-2.0 and branch-2.1.

asfgit pushed a commit that referenced this pull request Jan 12, 2017
…lect` in Thrift Server

Closes #16440 from dongjoon-hyun/SPARK-18857.

(cherry picked from commit a2c6adc)
Signed-off-by: Sean Owen <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 12, 2017
…lect` in Thrift Server

Closes #16440 from dongjoon-hyun/SPARK-18857.

(cherry picked from commit a2c6adc)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented Jan 12, 2017

Merged to 2.1/2.0 as well. I agree, it's a clean fix for a non-trivial problem.

@dongjoon-hyun
Member Author

Thank you so much, @srowen!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…lect` in Thrift Server

Closes apache#16440 from dongjoon-hyun/SPARK-18857.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…lect` in Thrift Server

Closes apache#16440 from dongjoon-hyun/SPARK-18857.
@dongjoon-hyun dongjoon-hyun deleted the SPARK-18857 branch January 7, 2019 07:03