[SPARK-18857][SQL] Don't use Iterator.duplicate for incrementalCollect in Thrift Server #16440
Changes from all commits: 125e79c, 00bb52d, fb4ee89, e66b165
@@ -50,8 +50,13 @@ private[hive] class SparkExecuteStatementOperation(
   with Logging {
 
   private var result: DataFrame = _
+
+  // We cache the returned rows to get iterators again in case the user wants to use FETCH_FIRST.
+  // This is only used when `spark.sql.thriftServer.incrementalCollect` is set to `false`.
+  // In case of `true`, this will be `None` and FETCH_FIRST will trigger re-execution.
+  private var resultList: Option[Array[SparkRow]] = _
Contributor:
Can we document these two fields? E.g., we cache the returned rows in resultList in case the user wants to use FETCH_FIRST. This is only used when incremental collect is set to false; otherwise FETCH_FIRST will trigger re-execution.

Member (Author):
Sure! Thank you for review, @ericl.
+
   private var iter: Iterator[SparkRow] = _
-  private var iterHeader: Iterator[SparkRow] = _
   private var dataTypes: Array[DataType] = _
   private var statementId: String = _
@@ -111,9 +116,15 @@ private[hive] class SparkExecuteStatementOperation(
 
     // Reset iter to header when fetching start from first row
     if (order.equals(FetchOrientation.FETCH_FIRST)) {
-      val (ita, itb) = iterHeader.duplicate
-      iter = ita
-      iterHeader = itb
+      iter = if (sqlContext.getConf(SQLConf.THRIFTSERVER_INCREMENTAL_COLLECT.key).toBoolean) {
+        resultList = None
+        result.toLocalIterator.asScala
+      } else {
+        if (resultList.isEmpty) {
Member:
I agree that this makes the implicit buffering explicit. So, if an iterator is duplicated into A and B, and all of A is consumed, then B will internally buffer everything from A so it can be replayed? And in our case, we know that A will be entirely consumed? Then these are basically the same, yes. But does that solve the problem? This now always stores the whole result set locally. Is this avoiding a second whole copy of it? What if you always just returned result.collect().iterator here -- or is the problem the re-collecting of the result every time?

Member (Author):
Yes. The following happens with … And the whole-result storing happens at line 122 and lines 245-246 for …

Member:
I suppose I'm asking: why is this an improvement? Because in the new version, you also buffer the whole result into memory locally.

Member (Author):
Correct. There are two cases, and this PR targets … First of all, before SPARK-16563, …

Member:
OK, I believe I get it now. I see your approach and it makes sense. Just wondering if that's actually simpler, to avoid keeping a reference to the whole data set, or whether that defeats a purpose.

Member (Author):
When … However, if users execute a query with …
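The buffering behavior being discussed can be seen in isolation with a small standalone snippet (not part of the PR): Scala's Iterator.duplicate backs both copies with a shared internal queue, so fully consuming one copy leaves a complete in-memory copy queued up for the other.

```scala
// Standalone sketch (not from the PR) of scala.Iterator.duplicate's buffering.
object DuplicateBufferingDemo {
  def main(args: Array[String]): Unit = {
    val source = Iterator.range(0, 5)
    // duplicate returns two iterators that share one internal queue.
    val (a, b) = source.duplicate

    // Consuming all of A pushes every element into the shared queue so that
    // B can still replay them afterwards -- the data is buffered either way.
    val fromA = a.toList
    val fromB = b.toList

    println(fromA) // List(0, 1, 2, 3, 4)
    println(fromB) // List(0, 1, 2, 3, 4)
  }
}
```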
+          resultList = Some(result.collect())
+        }
+        resultList.get.iterator
+      }
     }
 
     if (!iter.hasNext) {
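Read in isolation, the FETCH_FIRST branch above amounts to the sketch below (hypothetical class and method names; plain Scala strings stand in for Spark's SparkRow, and a collectAll() closure stands in for result.collect() / result.toLocalIterator): rows are collected once into resultList and rewinding simply re-reads the cached array, while incremental mode re-executes instead of caching.

```scala
// Simplified, Spark-free sketch of the caching strategy in this hunk.
// All names here are illustrative, not the PR's actual API.
final class CachedRows(collectAll: () => Array[String], incremental: Boolean) {
  private var resultList: Option[Array[String]] = None
  private var iter: Iterator[String] = Iterator.empty

  /** Initial execution: cache the rows unless running in incremental mode. */
  def open(): Unit = {
    iter =
      if (incremental) {
        resultList = None
        collectAll().iterator // stand-in for result.toLocalIterator
      } else {
        resultList = Some(collectAll())
        resultList.get.iterator
      }
  }

  /** FETCH_FIRST: rewind by reusing the cached rows; no second full copy. */
  def rewind(): Unit = {
    iter =
      if (incremental) {
        resultList = None
        collectAll().iterator // incremental mode pays with a re-execution
      } else {
        if (resultList.isEmpty) {
          resultList = Some(collectAll())
        }
        resultList.get.iterator
      }
  }

  def nextBatch(n: Int): List[String] = iter.take(n).toList
}
```

The trade-off raised in the thread reads clearly here: the non-incremental path already materializes the whole result once, so keeping that single array and handing out fresh iterators over it avoids the second buffered copy that Iterator.duplicate would have accumulated.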
@@ -227,17 +238,14 @@
       }
       HiveThriftServer2.listener.onStatementParsed(statementId, result.queryExecution.toString())
       iter = {
-        val useIncrementalCollect =
-          sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", "false").toBoolean
-        if (useIncrementalCollect) {
+        if (sqlContext.getConf(SQLConf.THRIFTSERVER_INCREMENTAL_COLLECT.key).toBoolean) {
+          resultList = None
           result.toLocalIterator.asScala
         } else {
-          result.collect().iterator
+          resultList = Some(result.collect())
+          resultList.get.iterator
         }
       }
-      val (itra, itrb) = iter.duplicate
-      iterHeader = itra
-      iter = itrb
       dataTypes = result.queryExecution.analyzed.output.map(_.dataType).toArray
     } catch {
       case e: HiveSQLException =>
Member:
I think this makes this option more "public"; I see some other options here marked as .internal(). I don't know whether this is meant to be further exposed. It might be more conservative to make it internal for the moment? But yes, it seems sensible to make a config key constant like this.

Member (Author):
Oh, I see. I'll make that internal. According to the current usage, internal seems to be better.
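For context on the .internal() suggestion: SQLConf entries are declared through a small builder DSL, and marking one internal keeps the key out of the public configuration docs. A rough sketch of such a declaration is shown below; it is reconstructed from memory rather than copied from this PR, the builder helper is named SQLConfigBuilder in Spark 2.1 (buildConf in later versions), and the doc string is only illustrative.

```scala
// Sketch only -- the exact entry in Spark's SQLConf may differ in wording.
val THRIFTSERVER_INCREMENTAL_COLLECT =
  SQLConfigBuilder("spark.sql.thriftServer.incrementalCollect")
    .internal() // hidden from the public configuration documentation
    .doc("When true, the Thrift Server returns result rows incrementally " +
      "instead of collecting the full result set on the driver.")
    .booleanConf
    .createWithDefault(false)
```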