[SPARK-25497][SQL] Limit operation within whole stage codegen should not consume all the inputs #22524
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,6 +38,11 @@ public abstract class BufferedRowIterator { | |
|
|
||
| protected int partitionIndex = -1; | ||
|
|
||
| // This indicates whether the query execution should be stopped even if the input rows are still | ||
| // available. This is used in limit operator. When it reaches the given number of rows to limit, | ||
| // this flag is set and the execution should be stopped. | ||
| protected boolean isStopEarly = false; | ||
|
|
||
| public boolean hasNext() throws IOException { | ||
| if (currentRows.isEmpty()) { | ||
| processNext(); | ||
|
|
@@ -73,14 +78,26 @@ public void append(InternalRow row) { | |
| currentRows.add(row); | ||
| } | ||
|
|
||
| /** | ||
| * Sets the flag of stopping the query execution early under whole-stage codegen. | ||
| * | ||
| * This has two use cases: | ||
| * 1. Limit operators should call it with true when the given limit number is reached. | ||
| * 2. Blocking operators (sort, aggregate, etc.) should call it with false to reset it after | ||
| * consuming all records from upstream. | ||
| */ | ||
| public void setStopEarly(boolean value) { | ||
|
Contributor
Can we have more documentation on how to use it? For now I see 2 use cases:
Member
Author
OK, let me add it.
Member
Author
You also reminded me that we should reset the stop-early flag in the sort exec node too. I will add it and a related test. |
||
| isStopEarly = value; | ||
| } | ||
|
|
||
| /** | ||
| * Returns whether this iterator should stop fetching next row from [[CodegenSupport#inputRDDs]]. | ||
| * | ||
| * If it returns true, the caller should exit the loop that [[InputAdapter]] generates. | ||
| * This interface is mainly used to limit the number of input rows. | ||
| */ | ||
| public boolean stopEarly() { | ||
| - return false; | ||
| + return isStopEarly; | ||
| } | ||
|
|
||
| /** | ||
|
|
||
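The two use cases documented on `setStopEarly` can be illustrated outside Spark. What follows is a minimal, self-contained sketch, not Spark's generated code: `StopEarlySketch`, `limitThenSort`, and the `pulled` counter are hypothetical names, plain integers stand in for rows, and the shape of the input loop is an assumption. The limit step sets the flag once enough rows have been consumed, and the blocking sort step resets it after it has consumed its upstream.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for BufferedRowIterator; integers play the role of rows.
class StopEarlySketch {
    private boolean isStopEarly = false;
    // Counts how many rows were actually fetched from the input.
    int pulled = 0;

    public void setStopEarly(boolean value) { isStopEarly = value; }
    public boolean stopEarly() { return isStopEarly; }

    // Models a limit feeding a blocking sort within one loop over the input.
    public List<Integer> limitThenSort(Iterator<Integer> input, int limit) {
        List<Integer> buffer = new ArrayList<>();
        int count = 0;
        // Input loop (assumed shape): exits when the flag is set or input ends.
        while (!stopEarly() && input.hasNext()) {
            int row = input.next();
            pulled++;
            if (count < limit) {
                count++;
                buffer.add(row);            // the sort buffers the row
                if (count == limit) {
                    setStopEarly(true);     // use case 1: limit reached
                }
            }
        }
        // Use case 2: the blocking operator resets the flag after consuming
        // all records it needs from upstream, so its own output is not cut off.
        setStopEarly(false);
        Collections.sort(buffer);
        return buffer;
    }

    public static void main(String[] args) {
        StopEarlySketch it = new StopEarlySketch();
        List<Integer> out = it.limitThenSort(List.of(9, 3, 7, 1, 5).iterator(), 2);
        System.out.println(out + " pulled=" + it.pulled); // prints "[3, 9] pulled=2"
    }
}
```

In this sketch only two of the five input rows are fetched, and the flag is back to `false` by the time the sorted rows are returned.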
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -71,22 +71,15 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport { | |
| } | ||
|
|
||
| override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = { | ||
| - val stopEarly = | ||
| - ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "stopEarly") // init as stopEarly = false | ||
|
|
||
| - ctx.addNewFunction("stopEarly", s""" | ||
| - @Override | ||
| - protected boolean stopEarly() { | ||
| - return $stopEarly; | ||
| - } | ||
| - """, inlineToOuterClass = true) | ||
| val countTerm = ctx.addMutableState(CodeGenerator.JAVA_INT, "count") // init as count = 0 | ||
| s""" | ||
| | if ($countTerm < $limit) { | ||
| | $countTerm += 1; | ||
| | ${consume(ctx, input)} | ||
| - | } else { | ||
|
Contributor
Do we need to remove this? Isn't it safer to leave it here?
Member
Author
I don't think execution ever reaches this branch. If it does, there is a bug somewhere. |
||
| - | $stopEarly = true; | ||
| + | | ||
| + | if ($countTerm == $limit) { | ||
| + | setStopEarly(true); | ||
|
Contributor
shall we do this after
Member
Author
won't we call
Contributor
Member
Author
Oh, I see. And I think
Member
Author
Actually as I'm just looking at the query again, there should not be a The cases having But for safety, I think I will also move this after |
||
| + | } | ||
| | } | ||
| """.stripMargin | ||
| } | ||
|
|
||
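The effect of moving the flag update out of the `else` branch can be seen in a simplified loop. This is a sketch under stated assumptions, not the code Spark generates: `LimitConsumeSketch` and both method names are hypothetical, and the loop that checks the flag before fetching each row is an assumed shape. With the `else` branch, the flag is only set when a row beyond the limit arrives; with the equality check, it is set as soon as the last allowed row has been consumed.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical comparison of the removed and the added limit logic.
class LimitConsumeSketch {
    // Removed variant: the flag is set in the else branch.
    static int pulledWithElseBranch(Iterator<Integer> input, int limit) {
        boolean stopEarly = false;
        int count = 0;
        int pulled = 0;
        while (!stopEarly && input.hasNext()) {
            input.next();
            pulled++;
            if (count < limit) {
                count += 1;          // the row would be passed downstream here
            } else {
                stopEarly = true;    // only reached by a row past the limit
            }
        }
        return pulled;
    }

    // Added variant: the flag is set once the count equals the limit.
    static int pulledWithEqualityCheck(Iterator<Integer> input, int limit) {
        boolean stopEarly = false;
        int count = 0;
        int pulled = 0;
        while (!stopEarly && input.hasNext()) {
            input.next();
            pulled++;
            if (count < limit) {
                count += 1;          // the row would be passed downstream here
                if (count == limit) {
                    stopEarly = true;
                }
            }
        }
        return pulled;
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(1, 2, 3, 4, 5);
        System.out.println(pulledWithElseBranch(rows.iterator(), 2));    // prints 3
        System.out.println(pulledWithEqualityCheck(rows.iterator(), 2)); // prints 2
    }
}
```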
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -556,7 +556,7 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext { | |
| Seq(Row(1, 2, Seq("a", "b")), Row(3, 2, Seq("c", "c", "d")))) | ||
| } | ||
|
|
||
| - test("SPARK-18004 limit + aggregates") { | ||
| + test("SPARK-18528 limit + aggregates") { | ||
|
Member
Author
The previous JIRA number was wrong. |
||
| val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("id", "value") | ||
| val limit2Df = df.limit(2) | ||
| checkAnswer( | ||
|
|
||
What if there are 2 limits in the query?
I've added a test for 2 limits. When either of the 2 limits sets isStopEarly, I think the execution should be stopped. Is there any case where that does not hold?
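The two-limit case can be sketched with a single shared flag. This is an illustrative model, not the test added in the PR: `TwoLimitsSketch`, its fields, and the even-number filter placed between the limits are all hypothetical. Whichever limit reaches its count first sets the flag, and the input loop stops.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical pipeline: inner limit -> filter (even rows) -> outer limit,
// all sharing one stop-early flag.
class TwoLimitsSketch {
    private boolean isStopEarly = false;
    private int innerCount = 0;
    private int outerCount = 0;
    int pulled = 0;
    final List<Integer> output = new ArrayList<>();

    void run(Iterator<Integer> input, int innerLimit, int outerLimit) {
        while (!isStopEarly && input.hasNext()) {
            int row = input.next();
            pulled++;
            if (innerCount < innerLimit) {
                innerCount++;
                // Filter sitting between the two limits.
                if (row % 2 == 0 && outerCount < outerLimit) {
                    outerCount++;
                    output.add(row);
                    if (outerCount == outerLimit) {
                        isStopEarly = true;   // outer limit reached
                    }
                }
                if (innerCount == innerLimit) {
                    isStopEarly = true;       // inner limit reached
                }
            }
        }
    }

    public static void main(String[] args) {
        TwoLimitsSketch a = new TwoLimitsSketch();
        a.run(IntStream.rangeClosed(1, 10).boxed().iterator(), 5, 2);
        System.out.println(a.output + " pulled=" + a.pulled); // prints "[2, 4] pulled=4"

        TwoLimitsSketch b = new TwoLimitsSketch();
        b.run(IntStream.rangeClosed(1, 10).boxed().iterator(), 3, 5);
        System.out.println(b.output + " pulled=" + b.pulled); // prints "[2] pulled=3"
    }
}
```

In the first run the outer limit stops the loop; in the second the inner limit does. In both, no further input is needed once either limit is satisfied, which matches the reasoning in the comment above.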