[SPARK-11009] [SQL] fix wrong result of Window function in cluster mode #9050
Conversation
Test build #1869 has finished for PR 9050 at commit
We have another place that creates attribute references (https://github.com/davies/spark/blob/wrong_window/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L149). Should we change that one as well?
Test build #1870 has finished for PR 9050 at commit
Test build #43505 has finished for PR 9050 at commit
Thank you for fixing it! How about we add a test in HiveSparkSubmitSuite?
@hvanhovell Yes, we will backport it to branch-1.5, so it will be fixed in 1.5.2. Let me explain the cause. Every attribute reference has an expression ID, which is expected to be unique across the whole query plan.
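A minimal sketch of why such ids can collide, assuming the ids come from a plain per-JVM counter (the object name here is illustrative, not Spark's actual class):
```
import java.util.concurrent.atomic.AtomicLong

// Illustrative stand-in for a per-JVM expression-id counter. Every JVM
// starts at 0, so an executor that creates an AttributeReference can hand
// out an id that the driver has already assigned to an unrelated attribute.
object ExprIdGenerator {
  private val curId = new AtomicLong(0L)
  def newExprId(): Long = curId.getAndIncrement()
}
```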
@yhuai
@yhuai @hvanhovell I have changed it to use BoundReference instead of Attribute.
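For context, a minimal sketch of the difference, assuming Catalyst's internal BoundReference(ordinal, dataType, nullable) constructor; this is internal API, shown only for illustration:
```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.BoundReference
import org.apache.spark.sql.types.IntegerType

// A BoundReference addresses the input row by position rather than by
// expression id, so creating one on an executor cannot collide with
// anything created on the driver.
val ref = BoundReference(0, IntegerType, nullable = false)
val row = InternalRow(42, 7)
ref.eval(row) // returns 42, the value at ordinal 0
```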
Is it true that the input row of this projection has more elements than child.output? This subtle change may not be easy to understand.
Currently, we put all the window results on the right side of child.output, and patchedWindowExpression points to them.
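To make that layout concrete, a small sketch under the stated assumption (child columns first, window results appended on the right); the variable names are hypothetical:
```
import org.apache.spark.sql.catalyst.expressions.BoundReference
import org.apache.spark.sql.types.LongType

// Hypothetical joined-row layout for a child with two columns: [A, B, rn].
val numChildColumns = 2
// The i-th window result sits at ordinal numChildColumns + i, so the final
// projection can reference it positionally instead of resolving an
// AttributeReference that was created on the executor.
val rn = BoundReference(numChildColumns + 0, LongType, nullable = false)
```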
Test build #43590 has finished for PR 9050 at commit
@davies I created a test based on the case you put in the description (yhuai@bc566fa), and it does indeed fail. How about we add it to this PR?
@yhuai Thanks, pulled it in.
Ah, sorry. Actually, can we call df3.rdd.count() to make sure we materialize all the results?
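Presumably the point is that df3.count() may be answered without fully evaluating every projected column, while counting the underlying RDD forces each row of df3, including the window-function column, to be materialized; a sketch:
```
// Counting the RDD (rather than the DataFrame) materializes every row of
// df3, so a wrong rowNumber value cannot slip past the rn < 0 filter
// unevaluated.
val materializedCount = df3.rdd.count()
```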
Test build #43618 has finished for PR 9050 at commit
Sorry, my bad. This condition should be !=. Can we run this action a few more times (say, 10) so it has a higher chance of throwing the exception when the code is broken? It should not make the test much more expensive, because the most expensive part is creating the HiveContext.
```
(1 to 10).foreach { i =>
  val count = df3.rdd.count()
  if (count != 0) {
    throw new Exception(s"df3 should have 0 rows, but $count rows were returned.")
  }
}
```
The failure should only happen the first time (when the same id exists both on the driver and on an executor). We can increase the number of partitions to increase the chance of failure.
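Since each executor JVM initializes its own id counter, more partitions means more tasks creating attribute references, and so a higher chance of a collision with a driver-side id; for example (the value 32 is arbitrary):
```
// More shuffle partitions -> more executor-side tasks creating expressions
// -> higher chance that an executor-created id collides with one from the
// driver.
sqlContext.setConf("spark.sql.shuffle.partitions", "32")
```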
Test build #43621 has finished for PR 9050 at commit
Test build #43624 has finished for PR 9050 at commit
Test build #1887 has finished for PR 9050 at commit
We have to throw an exception here. Otherwise, even if this assertion fails, the test will pass (because we are running a separate application here).
Will fix it while merging.
Currently, all window functions can sometimes generate wrong results in cluster mode. The root cause is that AttributeReference is created on executors, so its id may not be unique with respect to ids created on the driver. Here is a script that reproduces the problem (run on a local cluster):
```
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

sqlContext = HiveContext(SparkContext())
sqlContext.setConf("spark.sql.shuffle.partitions", "3")
df = sqlContext.range(1 << 20)
df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias("B"))
ws = Window.partitionBy(df2.A).orderBy(df2.B)
df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn < 0")
assert df3.count() == 0
```
Author: Davies Liu <[email protected]>
Author: Yin Huai <[email protected]>
Closes #9050 from davies/wrong_window.
Conflicts:
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala