-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-7158] [SQL] Fix bug of cached data cannot be used in collect() after cache() #5714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #30963 has started for PR 5714 at commit |
|
Test build #30963 has finished for PR 5714 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test could be simpler... no need for the additional dfWithIdAndSquare column or the square function...
if we were not reading from cache, two consecutive invocations of dfWithId.collect will yield different IDs.
Makes the test simpler.
|
Thank you @harsha2010 , I've updated the code. |
|
Test build #31087 has started for PR 5714 at commit |
|
Test build #31087 has finished for PR 5714 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this conflict with #5265?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for reminding, I will submit code with the another approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liancheng why do we need to make rdd returning the same instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin @liancheng I think the master code is quite reasonable to me, particularly for the RunnableCommand, we don't want to run them twice.
e.g.
sql("CREATE TABLE mytest AS SELECT * FROM src")
sql("CREATE TABLE mytest AS SELECT * FROM src").collect()
I've updated lots of code in the unit test, but now I am quite hesitate with the approach of def instead of the lazy val.
Any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The snippet you just mentioned should be OK. Even in master code, the CREATE TABLE command will also be executed twice. But the following doesn't:
val df = sql("CREAT TABLE ...")
df.collect()
df.collect()The command is executed while constructing the result DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, @liancheng for the confusing. We have 2 approaches for this fixing, however, the approach this PR takes will impact the existed code as the examples that I gave above.
sql("CREATE TABLE mytest AS SELECT * FROM src").collect()
The CTAS will run twice, and will throws exception like TableAlreadyExisted.
b876ce3 to
c0dc28d
Compare
|
Test build #31099 has started for PR 5714 at commit |
|
Test build #31099 has finished for PR 5714 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this function call the other persist?
|
Test build #31127 has started for PR 5714 at commit |
|
Test build #31127 has finished for PR 5714 at commit
|
|
So I talked with @marmbrus offline about this, and we actually think it might be ok to just change all the lazy vals to defs (but keep the analyzed one as lazy val), as you originally did, and not worry about performance at the moment. The problem with the current approach is that cache no longer mutates the underlying dataframe, and users relying on the old dataframe reference will see the same problem. |
a9bf8c1 to
3b27c4f
Compare
|
Merged build triggered. |
|
Merged build started. |
|
Test build #31209 has started for PR 5714 at commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this, the JDBCSuite.test DATE types will fail, because we take the INTERGER for DATE type as the internal representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given the uncertainty about this pr at the moment, can you submit a separate pr to fix the date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I will do that.
|
Test build #31209 has finished for PR 5714 at commit
|
|
Merged build finished. Test FAILed. |
|
Test FAILed. |
|
Merged build triggered. |
|
Merged build started. |
|
Test build #31248 has started for PR 5714 at commit |
|
Test build #31248 has finished for PR 5714 at commit
|
|
Merged build finished. Test FAILed. |
|
Test FAILed. |
|
Merged build triggered. |
|
Merged build started. |
|
Test build #33892 has started for PR 5714 at commit |
|
Test build #33892 has finished for PR 5714 at commit
|
|
Merged build finished. Test FAILed. |
0b296ea to
58ea8aa
Compare
|
Merged build triggered. |
|
Merged build started. |
|
Test build #33895 has started for PR 5714 at commit |
|
Test build #33895 has finished for PR 5714 at commit
|
|
Merged build finished. Test PASSed. |
|
Merged build triggered. |
|
Merged build started. |
|
Test build #34401 has started for PR 5714 at commit |
|
Test build #34401 has finished for PR 5714 at commit
|
|
Merged build finished. Test FAILed. |
|
retest this please. |
|
Merged build triggered. |
|
Merged build started. |
|
Test build #34436 has started for PR 5714 at commit |
|
Test build #34436 has finished for PR 5714 at commit
|
|
Merged build finished. Test PASSed. |
|
Thanks! Merging to master. |
… after cache() When df.cache() method called, the `withCachedData` of `QueryExecution` has been created, which mean it will not look up the cached tables when action method called afterward. Author: Cheng Hao <[email protected]> Closes apache#5714 from chenghao-intel/SPARK-7158 and squashes the following commits: 58ea8aa [Cheng Hao] style issue 2bf740f [Cheng Hao] create new QueryExecution instance for CacheManager a5647d9 [Cheng Hao] hide the queryExecution of DataFrame fbfd3c5 [Cheng Hao] make the DataFrame.queryExecution mutable for cache/persist/unpersist
When df.cache() method called, the
withCachedDataofQueryExecutionhas been created, which mean it will not look up the cached tables when action method called afterward.Replace the lazy variable with method will fix this bug.