[SPARK-7158] [SQL] Fix bug of cached data cannot be used in collect() after cache() #5714

chenghao-intel · 2015-04-27T07:41:12Z

When df.cache() is called, the withCachedData of QueryExecution has already been created, which means it will not look up the cached tables when an action method is called afterward.

Replacing the lazy val with a method fixes this bug.
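
To make the failure mode concrete, here is a minimal, Spark-free sketch of the difference between a lazy val and a def. Everything in it is a toy assumption: ToyCacheManager, LazyExecution and DefExecution are hypothetical names, not Spark classes, and only the member name withCachedData is borrowed from the description above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a cache manager: maps a plan to its cached replacement.
object ToyCacheManager {
  private val cached = mutable.Map.empty[String, String]
  def cacheQuery(plan: String): Unit = { cached(plan) = s"InMemoryRelation($plan)" }
  def useCachedData(plan: String): String = cached.getOrElse(plan, plan)
  def clear(): Unit = cached.clear()
}

// Resolved once, on first access: a later cacheQuery() is never observed.
class LazyExecution(plan: String) {
  lazy val withCachedData: String = ToyCacheManager.useCachedData(plan)
}

// Resolved on every access: always reflects the current cache state.
class DefExecution(plan: String) {
  def withCachedData: String = ToyCacheManager.useCachedData(plan)
}

object LazyValVsDef {
  // Returns what each variant sees after the plan has been registered as cached.
  def run(): (String, String) = {
    ToyCacheManager.clear()
    val lazyQe = new LazyExecution("Scan(src)")
    val defQe = new DefExecution("Scan(src)")
    // Touch both members before anything is cached (this forces the lazy val).
    require(lazyQe.withCachedData == "Scan(src)")
    require(defQe.withCachedData == "Scan(src)")
    ToyCacheManager.cacheQuery("Scan(src)")
    (lazyQe.withCachedData, defQe.withCachedData)
  }

  def main(args: Array[String]): Unit = {
    val (fromLazyVal, fromDef) = run()
    println(s"lazy val sees: $fromLazyVal")
    println(s"def sees:      $fromDef")
  }
}
```

In this toy model the lazy val keeps returning the uncached plan because it was forced before the cache entry existed, while the def picks up the cached relation on the next access.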

SparkQA · 2015-04-27T08:18:40Z

Test build #30963 has started for PR 5714 at commit e2c4298.

SparkQA · 2015-04-27T08:20:13Z

Test build #30963 has finished for PR 5714 at commit e2c4298.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

harsha2010 · 2015-04-27T15:13:30Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

This test could be simpler: there is no need for the additional dfWithIdAndSquare column or the square function.
If we were not reading from the cache, two consecutive invocations of dfWithId.collect would yield different IDs, so comparing the two results is enough.
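
The idea behind this simplification can be sketched without Spark. The ToyFrame class below is a hypothetical stand-in for a DataFrame with a randomly generated id column, and its cache() materializes eagerly, which is a simplification made for this sketch and not a statement about how Spark caches data.

```scala
import java.util.UUID

// Hypothetical stand-in for a DataFrame with a non-deterministic "id" column.
class ToyFrame(indices: Seq[Int]) {
  private var cachedRows: Option[Seq[(Int, String)]] = None

  // Generates a fresh random ID per row on every evaluation.
  private def compute(): Seq[(Int, String)] =
    indices.map(i => (i, UUID.randomUUID().toString))

  // Materializes the rows once; later collect() calls reuse them.
  def cache(): ToyFrame = {
    if (cachedRows.isEmpty) cachedRows = Some(compute())
    this
  }

  def collect(): Seq[(Int, String)] = cachedRows.getOrElse(compute())
}

object CacheCheck {
  // True when two consecutive collect() calls return identical rows.
  def stableAcrossCollects(frame: ToyFrame): Boolean =
    frame.collect() == frame.collect()

  def main(args: Array[String]): Unit = {
    println(stableAcrossCollects(new ToyFrame(Seq(1, 2, 3))))         // not cached
    println(stableAcrossCollects(new ToyFrame(Seq(1, 2, 3)).cache())) // cached
  }
}
```

Comparing two consecutive collect() results is therefore enough to tell whether the cached data was used: equal IDs mean the rows came from the cache, differing IDs mean they were recomputed.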

chenghao-intel · 2015-04-28T00:30:50Z

Thank you @harsha2010, I've updated the code.

SparkQA · 2015-04-28T00:33:52Z

Test build #31087 has started for PR 5714 at commit b876ce3.

SparkQA · 2015-04-28T01:14:36Z

Test build #31087 has finished for PR 5714 at commit b876ce3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

cloud-fan · 2015-04-28T01:20:17Z

sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala

Does this conflict with #5265?

Yes, thanks for the reminder. I will submit code with another approach.

@liancheng why do we need to make rdd return the same instance?

@rxin @liancheng The master code seems quite reasonable to me, particularly for RunnableCommand: we don't want to run commands twice.
e.g.

sql("CREATE TABLE mytest AS SELECT * FROM src")
sql("CREATE TABLE mytest AS SELECT * FROM src").collect()

I've updated a lot of code in the unit tests, but now I am quite hesitant about the approach of using def instead of lazy val.

Any ideas?

The snippet you just mentioned should be OK. Even in master code, the CREATE TABLE command will also be executed twice. But the following doesn't:

val df = sql("CREATE TABLE ...")
df.collect()
df.collect()

The command is executed while constructing the result DataFrame.

Sorry for the confusion, @liancheng. We have two approaches for this fix; however, the approach this PR takes will impact existing code, as in the example I gave above.

sql("CREATE TABLE mytest AS SELECT * FROM src").collect()

The CTAS will run twice and will throw an exception like TableAlreadyExisted.
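
The concern in this thread can be illustrated with a small, Spark-free sketch. All names here (ToyCatalog, MemoizedCommand, RecomputedCommand) are hypothetical; the sketch only assumes, as the comments above describe, that a command runs once while the DataFrame is constructed and that collect() then asks for its result again.

```scala
import scala.collection.mutable
import scala.util.Try

// Hypothetical catalog that rejects creating the same table twice.
object ToyCatalog {
  private val tables = mutable.Set.empty[String]
  def create(name: String): Unit = {
    if (!tables.add(name)) throw new IllegalStateException(s"Table $name already exists")
  }
  def reset(): Unit = tables.clear()
}

// Result is memoized: the side effect runs at most once.
class MemoizedCommand(table: String) {
  lazy val sideEffectResult: Seq[String] = { ToyCatalog.create(table); Seq.empty }
}

// Result is recomputed: the side effect runs on every access.
class RecomputedCommand(table: String) {
  def sideEffectResult: Seq[String] = { ToyCatalog.create(table); Seq.empty }
}

object CommandTwice {
  // First access models constructing the DataFrame, second models collect().
  private def constructThenCollect(result: () => Seq[String]): Boolean = {
    result()
    Try(result()).isSuccess
  }

  // Returns (memoized variant succeeded, recomputed variant succeeded).
  def run(): (Boolean, Boolean) = {
    ToyCatalog.reset()
    val memo = new MemoizedCommand("mytest")
    val memoOk = constructThenCollect(() => memo.sideEffectResult)
    ToyCatalog.reset()
    val recomputed = new RecomputedCommand("mytest")
    val recomputedOk = constructThenCollect(() => recomputed.sideEffectResult)
    (memoOk, recomputedOk)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In this model the memoized variant survives the second access, while the recomputed variant tries to create the table again and fails, which mirrors the worry about switching every lazy val to a def.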

SparkQA · 2015-04-28T02:03:52Z

Test build #31099 has started for PR 5714 at commit c0dc28d.

SparkQA · 2015-04-28T04:16:01Z

Test build #31099 has finished for PR 5714 at commit c0dc28d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

rxin · 2015-04-28T06:20:48Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

can this function call the other persist?

SparkQA · 2015-04-28T07:12:48Z

Test build #31127 has started for PR 5714 at commit a9bf8c1.

SparkQA · 2015-04-28T08:56:13Z

Test build #31127 has finished for PR 5714 at commit a9bf8c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

rxin · 2015-04-28T22:44:10Z

So I talked with @marmbrus offline about this, and we actually think it might be ok to just change all the lazy vals to defs (but keep the analyzed one as lazy val), as you originally did, and not worry about performance at the moment.

The problem with the current approach is that cache no longer mutates the underlying dataframe, and users relying on the old dataframe reference will see the same problem.

AmplabJenkins · 2015-04-29T02:07:13Z

Merged build triggered.

AmplabJenkins · 2015-04-29T02:07:19Z

Merged build started.

SparkQA · 2015-04-29T02:07:58Z

Test build #31209 has started for PR 5714 at commit 3b27c4f.

chenghao-intel · 2015-04-29T02:20:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SpecificMutableRow.scala

Without this, the "test DATE types" case in JDBCSuite will fail, because we use an INTEGER as the internal representation of the DATE type.

Given the uncertainty about this PR at the moment, can you submit a separate PR to fix the date?

Yeah, I will do that.

SparkQA · 2015-04-29T03:06:52Z

Test build #31209 has finished for PR 5714 at commit 3b27c4f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

AmplabJenkins · 2015-04-29T03:06:56Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-04-29T03:06:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31209/
Test FAILed.

AmplabJenkins · 2015-04-29T05:52:11Z

Merged build triggered.

AmplabJenkins · 2015-04-29T05:52:21Z

Merged build started.

SparkQA · 2015-04-29T05:53:49Z

Test build #31248 has started for PR 5714 at commit 2005c94.

SparkQA · 2015-04-29T07:00:23Z

Test build #31248 has finished for PR 5714 at commit 2005c94.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

AmplabJenkins · 2015-04-29T07:00:27Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-04-29T07:00:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31248/
Test FAILed.

AmplabJenkins · 2015-06-01T14:27:12Z

Merged build triggered.

AmplabJenkins · 2015-06-01T14:27:18Z

Merged build started.

SparkQA · 2015-06-01T14:29:24Z

Test build #33892 has started for PR 5714 at commit 0b296ea.

SparkQA · 2015-06-01T14:31:01Z

Test build #33892 has finished for PR 5714 at commit 0b296ea.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-06-01T14:31:01Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-06-01T15:02:13Z

Merged build triggered.

AmplabJenkins · 2015-06-01T15:02:23Z

Merged build started.

SparkQA · 2015-06-01T15:02:55Z

Test build #33895 has started for PR 5714 at commit 58ea8aa.

SparkQA · 2015-06-01T16:56:02Z

Test build #33895 has finished for PR 5714 at commit 58ea8aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-06-01T16:56:07Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-06-08T00:12:12Z

Merged build triggered.

AmplabJenkins · 2015-06-08T00:12:18Z

Merged build started.

SparkQA · 2015-06-08T00:14:43Z

Test build #34401 has started for PR 5714 at commit 58ea8aa.

SparkQA · 2015-06-08T01:12:18Z

Test build #34401 has finished for PR 5714 at commit 58ea8aa.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-06-08T01:12:22Z

Merged build finished. Test FAILed.

chenghao-intel · 2015-06-08T13:29:30Z

retest this please.

AmplabJenkins · 2015-06-08T13:32:12Z

Merged build triggered.

AmplabJenkins · 2015-06-08T13:32:22Z

Merged build started.

SparkQA · 2015-06-08T13:34:52Z

Test build #34436 has started for PR 5714 at commit 58ea8aa.

SparkQA · 2015-06-08T15:16:35Z

Test build #34436 has finished for PR 5714 at commit 58ea8aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-06-08T15:16:39Z

Merged build finished. Test PASSed.

marmbrus · 2015-06-12T01:02:07Z

Thanks! Merging to master.

… after cache()

When df.cache() method called, the `withCachedData` of `QueryExecution` has been created, which mean it will not look up the cached tables when action method called afterward.

Author: Cheng Hao <[email protected]>

Closes apache#5714 from chenghao-intel/SPARK-7158 and squashes the following commits:

58ea8aa [Cheng Hao] style issue
2bf740f [Cheng Hao] create new QueryExecution instance for CacheManager
a5647d9 [Cheng Hao] hide the queryExecution of DataFrame
fbfd3c5 [Cheng Hao] make the DataFrame.queryExecution mutable for cache/persist/unpersist

harsha2010 reviewed Apr 27, 2015

cloud-fan reviewed Apr 28, 2015

chenghao-intel force-pushed the SPARK-7158 branch from b876ce3 to c0dc28d Compare April 28, 2015 01:57

rxin reviewed Apr 28, 2015

chenghao-intel force-pushed the SPARK-7158 branch from a9bf8c1 to 3b27c4f Compare April 29, 2015 02:02

chenghao-intel reviewed Apr 29, 2015

chenghao-intel added 4 commits June 1, 2015 07:50

fbfd3c5 make the DataFrame.queryExecution mutable for cache/persist/unpersist
a5647d9 hide the queryExecution of DataFrame
2bf740f create new QueryExecution instance for CacheManager
58ea8aa style issue

chenghao-intel force-pushed the SPARK-7158 branch from 0b296ea to 58ea8aa Compare June 1, 2015 14:59

asfgit closed this in 767cc94 Jun 12, 2015

chenghao-intel deleted the SPARK-7158 branch July 2, 2015 08:33
