
Conversation

@chenghao-intel
Contributor

When the df.cache() method is called, the withCachedData of QueryExecution has already been created, which means it will not look up the cached tables when an action method is called afterward.

Replacing the lazy variable with a method will fix this bug.
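To illustrate the difference, here is a self-contained sketch (not Spark's actual QueryExecution; the class, the cachedTables set, and the field names are all stand-ins) showing how a lazy val keeps the snapshot taken at its first evaluation, while a def picks up tables cached later:

import scala.collection.mutable

object LazyValVsDef extends App {
  // Stand-in for the CacheManager's registry of cached plans.
  val cachedTables = mutable.Set.empty[String]

  class QueryExecutionLike(table: String) {
    // lazy val: evaluated once, so a cache() call made after the first access is never seen.
    lazy val withCachedDataLazy: Boolean = cachedTables.contains(table)
    // def: re-evaluated on every access, so it observes tables cached later.
    def withCachedDataDef: Boolean = cachedTables.contains(table)
  }

  val qe = new QueryExecutionLike("t")
  println(qe.withCachedDataLazy) // false: evaluated before anything is cached
  cachedTables += "t"            // stands in for df.cache()
  println(qe.withCachedDataLazy) // still false: the lazy val kept its first snapshot
  println(qe.withCachedDataDef)  // true: recomputed, so it sees the cached table
}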

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30963 has started for PR 5714 at commit e2c4298.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30963 has finished for PR 5714 at commit e2c4298.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.


This test could be simpler: there is no need for the additional dfWithIdAndSquare column or the square function. If we were not reading from the cache, two consecutive invocations of dfWithId.collect would yield different IDs, so comparing those is enough and keeps the test simpler.
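A sketch of that simplified test (hedged: a sqlContext in scope, sqlContext.range, and rand() as the source of non-determinism are assumptions here, not the exact code from the PR):

import org.apache.spark.sql.functions.rand

val dfWithId = sqlContext.range(0, 10).withColumn("rnd", rand())
dfWithId.cache()
val first = dfWithId.collect()
val second = dfWithId.collect()
// If collect() ignored the cache, the non-deterministic column would differ
// between the two invocations; with the fix, both arrays must be identical.
assert(first.sameElements(second))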

@chenghao-intel
Contributor Author

Thank you @harsha2010, I've updated the code.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31087 has started for PR 5714 at commit b876ce3.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31087 has finished for PR 5714 at commit b876ce3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

Contributor

Does this conflict with #5265?

Contributor Author

Yes, thanks for the reminder; I will submit code using the other approach.

Contributor

@liancheng why do we need to make rdd return the same instance?

Contributor Author

@rxin @liancheng The master code seems quite reasonable to me, particularly for RunnableCommand; we don't want to run commands like the following twice:

sql("CREATE TABLE mytest AS SELECT * FROM src")
sql("CREATE TABLE mytest AS SELECT * FROM src").collect()

I've updated a lot of code in the unit tests, but now I am quite hesitant about using a def instead of the lazy val.

Any ideas?

Contributor

The snippet you just mentioned should be OK; even in the master code, the CREATE TABLE command is also executed twice there. But the following doesn't execute it twice:

val df = sql("CREATE TABLE ...")
df.collect()
df.collect()

The command is executed while constructing the result DataFrame.

Contributor Author

Sorry @liancheng for the confusion. We have two approaches for this fix; however, the approach this PR takes will impact existing code, as in the example I gave above.

sql("CREATE TABLE mytest AS SELECT * FROM src").collect()

The CTAS will run twice and will throw an exception like TableAlreadyExisted.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31099 has started for PR 5714 at commit c0dc28d.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31099 has finished for PR 5714 at commit c0dc28d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

Contributor

Can this function call the other persist?

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31127 has started for PR 5714 at commit a9bf8c1.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31127 has finished for PR 5714 at commit a9bf8c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@rxin
Contributor

rxin commented Apr 28, 2015

So I talked with @marmbrus offline about this, and we actually think it might be OK to just change all the lazy vals to defs (but keep analyzed as a lazy val), as you originally did, and not worry about performance for the moment.

The problem with the current approach is that cache no longer mutates the underlying DataFrame, so users holding a reference to the old DataFrame will see the same problem.
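In outline, that suggestion looks something like this sketch (a toy class with strings standing in for logical plans, not the actual QueryExecution; the analyzed/withCachedData/optimizedPlan names merely mirror its fields):

class QueryExecutionSketch(logical: String, cache: scala.collection.mutable.Map[String, String]) {
  // Kept as a lazy val: analysis is stable, so memoizing it is safe.
  lazy val analyzed: String = s"analyzed($logical)"
  // Turned into defs: these must observe cache entries added after construction.
  def withCachedData: String = cache.getOrElse(analyzed, analyzed)
  def optimizedPlan: String = s"optimized($withCachedData)"
  def executedPlan: String = s"planned($optimizedPlan)"
}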

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31209 has started for PR 5714 at commit 3b27c4f.

Contributor Author

Without this, the JDBCSuite "test DATE types" test will fail, because we take the INTEGER for the DATE type as its internal representation.
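For context, Spark SQL's internal representation of a DATE value is an integer counting days since the Unix epoch; a minimal JDK-only sketch of that mapping (illustration only, using java.time from Java 8, not the conversion helper Spark itself uses):

import java.sql.Date
import java.time.LocalDate

def daysToDate(days: Int): Date = Date.valueOf(LocalDate.ofEpochDay(days.toLong))
def dateToDays(date: Date): Int = date.toLocalDate.toEpochDay.toInt

daysToDate(0)                          // 1970-01-01
dateToDays(Date.valueOf("2015-04-29")) // 16554, i.e. days since 1970-01-01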

Contributor

Given the uncertainty about this PR at the moment, can you submit a separate PR to fix the DATE issue?

Contributor Author

Yeah, I will do that.

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31209 has finished for PR 5714 at commit 3b27c4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31209/
Test FAILed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31248 has started for PR 5714 at commit 2005c94.

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31248 has finished for PR 5714 at commit 2005c94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31248/
Test FAILed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 1, 2015

Test build #33892 has started for PR 5714 at commit 0b296ea.

@SparkQA

SparkQA commented Jun 1, 2015

Test build #33892 has finished for PR 5714 at commit 0b296ea.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 1, 2015

Test build #33895 has started for PR 5714 at commit 58ea8aa.

@SparkQA

SparkQA commented Jun 1, 2015

Test build #33895 has finished for PR 5714 at commit 58ea8aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 8, 2015

Test build #34401 has started for PR 5714 at commit 58ea8aa.

@SparkQA

SparkQA commented Jun 8, 2015

Test build #34401 has finished for PR 5714 at commit 58ea8aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@chenghao-intel
Contributor Author

retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 8, 2015

Test build #34436 has started for PR 5714 at commit 58ea8aa.

@SparkQA

SparkQA commented Jun 8, 2015

Test build #34436 has finished for PR 5714 at commit 58ea8aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@marmbrus
Contributor

Thanks! Merging to master.

@asfgit closed this in 767cc94 Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
… after cache()

When the df.cache() method is called, the `withCachedData` of `QueryExecution` has already been created, which means it will not look up the cached tables when an action method is called afterward.

Author: Cheng Hao <[email protected]>

Closes apache#5714 from chenghao-intel/SPARK-7158 and squashes the following commits:

58ea8aa [Cheng Hao] style issue
2bf740f [Cheng Hao] create new QueryExecution instance for CacheManager
a5647d9 [Cheng Hao] hide the queryExecution of DataFrame
fbfd3c5 [Cheng Hao] make the DataFrame.queryExecution mutable for cache/persist/unpersist
@chenghao-intel deleted the SPARK-7158 branch July 2, 2015 08:33