-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17549][sql] Coalesce cached relation stats in driver. #15189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -232,4 +232,29 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { | |
| val columnTypes2 = List.fill(length2)(IntegerType) | ||
| val columnarIterator2 = GenerateColumnAccessor.generate(columnTypes2) | ||
| } | ||
|
|
||
| test("SPARK-17549: cached table size should be correctly calculated") { | ||
| val data = spark.sparkContext.parallelize(1 to 10, 5).map { i => (i, i.toLong) } | ||
| .toDF("col1", "col2") | ||
| val plan = spark.sessionState.executePlan(data.logicalPlan).sparkPlan | ||
| val cached = InMemoryRelation(true, 5, MEMORY_ONLY, plan, None) | ||
|
|
||
| // Materialize the data. | ||
| val expectedAnswer = data.collect() | ||
| checkAnswer(cached, expectedAnswer) | ||
|
|
||
| // Check that the right size was calculated. | ||
| val expectedColSizes = expectedAnswer.size * (INT.defaultSize + LONG.defaultSize) | ||
| assert(cached.statistics.sizeInBytes === expectedColSizes) | ||
|
|
||
| // Create a projection of the cached data and make sure the statistics are correct. | ||
| val projected = cached.withOutput(Seq(plan.output.last)) | ||
| assert(projected.statistics.sizeInBytes === expectedAnswer.size * LONG.defaultSize) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure if I understand the last two parts. After we cache the dataset, I am not sure if we can change the number of output columns (this test) or the data types (the next one). If we do a project on the cached dataset, we will see a project operator on top of the InMemoryRelation. I am wondering what kinds of queries can cause this problems?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I looked again at a heap dump with a couple of cached relations and you're right; I had misinterpreted the previous data. I'll remove these tests and simplify the code. Still I'd be a little more comfortable if there was an assert in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually... in that case, isn't my previous patch correct? (#15112) My worry about that patch was multiple cached relations with different outputs sharing the same accumulator. But if that doesn't happen, then that patch is enough.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @yhuai given the above, is it ok if I just revert your revert of my previous patch?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh i see. Sorry, I may have missed something. How to reproduce the problem that led us to revert the previous PR?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. let me check with @liancheng
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I double checked the code. The So, even we use
Although we may have different outputs, they are still representing the same dataset. So, seems it is fine if they have the same accumulator.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for confirming. So we should be fine with the previous patch. |
||
|
|
||
| // Create a silly projection that repeats columns of the first cached relation, and | ||
| // check that the size is calculated correctly. | ||
| val projected2 = cached.withOutput(Seq(plan.output.last, plan.output.last)) | ||
| assert(projected2.statistics.sizeInBytes === 2 * expectedAnswer.size * LONG.defaultSize) | ||
| } | ||
|
|
||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the class name explicitly say that it is for sizeInBytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to leave it generic in case other stats need to be added later, but not worries, I can change the name.