[SPARK-14838][SQL] Set default size for ObjectType to avoid failure when estimating sizeInBytes in ObjectProducer #12599
Conversation
def unapply(plan: LogicalPlan): Option[LogicalPlan] = {
  if (plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold) {
    // We can't estimate the size of ObjectType
    if (plan.find(_.isInstanceOf[ObjectProducer]).isDefined) {
what's the size currently?
ObjectType simply throws an exception if we call its defaultSize.
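A minimal sketch of the failure mode being discussed (simplified names, not Spark's actual classes): an object type whose defaultSize throws, versus one patched to return a fixed default so statistics estimation can proceed. The type and object names here are hypothetical.

```scala
// Hypothetical sketch, not Spark's actual DataType hierarchy.
abstract class SketchDataType {
  def defaultSize: Int
}

// Before the fix: calling defaultSize on an object type throws.
final case class ThrowingObjectType(cls: Class[_]) extends SketchDataType {
  def defaultSize: Int =
    throw new UnsupportedOperationException("No size estimation available for objects.")
}

// After the fix: a conservative fixed default (4 KB, as settled on later in
// this thread) keeps the planner from crashing.
final case class PatchedObjectType(cls: Class[_]) extends SketchDataType {
  def defaultSize: Int = 4096
}

object DefaultSizeDemo {
  def main(args: Array[String]): Unit = {
    try {
      ThrowingObjectType(classOf[String]).defaultSize
    } catch {
      case e: UnsupportedOperationException => println("before fix: " + e.getMessage)
    }
    println("after fix: " + PatchedObjectType(classOf[String]).defaultSize)
  }
}
```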
ObjectProducer does not always produce objects; think about the int encoder. We should check whether the output is ObjectType.
Makes sense. Let me add the check too.
This is a special case, right? We face the same issue whenever we calculate the statistics values.
Can you construct a query that can trigger this bug?

@cloud-fan Just added.

After thinking about it, it is better to implement …

@cloud-fan That does not help. The parent …

We need to change the implementation of …

@cloud-fan Sounds good.

Let me try it...
The object operators are really special; they break the contract that an operator always produces unsafe rows, so their usage is quite limited. Generally speaking, an …

Yea I see.

@cloud-fan Is there a guarantee that an …
How about this plan? I am still unable to catch your main point.

When we calculate the statistics of …

@gatorsmile Look at the …

@viirya Nope. Actually, I did that before. It does not work. The issue is that its parent node's statistics calculation triggers the exception.
The problem is that the parent node calls:

val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
val outputRowSize = output.map(_.dataType.defaultSize).sum + 8

Thus, we should check the dataType here.
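The quoted row-size computation can be sketched standalone (the `SketchAttribute` type here is hypothetical, standing in for Spark's attribute class): sum each output column's default size, then add 8 bytes of per-row overhead.

```scala
// Standalone sketch of the per-row size estimate quoted above.
final case class SketchAttribute(name: String, defaultSize: Int)

object RowSizeDemo {
  // Sum of the columns' default sizes, plus 8 bytes of row overhead.
  def rowSize(output: Seq[SketchAttribute]): Long =
    output.map(_.defaultSize.toLong).sum + 8

  def main(args: Array[String]): Unit = {
    val childOutput = Seq(SketchAttribute("id", 4), SketchAttribute("score", 8))
    println(rowSize(childOutput)) // 4 + 8 + 8 = 20
  }
}
```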
@viirya, yea you are right. @gatorsmile I may have misunderstood your point. What do you mean by …

Test build #56647 has finished for PR 12599 at commit …

@cloud-fan If we do not produce objects, it should work. Otherwise, we will hit the exception when the parent node calculates the statistics: spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (line 307 in da88592)
Previously, I just simply used the child's …

Test build #56650 has finished for PR 12599 at commit …
// We can't estimate the size of ObjectType. We implement statistics here to avoid
// directly estimate any child plan which produces domain objects as output.
override def statistics: Statistics = {
  if (child.output.find(_.dataType.isInstanceOf[ObjectType]).isDefined) {
nit: we can just do child.output.head.isInstanceOf[ObjectType]; this is guaranteed in:
trait ObjectConsumer extends UnaryNode {
assert(child.output.length == 1)
...
oh. yes.
// directly estimate any child plan which produces domain objects as output.
override def statistics: Statistics = {
  if (child.output.head.dataType.isInstanceOf[ObjectType]) {
    Statistics(sizeInBytes = Long.MaxValue)
bring back this discussion: #12599 (comment)
We can calculate the row size by this.output.map(_.dataType), can't we?
So your point is to store numRows instead of sizeInBytes in Statistics? Is there any benefit?
One reason I can think of is that we need to manipulate sizeInBytes directly in the statistics method of some logical plans, such as summing up children's sizeInBytes. So it is more convenient?
Or should we go deeper, find the first child that doesn't output objects, and take its statistics? Returning Long.MaxValue means we can't broadcast join a plan that contains object operators, which is bad for Dataset.
If the difference in the estimated sizeInBytes is acceptable, I think we can do it.
I updated the logic here. Now it looks for an underlying logical plan that can be used to construct useful statistics.
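The "look for an underlying plan with usable statistics" idea can be sketched roughly as follows (hypothetical `SketchPlan` type and `usableSize` helper, not the actual Spark implementation): descend until we reach a node whose output carries no domain objects, and fall back to a conservative maximum if none exists.

```scala
// Hypothetical sketch of descending to a plan whose statistics are usable.
final case class SketchPlan(
    name: String,
    outputsObjects: Boolean,
    sizeInBytes: Long,
    children: Seq[SketchPlan] = Nil)

object UnderlyingStatsDemo {
  // Use this node's estimate if it outputs no objects; otherwise combine the
  // children's usable estimates, falling back to Long.MaxValue at the leaves.
  def usableSize(p: SketchPlan): Long =
    if (!p.outputsObjects) p.sizeInBytes
    else p.children.map(usableSize).reduceOption(_ + _).getOrElse(Long.MaxValue)

  def main(args: Array[String]): Unit = {
    val scan = SketchPlan("scan", outputsObjects = false, sizeInBytes = 1024L)
    val producer =
      SketchPlan("objectProducer", outputsObjects = true, sizeInBytes = 0L, children = Seq(scan))
    println(usableSize(producer)) // falls through to the scan's estimate
  }
}
```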
Test build #56665 has finished for PR 12599 at commit …

Test build #56669 has finished for PR 12599 at commit …
// directly estimate any child plan which produces domain objects as output.
override def statistics: Statistics = {
  if (child.output.head.dataType.isInstanceOf[ObjectType]) {
    val underlyingPlan = child.find { p =>
Sure.
Actually, for the case of MapGroups, I use the default way to calculate the sizeInBytes.
I would not revert to Long.MaxValue immediately; let's see others' comments first.
Can we just have a default size (for example, 4k) for ObjectType?
Since we will have a better estimation on ObjectConsumer, the default size of ObjectType does not matter.
The danger of a default size for ObjectType is that it may underestimate the size of the domain object output. Then we might broadcast a plan that is actually big.
I may not understand what "we will have better estimation on ObjectConsumer" means.
The ObjectProducer always sits in the middle of the query plan (especially for joins); the direct children of a join can't be an ObjectProducer.
Think of three operators: SQL operator -> ObjectProducer -> ObjectConsumer (produces UnsafeRow). The data size of a logical plan depends on the number of rows and the size of each row, and the default size of ObjectType only affects the size of a row. The estimation for ObjectConsumer depends only on the number of rows from ObjectProducer and the size of the row it produces itself, which means the size of the object will NOT change the size of ObjectConsumer.
Since we can have a better estimation on ObjectConsumer, the estimation of ObjectProducer does not matter (the number of rows matters, but the size of a row does not).
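A numeric sketch of this argument, with hypothetical numbers (the 4096 default, the row counts, and the `CancellationDemo` name are all illustrative): the consumer's size estimate is driven by the recovered row count, so the object default size used inside the producer's estimate cancels out.

```scala
// Numeric sketch: the object default size cancels out of the consumer's estimate.
object CancellationDemo {
  val objectDefaultSize = 4096L               // assumed ObjectType default
  val producerRows = 1000L
  val producerRowSize = objectDefaultSize + 8 // one object column + row overhead
  val producerSizeInBytes = producerRows * producerRowSize

  // The consumer recovers the row count, then rescales by its own row size.
  val estimatedRows = producerSizeInBytes / producerRowSize
  val consumerRowSize = 16L + 8L              // e.g. two longs + row overhead
  val consumerSizeInBytes = estimatedRows * consumerRowSize

  def main(args: Array[String]): Unit = {
    println(estimatedRows)       // 1000, independent of the 4096 choice
    println(consumerSizeInBytes) // 24000
  }
}
```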
+1 for the 4k default size
@davies Thanks. That makes sense.
Test build #56687 has finished for PR 12599 at commit …

Test build #56690 has finished for PR 12599 at commit …

Test build #56820 has finished for PR 12599 at commit …
checkDataset(wideDF.map(_.getLong(0)), 0L until 10 : _*)
}

test("Estimate size on ObjectProducer will cause failure") {
the test case name is wrong?
Fixed, thanks. Please see if the new name is more appropriate.
Test build #56825 has finished for PR 12599 at commit …

LGTM, …

Test FAILed.

Unrelated failure. I think it is ok. Thanks.
What changes were proposed in this pull request?

We have logical plans that produce domain objects which are ObjectType. As we can't estimate the size of ObjectType, we throw an UnsupportedOperationException if trying to do that. We should set a default size for ObjectType to avoid this failure.

How was this patch tested?

DatasetSuite.