Conversation

@viirya
Member

viirya commented Apr 22, 2016

What changes were proposed in this pull request?

We have logical plans that produce domain objects, whose output type is ObjectType. Since we can't estimate the size of an ObjectType, an UnsupportedOperationException is thrown when we try to. We should set a default size for ObjectType to avoid this failure.

How was this patch tested?

DatasetSuite.
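For context, here is a minimal sketch, assuming Spark 2.x APIs and a local SparkSession, of the kind of query shape that hits this path; it is not the exact test added to DatasetSuite. mapGroups is an ObjectProducer, so its output attribute has ObjectType, and planning the join consults plan.statistics.sizeInBytes for the auto-broadcast decision, which used to throw.

    import org.apache.spark.sql.SparkSession

    object ObjectProducerSizeRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("repro").getOrCreate()
        import spark.implicits._

        val ratings = Seq((0, 3), (0, 4), (1, 3), (1, 5), (2, 4)).toDS()
        val counts = ratings
          .groupByKey(_._1)                              // AppendColumns
          .mapGroups((user, rows) => (user, rows.size))  // MapGroups: an ObjectProducer
          .toDF("user", "count")

        val users = Seq((0, "a"), (1, "b"), (2, "c")).toDF("user", "name")
        counts.join(users, "user").collect()             // triggers sizeInBytes estimation
        spark.stop()
      }
    }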

def unapply(plan: LogicalPlan): Option[LogicalPlan] = {
  if (plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold) {
    // We can't estimate the size of ObjectType
    if (plan.find(_.isInstanceOf[ObjectProducer]).isDefined) {
Contributor

what's the size currently?

Member Author

ObjectType simply throws an exception if we call its defaultSize.

Contributor

ObjectProducer does not always produce objects; think about an int encoder. We should check whether the output is ObjectType.

Member Author

Makes sense. Let me add the check too.
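A sketch of the suggested check, keeping the shape of the snippet above; the exact placement in the final patch may differ.

    // Hypothetical tightened guard: only refuse to auto-broadcast when a producer in the
    // plan actually outputs ObjectType (an int encoder, for example, does not).
    if (plan.find { p =>
          p.isInstanceOf[ObjectProducer] && p.output.exists(_.dataType.isInstanceOf[ObjectType])
        }.isDefined) {
      // skip considering this plan for automatic broadcast
    }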

Member

This is a special case, right? We face the same issue whenever we calculate the statistics values.

@cloud-fan
Contributor

Can you construct a query that can trigger this bug?

@viirya
Member Author

viirya commented Apr 22, 2016

@cloud-fan Just added.

@cloud-fan
Contributor

cloud-fan commented Apr 22, 2016

After thinking about it, isn't it better to implement statistics in SerializeFromObject? The default implementation in UnaryNode is wrong for it.

@gatorsmile
Member

@cloud-fan That does not help. The parent UnaryNode can be any type. We will face this issue when its child is SerializeFromObject.

@gatorsmile
Member

gatorsmile commented Apr 22, 2016

We need to change UnaryNode's default statistics calculation. However, the other node types that do not use this default method still face the same issue.

@viirya
Member Author

viirya commented Apr 22, 2016

@cloud-fan Sounds good.

@viirya
Member Author

viirya commented Apr 22, 2016

Let me try it...

@cloud-fan
Contributor

The object operators are really special: they break the contract that an operator always produces unsafe rows, so their usage is quite limited. Generally speaking, an ObjectProducer will always have an ObjectConsumer nearby to turn its output back into unsafe rows. So if we implement statistics for ObjectConsumer, there should be nowhere else that will estimate the size of an object type.

@viirya
Member Author

viirya commented Apr 22, 2016

Yea I see.

@viirya
Member Author

viirya commented Apr 22, 2016

@cloud-fan Is there a guarantee that an ObjectConsumer can't produce domain objects? If not, I think it is safer to implement statistics in SerializeFromObject instead of ObjectConsumer.

@gatorsmile
Member

== Optimized Logical Plan ==
Project [user#7,recommendations#48 AS prediction#77,actual#65 AS label#78]
+- Join Inner, Some((user#7 = id#64))
   :- Project [user#7,recommendations#48]
   :  +- Join Inner, Some((user#7 = id#47))
   :     :- Aggregate [user#7], [user#7]
   :     :  +- LocalRelation [user#7], [[0],[0],[0],[1],[1],[1],[2],[2],[2]]
   :     +- Project [_1#44 AS id#47,_2#45 AS recommendations#48]
   :        +- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[19] at rddToDatasetHolder at ALS.scala:335
   +- Project [_1#61 AS id#64,_2#62 AS actual#65]
      +- Filter isnotnull(_1#61)
         +- SerializeFromObject [input[0, scala.Tuple2]._1 AS _1#61,newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS _2#62]
            +- MapGroups <function2>, value#55: int, newInstance(class scala.Tuple2), [value#55], [user#7,item#8], obj#60: scala.Tuple2
               +- AppendColumns <function1>, newInstance(class scala.Tuple2), [input[0, int] AS value#55]
                  +- LocalRelation [user#7,item#8], [[0,3],[0,4],[0,5],[1,3],[1,5],[1,4],[2,3],[2,5],[2,4]]

How about this plan? I still can't quite follow your main point.

@gatorsmile
Member

When we calculate the statistics of Filter, we hit the issue caused by UnaryNode's default statistics calculation, right?

@viirya
Member Author

viirya commented Apr 22, 2016

@gatorsmile Look at the SerializeFromObject in your plan. If we implement statistics in it, we can skip estimating the size of MapGroups, which produces domain objects.

@gatorsmile
Member

@viirya Nope. Actually, I did that before. It does not work. The issue is that its parent node's statistics calculation triggers the exception.

@gatorsmile
Member

gatorsmile commented Apr 22, 2016

The problem is that the parent node calls defaultSize on its child's output.

    val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
    val outputRowSize = output.map(_.dataType.defaultSize).sum + 8

Thus, we should check the dataType here.
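For reference, the surrounding default implementation in UnaryNode at the time looked roughly like this; it is a paraphrase built around the two lines above, so treat the details as approximate.

    override def statistics: Statistics = {
      // 8 bytes of per-row overhead, so an empty schema still yields a non-zero size.
      val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
      val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
      // Assume as many rows as the child and scale its size by the row-size ratio; this is
      // where defaultSize gets evaluated for every attribute, including ObjectType ones.
      val sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
      Statistics(sizeInBytes = if (sizeInBytes == 0) 1 else sizeInBytes)
    }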

@cloud-fan
Contributor

cloud-fan commented Apr 22, 2016

@viirya, yeah, you are right: ObjectConsumer may also produce objects, so we should implement statistics in SerializeFromObject.

@gatorsmile I may have misunderstood your point. What do you mean by "the problem is the parent node calls the defaultSize of its child's output"? If we make SerializeFromObject.statistics work, its parent should be OK too, as SerializeFromObject will never produce objects.

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56647 has finished for PR 12599 at commit 84207c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@cloud-fan If we do not produce objects, it should work. Otherwise, we will hit the exception when the parent node calculates the statistics:

val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8

Previously, I simply used the child's statistics value as the statistics of SerializeFromObject. Obviously, that does not help.

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56650 has finished for PR 12599 at commit 8e0541c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// We can't estimate the size of ObjectType. We implement statistics here to avoid
// directly estimating any child plan that produces domain objects as output.
override def statistics: Statistics = {
  if (child.output.find(_.dataType.isInstanceOf[ObjectType]).isDefined) {
Contributor

nit: we can just do child.output.head.isInstanceOf[ObjectType]; this is guaranteed in:

trait ObjectConsumer extends UnaryNode {
   assert(child.output.length == 1)
   ...

Member Author

oh. yes.

@viirya viirya changed the title [SPARK-14838][SQL] Skip automatically broadcast a plan when it contains ObjectProducer [SPARK-14838][SQL] Implement statistics in SerializeFromObject to avoid failure when estimating sizeInBytes for ObjectType Apr 22, 2016
// directly estimating any child plan that produces domain objects as output.
override def statistics: Statistics = {
  if (child.output.head.dataType.isInstanceOf[ObjectType]) {
    Statistics(sizeInBytes = Long.MaxValue)
Contributor

bring back this discussion: #12599 (comment)

We can calculate the row size by this.output.map(_.dataType), can't we?

Member Author

So your point is to store numRows instead of sizeInBytes in Statistics? Is there any benefit?

Member Author

What I can think of is that we need to manipulate sizeInBytes directly in the statistics method of some logical plans, for example summing up children's sizeInBytes. So it is more convenient?

Contributor

Or should we go deeper, find the first child that doesn't output objects, and take its statistics? Returning Long.MaxValue means we can't broadcast-join a plan that has object operators, which is bad for Dataset.

Member Author

If the difference in the estimated sizeInBytes is acceptable, I think we can do it.

Member Author

I updated the logic here. Now it looks for an underlying logical plan that can be used to construct useful statistics.
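Roughly, the updated logic is a sketch along these lines; the exact predicate and fallback are the part still under discussion.

    // Sketch: if the child emits domain objects, borrow statistics from the first underlying
    // plan whose output is not ObjectType; otherwise fall back to a conservative estimate so
    // we never broadcast a plan whose size we cannot reason about.
    override def statistics: Statistics = {
      if (child.output.head.dataType.isInstanceOf[ObjectType]) {
        child.find(p => !p.output.exists(_.dataType.isInstanceOf[ObjectType]))
          .map(_.statistics)
          .getOrElse(Statistics(sizeInBytes = Long.MaxValue))
      } else {
        super.statistics
      }
    }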

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56665 has finished for PR 12599 at commit 9d2033f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56669 has finished for PR 12599 at commit 4b5f66d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// directly estimating any child plan that produces domain objects as output.
override def statistics: Statistics = {
  if (child.output.head.dataType.isInstanceOf[ObjectType]) {
    val underlyingPlan = child.find { p =>
Contributor

After rethinking it, this can be very complex, e.g. for MapGroups. Maybe we can just return Long.MaxValue and add a TODO here.

cc @marmbrus @davies

Member Author

Sure.

Member Author

Actually, for the case of MapGroups, I use the default way to calculate the sizeInBytes.

Member Author

I would rather not revert to Long.MaxValue immediately; let's see others' comments first.

Contributor

@davies Apr 22, 2016

Can we just have a default size (for example, 4k) for ObjectType?

Since we will have a better estimation for ObjectConsumer, the default size of ObjectType does not matter.

Member Author

The danger of a default size for ObjectType is underestimating the size of the domain object output. Then we might broadcast a plan that is actually big.

Member Author

I may not understand what "we will have better estimation on ObjectConsumer" means.

Contributor

@davies Apr 23, 2016

The ObjectProducer always sits in the middle of a query plan (especially for joins); the direct children of a join can't be an ObjectProducer.

Think of three operators: SQL operator -> ObjectProducer -> ObjectConsumer (produces UnsafeRow). The data size of a logical plan depends on the number of rows and the size of each row; the default size of an object only affects the size of a row. The estimation for the ObjectConsumer depends only on the number of rows from the ObjectProducer and the size of the rows it produces itself, which means the size of the object will NOT change the size of the ObjectConsumer.

Since we can have a better estimation for ObjectConsumer, the estimation for ObjectProducer does not matter (the number of rows matters, but the size of a row does not).
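A tiny back-of-the-envelope illustration of this point, using the UnaryNode-style scaling quoted earlier; the names are ad hoc, not Spark code.

    object SizeEstimateSketch {
      // Scale a child's size estimate by the output/child row-size ratio, as the default
      // UnaryNode statistics does.
      def scale(childSize: Long, childRowSize: Long, outputRowSize: Long): Long =
        childSize * outputRowSize / childRowSize

      def main(args: Array[String]): Unit = {
        val objectRowSize = 4096L + 8L   // assumed ObjectType default size + row overhead
        val numRows       = 1000L
        val producerSize  = numRows * objectRowSize   // ObjectProducer's estimate
        val consumerRow   = 4L + 8L                   // e.g. one int column + row overhead
        // The consumer's estimate reduces to numRows * consumerRow: the 4 KB guess cancels.
        println(scale(producerSize, objectRowSize, consumerRow))   // prints 12000
      }
    }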

Contributor

+1 for the 4k default size

Member Author

@davies Thanks. That makes sense.
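With that direction, the observable effect would be something like the quick check below, assuming ObjectType is reachable from user code and the merged patch uses the 4 KB figure discussed above.

    import org.apache.spark.sql.types.ObjectType

    object ObjectTypeDefaultSizeCheck {
      def main(args: Array[String]): Unit = {
        // Previously this threw UnsupportedOperationException; with a default size in place
        // it should return a fixed, conservative constant (4096 per the discussion above).
        println(ObjectType(classOf[(Int, Int)]).defaultSize)
      }
    }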

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56687 has finished for PR 12599 at commit dcd6056.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56690 has finished for PR 12599 at commit 3ff11a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-14838][SQL] Implement statistics in SerializeFromObject to avoid failure when estimating sizeInBytes for ObjectType [SPARK-14838][SQL] Set default size for ObjectType to avoid failure when estimating sizeInBytes in ObjectProducer Apr 23, 2016
@SparkQA

SparkQA commented Apr 24, 2016

Test build #56820 has finished for PR 12599 at commit 6b6c12d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  checkDataset(wideDF.map(_.getLong(0)), 0L until 10: _*)
}

test("Estimate size on ObjectProducer will cause failure") {
Contributor

the test case name is wrong?

Member Author

Fixed, thanks. Please see if the new name is more appropriate.

@SparkQA

SparkQA commented Apr 24, 2016

Test build #56825 has finished for PR 12599 at commit c26c3bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Apr 24, 2016

LGTM,
Merging this into master, thanks!

@asfgit asfgit closed this in ba5e0b8 Apr 24, 2016
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56832/
Test FAILed.

@viirya
Member Author

viirya commented Apr 24, 2016

Unrelated failure. I think it is ok. Thanks.

@viirya viirya deleted the skip-broadcast-objectproducer branch December 27, 2023 18:33