Conversation

@liancheng
Contributor

This PR fixes several nullability bugs found while investigating SPARK-12323.

@liancheng
Contributor Author

cc @cloud-fan

Contributor

Should we assign a default value to it? Otherwise I'm afraid it won't compile...

Contributor Author

This should work under standard Java. Will double-check how Janino behaves, though.

Update: it works as expected.

@cloud-fan
Contributor

retest this please.

@cloud-fan
Contributor

For nested fields, how about improving GetStructField?

@liancheng
Contributor Author

All ExtractValue expressions should be fixed in a similar way.
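
The idea, sketched minimally in Scala (illustrative only, not the exact Catalyst code): a struct-field extraction can yield null either because the struct value itself is null or because the field is declared nullable, so both must feed into the expression's nullability.

import org.apache.spark.sql.types.StructType

// Illustrative helper: nullability of a GetStructField-style extraction.
// StructType is a Seq[StructField], so schema(ordinal) picks the field.
def extractedFieldNullable(structNullable: Boolean, schema: StructType, ordinal: Int): Boolean =
  structNullable || schema(ordinal).nullable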

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47696 has finished for PR 10296 at commit de8b442.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng changed the title from [SPARK-12323][SQL] Makes BoundReference respect nullability to [SPARK-12323][SPARK-12335][SQL] Makes BoundReference respect nullability on Dec 15, 2015
@liancheng changed the title from [SPARK-12323][SPARK-12335][SQL] Makes BoundReference respect nullability to [SPARK-12323][SPARK-12335][SPARK-12336][SQL] Makes BoundReference respect nullability on Dec 15, 2015
@SparkQA

SparkQA commented Dec 15, 2015

Test build #47727 has finished for PR 10296 at commit fe11ed1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

This fixes SPARK-12335.

@liancheng
Contributor Author

Didn't add separate test cases for SPARK-12335 and SPARK-12336, since both were already caught by existing test cases.

@liancheng
Contributor Author

test this please

Contributor

can we set the correct join type here?

Contributor

nvm, seems there's no need.

Contributor Author

Good catch, thanks!

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47729 has finished for PR 10296 at commit 43e5743.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Hm, seems that this change reveals a lot of other existing nullability bugs...

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47730 has finished for PR 10296 at commit 43e5743.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

joined.copy(condition = condition)?

@liancheng force-pushed the spark-12323.non-nullable-ds-fields branch from 1d67016 to f9638b7 on December 15, 2015 12:56
Contributor

should we rename this to joinedColsFromRight?

Contributor Author

I think it's OK. The first line of the comment above already explains the purpose of this variable.

@liancheng changed the title from [SPARK-12323][SPARK-12335][SPARK-12336][SQL] Makes BoundReference respect nullability to [SPARK-12323][SPARK-12335][SPARK-12336][SPARK-12341][SPARK-12342][SQL] Makes BoundReference respect nullability on Dec 15, 2015
@SparkQA

SparkQA commented Dec 15, 2015

Test build #47736 has finished for PR 10296 at commit fa98ae6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47737 has finished for PR 10296 at commit cfcc6df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, can we fix the nested fields in a follow-up PR?

@liancheng force-pushed the spark-12323.non-nullable-ds-fields branch from cfcc6df to 05c36e5 on December 16, 2015 01:27
@liancheng
Contributor Author

@cloud-fan Yeah, that's the plan. Thanks for the review!

@marmbrus
Contributor

Hey guys, I think all this nullability cleanup is great, but I'm afraid the changes to BoundReference conflict with something @nongli and @davies are trying to accomplish. Basically, for performance reasons, we should be able to skip null checks for columns that are not nullable, since bitset operations and branches are pretty expensive. If we are trying to solve SPARK-12323, then I think the right place to do it is probably in NewInstance.
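
To illustrate the trade-off (a hedged sketch, not Spark's actual code generator; the names are made up): when a column is known to be non-nullable, the generated accessor can drop the isNullAt branch and bitset lookup entirely.

// Illustrative codegen fragment: emit the null check and branch only when the
// column may actually contain nulls; otherwise read the value directly.
def genIntAccessor(ordinal: Int, nullable: Boolean): String =
  if (nullable)
    s"""boolean isNull$ordinal = row.isNullAt($ordinal);
       |int value$ordinal = isNull$ordinal ? -1 : row.getInt($ordinal);""".stripMargin
  else
    s"int value$ordinal = row.getInt($ordinal);"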

Contributor

I think ideally we would not do this check at all when nullable = false.

@liancheng
Contributor Author

@marmbrus I also found that NewInstance and MapObjects need to be updated for complete runtime nullability checking, especially for nested fields. As stated in the PR description, I'm planning to fix that part in a follow-up PR.

So how about updating this PR to only fix the nullability mismatches without touching BoundReference, and then continuing with NewInstance etc. in another one?

@marmbrus
Contributor

That sounds good to me.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47770 has finished for PR 10296 at commit 05c36e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class WrapOption(child: Expression, optType: DataType)

@davies
Contributor

davies commented Dec 16, 2015

@liancheng I'm currently working on the nullability of expressions, could you hold this PR a little bit?

@liancheng
Contributor Author

@davies As commented above, I'll reshape this PR to only fix the incorrect nullability values without touching BoundReference, so that it won't conflict with your changes.

@liancheng changed the title from [SPARK-12323][SPARK-12335][SPARK-12336][SPARK-12341][SPARK-12342][SQL] Makes BoundReference respect nullability to [SPARK-12335][SPARK-12336][SPARK-12341][SPARK-12342][SQL] Fixes several expression nullability bugs on Dec 16, 2015
Contributor Author

Changes in this file fix SPARK-12336.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47808 has finished for PR 10296 at commit 7e35e37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47807 has finished for PR 10296 at commit d540b86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class WrapOption(child: Expression, optType: DataType)

@liancheng
Contributor Author

retest this please

The last build failure seems to be unrelated.

@liancheng
Contributor Author

@marmbrus Reshaped this PR so that it only fixes the nullability bugs.

After some more investigation, I no longer think we can resolve SPARK-12323 by fixing NewInstance. The reasons are:

  1. ExpressionEncoders are always created using reflection-based schema inference, which implies that the only non-nullable fields within a fromRowExpression are those of unboxed primitive types.
  2. Unboxed primitive fields are always retrieved using code generated in BoundReference rather than NewInstance, since NewInstance is only used to build objects.

Since we would like to avoid per-row null checking and branching costs (what @davies and @nongli are working on), we'll have to assume that the nullability of the input data always matches the schema of the ExpressionEncoder being used. Another, less appealing, option is to add a flag for generating code with null checks, so that users can enable it for debugging purposes.

On the other hand, we can and should ensure that the nullability of the underlying logical plan is consistent with the Dataset being constructed. For example, the following case currently works, even though the schema declares _1 as non-nullable while the data contains a null:

// The schema declares _1 as non-nullable, but the input data contains a null.
val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row(null)))
val schema = StructType(Seq(StructField("_1", StringType, nullable = false)))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.as[Tuple1[String]].collect().foreach(println)

// Output:
//
//   (hello)
//   (null)

This analysis-time check can be done in ExpressionEncoder.resolve by comparing the schemata of the logical plan and the encoder. Opened PR #10331 for this check.
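
A minimal sketch of such a check (the helper name and error message are assumptions, not the code in PR #10331): fail at analysis time when the plan may produce nulls for a slot the encoder declares non-nullable.

import org.apache.spark.sql.types.StructType

// Illustrative schema comparison: a plan field that may be null must not feed
// an encoder field declared non-nullable.
def checkNullabilityCompatible(planSchema: StructType, encoderSchema: StructType): Unit =
  planSchema.fields.zip(encoderSchema.fields).foreach { case (plan, enc) =>
    require(!plan.nullable || enc.nullable,
      s"Field '${enc.name}' is non-nullable in the encoder but nullable in the plan")
  }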

@marmbrus
Contributor

After some more investigation, I no longer think we can resolve SPARK-12323 by fixing NewInstance.

You could augment NewInstance to understand which arguments are primitive types, or you could infer this using reflection on the constructor. That said, I think the resulting expression tree would be clearer if you just created the following new expression and inserted it into the tree for primitive-typed children of a NewInstance.

case class AssertNotNull(path: String, child: Expression) ...
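
For illustration, a minimal interpreted-only sketch of how this might be fleshed out (an assumption, not the implementation that eventually landed; codegen is omitted via CodegenFallback):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

// Evaluate the child and fail loudly if it produces a null where the target
// object cannot accept one; `path` identifies the field in the error message.
case class AssertNotNull(path: String, child: Expression)
  extends UnaryExpression with CodegenFallback {

  override def dataType: DataType = child.dataType
  override def nullable: Boolean = false

  override def eval(input: InternalRow): Any = {
    val result = child.eval(input)
    if (result == null) {
      throw new RuntimeException(s"Null value found in non-nullable field: $path")
    }
    result
  }
}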

I think there may be some confusion about the schema guarantees for encoders and their expressions. When there is a primitive type, the corresponding toRowExpressions will be non-nullable, since the data is coming from an object that can't possibly store a null value. In contrast, the fromRowExpressions are reading from arbitrary input data, and thus their nullability has nothing to do with the structure of the target object. Instead, this information comes from the schema that the encoder is resolved/bound to. Therefore, using the nullable bit here is incorrect. nullable should always be thought of as a promise about the input data that we can use for optimization, not as a constraint-enforcement mechanism.

Another, less appealing, option is to add a flag for generating code with null checks, so that users can enable it for debugging purposes.

I think it would actually be very good to have assertions that null data does not appear where it is not expected. When we actually start using this information, we are almost certainly going to find more places where we are not propagating it correctly. However, we need to ensure that these assertions are elided in production, to avoid invalidating the optimization this information is supposed to enable.

This analysis-time check can be done in ExpressionEncoder.resolve by comparing the schemata of the logical plan and the encoder.

I don't agree that you can detect the problem statically. Creating a schema that says _1 is not nullable is valid, even though a String can be null in the JVM. Again, this is a promise about the data itself, so you can't assert that there is a problem until that promise is violated and you see a null value in the column.

That said, trusting the user to get this right has led to confusion in the past. So I would propose that we do add validations at the sqlContext.createDataX boundaries. Internally, though, we should trust this bit so that we can avoid unnecessary/expensive null checks.
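
A sketch of what such a boundary validation could look like (illustrative names, not the actual createDataFrame code path): verify rows against the declared schema as data enters Spark SQL, so that the nullable bit can be trusted internally.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Illustrative per-row check at the createDataFrame/createDataset boundary.
def validateRow(row: Row, schema: StructType): Unit =
  schema.fields.zipWithIndex.foreach { case (field, i) =>
    if (!field.nullable && row.isNullAt(i))
      throw new RuntimeException(
        s"Field '${field.name}' is declared non-nullable but contains a null at position $i")
  }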

@liancheng
Contributor Author

Closing this one since PR #10333 already covers all the nullability bugs fixed here.

@liancheng closed this Dec 18, 2015
@liancheng deleted the spark-12323.non-nullable-ds-fields branch on December 18, 2015 09:28
@liancheng
Contributor Author

@marmbrus Thanks a lot for the detailed explanation! Following your suggestion, I've added AssertNotNull in PR #10331.
