Conversation

@liancheng
Contributor

This PR fixes several nullability bugs found while investigating SPARK-12323.

@liancheng
Contributor Author

cc @cloud-fan

Contributor

Should we assign a default value to it? Otherwise I'm afraid it won't compile...

Contributor Author

This should work under standard Java. Will double-check how Janino behaves, though.

Update: it works as expected.

@cloud-fan
Contributor

retest this please.

@cloud-fan
Contributor

For nested fields, how about improving GetStructField?

@liancheng
Contributor Author

All ExtractValue expressions should be fixed in a similar way.
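
The idea, sketched minimally in Scala (illustrative only, not the exact Catalyst code): a struct-field extraction can yield null either because the struct value itself is null or because the field is declared nullable, so both must feed into the expression's nullability.

import org.apache.spark.sql.types.StructType

// Illustrative helper: nullability of a GetStructField-style extraction.
// StructType is a Seq[StructField], so schema(ordinal) picks the field.
def extractedFieldNullable(structNullable: Boolean, schema: StructType, ordinal: Int): Boolean =
  structNullable || schema(ordinal).nullable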

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47696 has finished for PR 10296 at commit de8b442.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng changed the title from [SPARK-12323][SQL] Makes BoundReference respect nullability to [SPARK-12323][SPARK-12335][SQL] Makes BoundReference respect nullability on Dec 15, 2015
@liancheng changed the title from [SPARK-12323][SPARK-12335][SQL] Makes BoundReference respect nullability to [SPARK-12323][SPARK-12335][SPARK-12336][SQL] Makes BoundReference respect nullability on Dec 15, 2015
@SparkQA

SparkQA commented Dec 15, 2015

Test build #47727 has finished for PR 10296 at commit fe11ed1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

This fixes SPARK-12335.

@liancheng
Contributor Author

Didn't add separate test cases for SPARK-12335 and SPARK-12336, since both were already caught by existing test cases.

@liancheng
Contributor Author

test this please

Contributor

can we set the correct join type here?

Contributor

nvm, seems there's no need.

Contributor Author

Good catch, thanks!

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47729 has finished for PR 10296 at commit 43e5743.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Hm, seems that this change reveals a lot of other existing nullability bugs...

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47730 has finished for PR 10296 at commit 43e5743.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

joined.copy(condition = condition)?

@liancheng force-pushed the spark-12323.non-nullable-ds-fields branch from 1d67016 to f9638b7 on December 15, 2015 12:56
Contributor

should we rename this to joinedColsFromRight?

Contributor Author

I think it's OK. The first line of the comment above already explains the purpose of this variable.

@liancheng changed the title from [SPARK-12323][SPARK-12335][SPARK-12336][SQL] Makes BoundReference respect nullability to [SPARK-12323][SPARK-12335][SPARK-12336][SPARK-12341][SPARK-12342][SQL] Makes BoundReference respect nullability on Dec 15, 2015
@SparkQA

SparkQA commented Dec 15, 2015

Test build #47736 has finished for PR 10296 at commit fa98ae6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47737 has finished for PR 10296 at commit cfcc6df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, can we fix the nested fields in a follow-up PR?

@liancheng force-pushed the spark-12323.non-nullable-ds-fields branch from cfcc6df to 05c36e5 on December 16, 2015 01:27
@liancheng
Contributor Author

@cloud-fan Yeah, that's the plan. Thanks for the review!

@marmbrus
Contributor

Hey guys, I think all this nullability cleanup is great, but I'm afraid the changes to BoundReference conflict with something @nongli and @davies are trying to accomplish. Basically, for performance reasons, we should be able to skip null checks for columns that are not nullable, since bitset operations and branches are pretty expensive. If we are trying to solve SPARK-12323, then I think the right place to do it is probably in NewInstance.
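
To illustrate the trade-off (a hedged sketch, not Spark's actual code generator; the names are made up): when a column is known to be non-nullable, the generated accessor can drop the isNullAt branch and bitset lookup entirely.

// Illustrative codegen fragment: emit the null check and branch only when the
// column may actually contain nulls; otherwise read the value directly.
def genIntAccessor(ordinal: Int, nullable: Boolean): String =
  if (nullable)
    s"""boolean isNull$ordinal = row.isNullAt($ordinal);
       |int value$ordinal = isNull$ordinal ? -1 : row.getInt($ordinal);""".stripMargin
  else
    s"int value$ordinal = row.getInt($ordinal);"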

Contributor

I think ideally we would not do this check at all when nullable = false.

@liancheng
Contributor Author

@marmbrus I also found that NewInstance and MapObjects need to be updated for complete runtime nullability checking, especially for nested fields. As stated in the PR description, I'm planning to fix that part in a follow-up PR.

So how about updating this PR to only fix the nullability mismatches without touching BoundReference, and then continuing with NewInstance etc. in another one?

@marmbrus
Contributor

That sounds good to me.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47770 has finished for PR 10296 at commit 05c36e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class WrapOption(child: Expression, optType: DataType)

@davies
Contributor

davies commented Dec 16, 2015

@liancheng I'm currently working on the nullability of expressions, could you hold this PR a little bit?

@liancheng
Contributor Author

@davies As commented above, I'll reshape this PR to only fix the incorrect nullability values without touching BoundReference, so that it won't conflict with your changes.

@liancheng changed the title from [SPARK-12323][SPARK-12335][SPARK-12336][SPARK-12341][SPARK-12342][SQL] Makes BoundReference respect nullability to [SPARK-12335][SPARK-12336][SPARK-12341][SPARK-12342][SQL] Fixes several expression nullability bugs on Dec 16, 2015
Contributor Author

Changes in this file fix SPARK-12336.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47808 has finished for PR 10296 at commit 7e35e37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47807 has finished for PR 10296 at commit d540b86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class WrapOption(child: Expression, optType: DataType)

@liancheng
Contributor Author

retest this please

The last build failure seems to be unrelated.

@liancheng
Contributor Author

@marmbrus Reshaped this PR so that it only fixes the nullability bugs.

After some more investigation, I no longer think we can resolve SPARK-12323 by fixing NewInstance. The reasons are:

  1. ExpressionEncoders are always created using reflection-based schema inference, which implies that the only non-nullable fields within a fromRowExpression are those of unboxed primitive types.
  2. Unboxed primitive fields are always retrieved using code generated in BoundReference rather than NewInstance, since NewInstance is only used to build objects.

Since we would like to avoid per-row null checking and branching costs (what @davies and @nongli are working on), we'll have to assume that the nullability of the input data always matches the schema of the ExpressionEncoder being used. Another, less appealing, option is to add a flag for generating code with null checks, so that users can enable it for debugging purposes.

On the other hand, we can and should ensure that the nullability of the underlying logical plan is consistent with the Dataset being constructed. For example, the following case currently works, even though the schema declares _1 as non-nullable while the data contains a null:

// The schema declares _1 as non-nullable, but the input data contains a null.
val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row(null)))
val schema = StructType(Seq(StructField("_1", StringType, nullable = false)))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.as[Tuple1[String]].collect().foreach(println)

// Output:
//
//   (hello)
//   (null)

This analysis-time check can be done in ExpressionEncoder.resolve by comparing the schemata of the logical plan and the encoder. Opened PR #10331 for this check.
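
A minimal sketch of such a check (the helper name and error message are assumptions, not the code in PR #10331): fail at analysis time when the plan may produce nulls for a slot the encoder declares non-nullable.

import org.apache.spark.sql.types.StructType

// Illustrative schema comparison: a plan field that may be null must not feed
// an encoder field declared non-nullable.
def checkNullabilityCompatible(planSchema: StructType, encoderSchema: StructType): Unit =
  planSchema.fields.zip(encoderSchema.fields).foreach { case (plan, enc) =>
    require(!plan.nullable || enc.nullable,
      s"Field '${enc.name}' is non-nullable in the encoder but nullable in the plan")
  }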

@marmbrus
Contributor

After some more investigation, I no longer think we can resolve SPARK-12323 by fixing NewInstance.

You could augment NewInstance to understand which arguments are primitive types, or you could infer this using reflection on the constructor. That said, I think the resulting expression tree would be clearer if you just created the following new expression and inserted it into the tree for primitive-typed children of a NewInstance.

case class AssertNotNull(path: String, child: Expression) ...
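
For illustration, a minimal interpreted-only sketch of how this might be fleshed out (an assumption, not the implementation that eventually landed; codegen is omitted via CodegenFallback):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

// Evaluate the child and fail loudly if it produces a null where the target
// object cannot accept one; `path` identifies the field in the error message.
case class AssertNotNull(path: String, child: Expression)
  extends UnaryExpression with CodegenFallback {

  override def dataType: DataType = child.dataType
  override def nullable: Boolean = false

  override def eval(input: InternalRow): Any = {
    val result = child.eval(input)
    if (result == null) {
      throw new RuntimeException(s"Null value found in non-nullable field: $path")
    }
    result
  }
}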

I think there may be some confusion about the schema guarantees for encoders and their expressions. When there is a primitive type, the corresponding toRowExpressions will be non-nullable, since the data is coming from an object that can't possibly store a null value. In contrast, the fromRowExpressions are reading from arbitrary input data, and thus their nullability has nothing to do with the structure of the target object. Instead, this information comes from the schema that the encoder is resolved/bound to. Therefore, using the nullable bit here is incorrect. nullable should always be thought of as a promise about the input data that we can use for optimization, not as a constraint-enforcement mechanism.

Another, less appealing, option is to add a flag for generating code with null checks, so that users can enable it for debugging purposes.

I think it would actually be very good to have assertions that null data does not appear where it is not expected. When we actually start using this information, we are almost certainly going to find more places where we are not propagating it correctly. However, we need to ensure that these assertions are elided in production, to avoid invalidating the optimization this information is supposed to enable.

This analysis-time check can be done in ExpressionEncoder.resolve by comparing the schemata of the logical plan and the encoder.

I don't agree that you can detect the problem statically. Creating a schema that says _1 is not nullable is valid, even though a String can be null in the JVM. Again, this is a promise about the data itself, so you can't assert that there is a problem until that promise is violated and you see a null value in the column.

That said, trusting the user to get this right has led to confusion in the past. So I would propose that we do add validations at the sqlContext.createDataX boundaries. Internally, though, we should trust this bit so that we can avoid unnecessary/expensive null checks.
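
A sketch of what such a boundary validation could look like (illustrative names, not the actual createDataFrame code path): verify rows against the declared schema as data enters Spark SQL, so that the nullable bit can be trusted internally.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Illustrative per-row check at the createDataFrame/createDataset boundary.
def validateRow(row: Row, schema: StructType): Unit =
  schema.fields.zipWithIndex.foreach { case (field, i) =>
    if (!field.nullable && row.isNullAt(i))
      throw new RuntimeException(
        s"Field '${field.name}' is declared non-nullable but contains a null at position $i")
  }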

@liancheng
Contributor Author

Closing this one since PR #10333 already covers all the nullability bugs fixed here.

@liancheng closed this Dec 18, 2015
@liancheng deleted the spark-12323.non-nullable-ds-fields branch on December 18, 2015 09:28
@liancheng
Contributor Author

@marmbrus Thanks a lot for the detailed explanation! Following your suggestion, I've added AssertNotNull in PR #10331.
