[SPARK-12335][SPARK-12336][SPARK-12341][SPARK-12342][SQL] Fixes several expression nullability bugs #10296
|
cc @cloud-fan |
Should we assign a default value for it? Otherwise I'm afraid it won't compile...
This should work under standard Java. Will double check how Janino behaves though.
Update: it works as expected.
|
retest this please. |
|
For nested fields, how about improving the |
|
All |
|
Test build #47696 has finished for PR 10296 at commit
|
|
Test build #47727 has finished for PR 10296 at commit
|
This fixes SPARK-12335.
|
Didn't add separate test cases for SPARK-12335 and SPARK-12336 since they were caught by existing test cases. |
|
test this please |
can we set the corrected join type here?
nvm, seems no need.
Good catch, thanks!
|
Test build #47729 has finished for PR 10296 at commit
|
|
Hm, seems that this change reveals a lot of other existing nullability bugs... |
|
Test build #47730 has finished for PR 10296 at commit
|
joined.copy(condition = condition)?
Force-pushed: 1d67016 to f9638b7
should we rename this to joinedColsFromRight?
I think it's OK. The first line of the comment above already explained the purpose of this variable.
|
Test build #47736 has finished for PR 10296 at commit
|
|
Test build #47737 has finished for PR 10296 at commit
|
|
LGTM, can we fix the nested fields in a follow-up PR? |
Force-pushed: cfcc6df to 05c36e5
|
@cloud-fan Yeah, that's the plan. Thanks for the review! |
|
Hey guys, I think all this nullability cleanup is great, but I'm afraid that the changes to |
I think ideally we would not do this check at all when nullable = false.
|
@marmbrus I also found that So how about updating this PR to only fix all the nullability mismatches without touching |
|
That sounds good to me. |
|
Test build #47770 has finished for PR 10296 at commit
|
|
@liancheng I'm currently working on the nullability of expressions; could you hold this PR for a bit? |
|
@davies As commented above, I'll reshape this PR to only fix those wrong nullability issues without touching |
Changes in this file fix SPARK-12336.
|
Test build #47808 has finished for PR 10296 at commit
|
|
Test build #47807 has finished for PR 10296 at commit
|
|
retest this please. The last build failure seems to be unrelated. |
|
@marmbrus Reshaped this PR to only fix those nullability bugs. After some more investigation, now I don't think we can resolve SPARK-12323 by fixing
Since we would like to avoid per-row runtime null checking and branching cost (what @davies and @nongli are working on), we'll have to assume that the nullability of the input data always matches the schema of the
On the other hand, we can and should ensure that the nullability of the underlying logical plan is consistent with the Dataset while constructing a Dataset. For example, the following case currently works:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import sqlContext.implicits._

// The schema declares `_1` as non-nullable, yet the data contains a null.
val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row(null)))
val schema = StructType(Seq(StructField("_1", StringType, nullable = false)))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.as[Tuple1[String]].collect().foreach(println)
// Output:
//
// (hello)
// (null)

This analysis-time check can be done in |
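A minimal sketch of the kind of analysis-time consistency check described above, assuming a simplified `Field` type in place of Spark's `StructField`. The names `Field`, `NullabilityCheck`, and `mismatches` are hypothetical and are not Spark APIs. It only compares declared nullability, so it would not catch the example above, where the data itself violates a schema that declares `_1` as non-nullable.

```scala
// Hypothetical, simplified stand-in for a schema field; not Spark's StructField.
final case class Field(name: String, nullable: Boolean)

object NullabilityCheck {
  // Returns the names of fields that the plan declares as nullable
  // but that the target type expects to be non-nullable.
  // Fields missing from the plan are ignored in this sketch.
  def mismatches(plan: Seq[Field], expected: Seq[Field]): Seq[String] = {
    val planByName = plan.map(f => f.name -> f).toMap
    expected.collect {
      case e if !e.nullable && planByName.get(e.name).exists(_.nullable) => e.name
    }
  }
}
```

For instance, `NullabilityCheck.mismatches(Seq(Field("_1", nullable = true)), Seq(Field("_1", nullable = false)))` reports `_1`, while two schemas that agree on nullability produce an empty result.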
You could augment
case class AssertNotNull(path: String, child: Expression) ...
I think there may be some confusion about the schema guarantees for encoders and their expressions. When there is a primitive type, the corresponding
I think it would actually be very good to have assertions that null data does not appear where it is not expected. When we actually start using this information we are almost certainly going to find more places we are not propagating the information correctly. However, we need to ensure that these are elided in production to avoid invalidating the optimization this information is supposed to enable.
I don't agree that you can verify the problem with this code statically. Creating a schema that says that
That said, trusting the user to get this right has led to confusion in the past. So I would propose that we do add validations at the |
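To illustrate the assertion idea, here is a rough sketch built on a hypothetical miniature `Expression` trait, not Spark's real `Expression` API. Only the `AssertNotNull(path, child)` signature comes from the comment above; `Column`, `Assertions.guard`, the `assertionsEnabled` flag, and the error message are assumptions made for this example.

```scala
// Hypothetical miniature expression tree; Spark's real Expression API is much richer.
trait Expression {
  def nullable: Boolean
  def eval(row: Map[String, Any]): Any
}

// Reads a named value from the row; yields null when the key is absent.
final case class Column(name: String, nullable: Boolean) extends Expression {
  def eval(row: Map[String, Any]): Any = row.getOrElse(name, null)
}

// Wraps a child and fails fast when it yields null; `path` names the offending field.
final case class AssertNotNull(path: String, child: Expression) extends Expression {
  val nullable: Boolean = false
  def eval(row: Map[String, Any]): Any = {
    val value = child.eval(row)
    if (value == null) {
      throw new RuntimeException(s"Null value appeared in non-nullable field: $path")
    }
    value
  }
}

object Assertions {
  // Adds the check only when assertions are enabled, so that the
  // wrapper is elided entirely in production plans.
  def guard(path: String, child: Expression, assertionsEnabled: Boolean): Expression =
    if (assertionsEnabled) AssertNotNull(path, child) else child
}
```

With `assertionsEnabled = false`, `guard` returns the child unchanged, so no per-row check remains; with it enabled, a null in a non-nullable field raises an error that names the field path.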
|
Closing this one since PR #10333 already covers all the nullability bugs fixed in this one. |
This PR fixes several nullability bugs found while investigating SPARK-12323.