[SPARK-24385][SQL] Resolve self-join condition ambiguity for all BinaryComparisons #21449
Conversation
I'm not sure that this behavior should be applied to all binary comparisons. It could result in unexpected behavior in some rare cases. For example:
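The example that originally followed this comment was not preserved in the page extraction. A hypothetical sketch of the kind of surprise being described (names and data are illustrative, not taken from the PR):

```scala
// Hypothetical sketch, not the original example. After a self-join,
// both sides of a comparison can carry the same attribute reference,
// so a rewrite has to guess which dataset each side refers to.
val df  = spark.range(10).toDF("id")
val df2 = df.filter($"id" > 5)

// In "df(id) < df2(id)" both columns resolve to the same underlying
// attribute. Rewriting the condition to compare the left child's "id"
// with the right child's "id" is usually what the user wants, but it
// silently changes a condition that, as written, compares an attribute
// with itself.
df.join(df2, df("id") < df2("id"))
```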
Test build #91253 has finished for PR 21449 at commit
@daniel-shields in that case you have two different datasets
This case can also occur when the datasets are different but share a common lineage. Consider the following: it currently fails with eqNullSafe, but works with ==.
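The code for this example was also lost in extraction. A hedged reconstruction of the kind of common-lineage case being described (the dataset names are made up for illustration):

```scala
// Hypothetical reconstruction: two datasets that are different but
// derive from the same parent, so their "id" columns share the same
// underlying attribute reference.
val base  = spark.range(10).toDF("id")
val left  = base.filter($"id" % 2 === 0)
val right = base.filter($"id" % 2 === 1)

// Works: Dataset.join special-cases EqualTo over same-referenced
// attributes and rewrites the condition against the join's left and
// right children.
left.join(right, left("id") === right("id"))

// Fails at the time of this PR: EqualNullSafe (<=>) is not covered by
// that rewrite, so the condition stays ambiguous.
left.join(right, left("id") <=> right("id"))
```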
Thanks @daniel-shields, you're right. I am working to check whether and how this can be fixed. Thanks for the catch!
Test build #91298 has finished for PR 21449 at commit
Test build #91303 has finished for PR 21449 at commit
@mgaido91 I looked at the test failures and I think the changes to the Dataset.resolve method are causing havoc. Consider the Dataset.drop method with the following signature: this may be resulting in columns not getting dropped. I haven't verified it, but it is the first thing I would check. The change to resolve may be too drastic. I think the same problem occurs in other Dataset methods as well, and it may also affect methods in KeyValueGroupedDataset and RelationalGroupedDataset.
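The signature quoted in this comment was dropped during extraction. The Column-based overload of drop in Spark at that time looked roughly like this (body elided; paraphrased from Dataset.scala):

```scala
// The Column-based overload of Dataset.drop. It works by resolving the
// given Column against this Dataset's plan and filtering it out of the
// output, which is why a change to Dataset.resolve can make the column
// fail to match and therefore not be dropped.
def drop(col: Column): DataFrame
```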
Yes @daniel-shields, you are right in your analysis. The problem was that we were sometimes using
I think the only way to address the problem described here is to reference which dataset the column is coming from. Adding metadata for it seems the cleanest way. We may also add a new attribute to the
This is a long-standing issue; I've seen many attempts to fix it (including my own), but none succeeded. The major problem is that there is no clear definition of the expected behavior, i.e. what the semantics of some examples should be. Sometimes we can use an ancestor's column in a new Dataset, but sometimes we can't. We should make the condition clear first.
Test build #91343 has finished for PR 21449 at commit
Thanks for your comment @cloud-fan. I understand your point. That is quite a tricky problem, since we would probably also need to know the "DAG" of the dataframes in order to make the right decision. But although this change is related to that problem, I think it is different and has a much smaller scope. Indeed, while we could use the metadata information in many places, in this patch it is used only in the self-join case where there is ambiguity about which column to take. The behavior in every other case is unchanged. So after this patch, the situation in resolving columns using
My point is that we may have a different design if we want to solve this problem holistically, which may conflict with this patch. We should prove that this is the right direction and that a future fix will not conflict with it, or come up with the final fix directly. An example is, we may want to treat
I see what you mean. Honestly, I have not thought through a full design for this problem (so I can't state what we should and shouldn't support), but focusing on this specific case I think that:
So I think that in the holistic approach we shouldn't change the current behavior, which is present now and will be (IMHO) improved by this patch. What I do think we have to discuss, in order not to have to change it once we want to solve the more generic issue, is the way to track the dataset an attribute is coming from. Here I decided to use the metadata, since I thought this is the cleanest approach. Another approach might be to introduce a new
What do you think?
This will definitely not go into 2.3.1, so we have plenty of time. I'll think deeper about it after the Spark Summit. IMO
Sure, thanks for your time. PS
In the short term we should make the behavior of EqualTo and EqualNullSafe identical. We could do that by adding a case for EqualNullSafe that mirrors that of EqualTo. |
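A sketch of what this short-term fix could look like inside Dataset.join's condition rewrite. The EqualTo case below is based on the code that existed in Spark at the time; the added EqualNullSafe case mirrors it. Exact details may differ from the eventual patch:

```scala
// Inside Dataset.join: rewrite comparisons whose two sides are the
// same attribute reference to point at the join's left/right children.
val cond = plan.condition.map { _.transform {
  // Existing behavior: EqualTo over a self-joined attribute is resolved
  // against both sides of the join.
  case catalyst.expressions.EqualTo(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualTo(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))
  // Proposed addition: treat EqualNullSafe (<=>) identically.
  case catalyst.expressions.EqualNullSafe(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualNullSafe(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))
}}
```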
This seems pretty safe and reasonable to me.
@daniel-shields do you want to open a PR for that? I'll leave this PR open, as it is a more general fix, so we can continue the long-term discussion here. Do you agree with this approach, @cloud-fan?
I like the proposal by @daniel-shields. If we can get it fixed soon, we will be able to catch the Spark 2.3.2 release.
OK, so I created #21605 for the fix proposed by @daniel-shields. I'd like to leave this open in order to continue the discussion toward a better long-term fix.
@cloud-fan do you have any further comments about this? Thanks. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?

In Dataset.join we have a small hack for resolving ambiguity in the column names for self-joins. The current code supports only EqualTo, but we may have other conditions involving columns on self-joins: in general, any BinaryComparison can be specified and faces the same issue. The PR extends the fix to all BinaryComparisons.

How was this patch tested?

added UT
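A sketch of the generalization the PR describes: instead of matching only EqualTo in Dataset.join's condition rewrite, match any BinaryComparison whose two sides are the same attribute reference, and rebuild it with the left/right resolved columns. This is paraphrased from the surrounding description; the PR's actual code may differ:

```scala
// Generalized rewrite: the BinaryComparison extractor in Catalyst
// matches EqualTo, EqualNullSafe, LessThan, GreaterThan, etc., so one
// case covers all the comparison operators that face the same
// ambiguity in self-joins.
val cond = plan.condition.map { _.transform {
  case cmp @ BinaryComparison(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    // Rebuild the same comparison with each side resolved against the
    // corresponding child of the join.
    cmp.makeCopy(Array(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name)))
}}
```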