[SPARK-12543] [SPARK-4226] [SQL] Subquery in expression #10706
|
Test build #49172 has finished for PR 10706 at commit
|
|
Test build #2367 has finished for PR 10706 at commit
|
|
Test build #49232 has finished for PR 10706 at commit
|
|
Test build #49247 has finished for PR 10706 at commit
|
cd14e20 to aa33df0
This is actually what I am struggling with: if the join key is null, will the row go into the anti join result?
|
Test build #2374 has finished for PR 10706 at commit
|
|
Test build #49302 has finished for PR 10706 at commit
|
|
Test build #49306 has finished for PR 10706 at commit
|
|
@chenghao-intel Since you also worked on this topic, could you take a look at this? It would be good to turn on those Hive compatibility tests, but I'm stuck on how to generate the golden results. I tried to pull in yours from #9055, but unfortunately some of those queries are not in a good state (missing spaces?). |
|
Yes, I can build a PR against your branch. :), but for |
|
@chenghao-intel This PR does not have a null-aware anti join, so some queries will fail during analysis; we could add that later. |
|
@hvanhovell Can you help take a look at this one? I'd like to split this up into smaller PRs; hopefully we can merge some of them into 2.0. |
|
@davies sure, no problem. |
Conflicts:
    sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
    sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
    sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala
|
Test build #51206 has finished for PR 10706 at commit
|
|
Can we close this first, and create a new one when we get to correlated subqueries? |
|
I'd like to keep this open so I can easily find this branch. |
### What changes were proposed in this pull request?
This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys.
We currently add support for the following SQL join syntax:
SELECT *
FROM tbl1 A
LEFT ANTI JOIN tbl2 B
ON A.Id = B.Id
Or using a dataframe:
tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti")
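The null-handling rule described above can be illustrated without Spark. The following is a minimal pure-Python sketch of the semantics only, not Spark's implementation; the helper name `left_anti_join` and the sample rows are hypothetical, and `None` stands in for SQL `null`.

```python
def left_anti_join(left, right, key):
    """Return the rows of `left` that have no equi-join match in `right`.

    A None key never compares equal to anything (SQL three-valued logic),
    so a left row with a None key can never match and is always kept.
    """
    # Collect the non-null keys present on the right side.
    right_keys = {row[key] for row in right if row[key] is not None}
    return [
        row for row in left
        if row[key] is None or row[key] not in right_keys
    ]


# Hypothetical sample data mirroring tbl1 / tbl2 from the example above.
tbl1 = [{"id": 1}, {"id": 2}, {"id": None}]
tbl2 = [{"id": 2}, {"id": None}]

result = left_anti_join(tbl1, tbl2, "id")
```

Here the row with `id` 2 is dropped because it matches, while the rows with `id` 1 and `id` `None` are kept, even though `tbl2` also contains a `None` key.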
This PR serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as a good basis for implementing a more efficient `EXCEPT` operator.
The PR has been (loosely) based on PRs by both davies (#10706) and chenghao-intel (#10563); credit should be given where credit is due.
This PR adds support for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including code generation), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`.
### How was this patch tested?
Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from #10563.
cc davies chenghao-intel rxin
Author: Herman van Hovell
Closes #12214 from hvanhovell/SPARK-12610.
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:
- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (#10706) and chenghao-intel (#9055). They should be credited for the work they did.
### How was this patch tested?
Modified parsing unit tests. Added tests to `org.apache.spark.sql.SQLQuerySuite`.
cc rxin, davies & chenghao-intel
Author: Herman van Hovell
Closes #12306 from hvanhovell/SPARK-4226.
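To make the difference between the two predicate forms in the commit message above concrete, here is a minimal pure-Python sketch of how they evaluate under standard SQL three-valued logic. It illustrates the semantics only and is not the code in that PR; all function names are hypothetical, and `None` stands in for `null` (unknown).

```python
def exists(subquery_rows):
    # EXISTS is true as soon as the subquery returns any row; it is never unknown.
    return len(subquery_rows) > 0


def in_subquery(value, subquery_values):
    """Evaluate `value IN (subquery)`; returns True, False, or None (unknown)."""
    if not subquery_values:
        return False  # IN over an empty result is false, even for a null value
    if value is None:
        return None   # null compared with anything is unknown
    if value in [v for v in subquery_values if v is not None]:
        return True
    # No match: unknown if a null could have matched, otherwise false.
    return None if None in subquery_values else False


def not_(tri):
    # Three-valued NOT: unknown stays unknown.
    return None if tri is None else not tri


def keep(tri):
    # A WHERE/HAVING filter keeps a row only when the predicate is true.
    return tri is True


rows = [1, 2, None]
kept_with_null = [x for x in rows if keep(not_(in_subquery(x, [2, None])))]
kept_without_null = [x for x in rows if keep(not_(in_subquery(x, [2])))]
```

With a `null` in the subquery result, `NOT IN` keeps no rows at all (`kept_with_null` is empty), whereas without it only the row `1` survives. This is the behaviour a null-aware anti join has to reproduce.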
|
Hi, I encountered a similar problem (Spark 1.5.2). The subquery looks like this:
SELECT * FROM his_data_zadd WHERE value=(SELECT MAX(his_t.value) FROM his_data_zadd AS his_t)
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql.
: java.lang.RuntimeException: [1.49] failure: ``)'' expected but identifier MAX found
SELECT * FROM his_data_zadd WHERE value=(SELECT MAX(his_t.value) FROM his_data_zadd AS his_t)
                                                ^
How should I write this subquery correctly? |
|
Spark 1.5 does not support subqueries. |
|
Thanks. |
|
Hi Davies, could you please shed more light on the status of correlated but non-scalar subqueries in the Spark 2.0 release? I would appreciate it if you could also summarize any other restrictions.
Query: Select
Error: Error in SQL statement: AnalysisException: Predicate sub-queries can only be used in a Filter: Project [runon#4031 AS runon#4026,CASE WHEN predicate-subquery#4027 [(key#4033 = key#4037)] THEN vowels ELSE consonants END AS group#4028,key#4033 AS key#4029,someint#4034 AS someint#4030] |
|
Predicate subqueries (IN, EXISTS) in SELECT are not supported in 2.0; they are only supported in WHERE/HAVING. |
|
Thank you! Are there any alternatives to using a predicate subquery? Select |
|
@kamalcoursera we are very close to a Spark 2.0 release, so this will not be added. However, you could use a predicate scalar subquery here, i.e.:
select runon as runon,
case
when (select max(true) from sqltesttable b where b.key = a.key and group = 'vowels') then 'vowels'
else 'consonants'
end as group,
key as key,
someint as someint
from sqltesttable a; |
This PR adds support for subqueries in expressions (using a subquery as an expression inside SELECT/WHERE/HAVING), for example:
A subquery can be uncorrelated or correlated, and it can be a scalar subquery (returns a single row and a single column) or not (returns multiple rows, for EXISTS or IN).
A correlated subquery, or an uncorrelated subquery that returns multiple rows, will be rewritten as a JOIN. A scalar subquery will be executed separately, and its result will be filled into the expression.
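The two strategies described above can be sketched at the row level in plain Python. This is only an illustration under stated assumptions: Spark performs these rewrites on query plans, not on Python lists, and the function names, the `key` column and the sample rows are hypothetical. The first function mirrors the `MAX` query asked about earlier in this thread.

```python
def filter_by_scalar_subquery(rows):
    """SELECT * FROM t WHERE value = (SELECT MAX(value) FROM t)."""
    # Step 1: run the uncorrelated scalar subquery once, on its own.
    # MAX ignores nulls and yields null (None) when nothing is left to aggregate.
    max_value = max(
        (r["value"] for r in rows if r["value"] is not None), default=None
    )
    # Step 2: fill the single result into the outer expression and filter.
    return [r for r in rows if max_value is not None and r["value"] == max_value]


def exists_as_semi_join(a, b, key):
    """SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.key = a.key)."""
    # The correlated predicate becomes the join condition of a left semi join:
    # each row of `a` is emitted at most once, and null keys never match.
    b_keys = {r[key] for r in b if r[key] is not None}
    return [r for r in a if r[key] in b_keys]


his_data_zadd = [{"value": 3}, {"value": 7}, {"value": None}, {"value": 7}]
top_rows = filter_by_scalar_subquery(his_data_zadd)

a = [{"key": 1}, {"key": 2}, {"key": None}]
b = [{"key": 2}, {"key": 2}, {"key": None}]
matched = exists_as_semi_join(a, b, "key")
```

The scalar case touches the subquery exactly once regardless of how many outer rows there are, while the `EXISTS` case never duplicates an outer row even when `b` contains several matching rows.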
Restrictions: