[SPARK-12543] [SPARK-4226] [SQL] Subquery in expression #10706

davies · 2016-01-11T20:47:24Z

This PR brings subquery for expression (use subquery as an expression inside SELECT/WHERE/HAVING), for example:

  select (select 1) as a
  select * from t where a = (select max(b) from t2)
  select * from t where exists (select * from t2 where t2.b = t.b)
  select max(a) as ma from t having ma = (select max(b) from t2)

A subquery could be uncorrelated or correlated, it could be scalar subquery (returns single row and single column) or not (returns multiple rows, for EXISTS or IN).

A correlated subquery or uncorrelated subquery that returns multiple rows will be rewritten as JOIN. Scalar subquery will be executed separately, result will be filled into the expression.

Restrictions:

EXISTS/ NOT EXISTS/IN/NOT IN can only be used as top level condition of WHERE/HAVING, can't be used in OR or others.
a NOT IN subquery is not supported, if subquery is nullable, which requires a null-aware semi join.
Correlated subquery can only be used inside WHERE/HAVING, uncorrelated scalar subquery could be used anywhere.

SparkQA · 2016-01-11T21:55:14Z

Test build #49172 has finished for PR 10706 at commit 7ae30e1.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-12T06:37:02Z

Test build #2367 has finished for PR 10706 at commit 7ae30e1.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-12T08:50:22Z

Test build #49232 has finished for PR 10706 at commit 2d88bb3.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-12T18:34:32Z

Test build #49247 has finished for PR 10706 at commit cd14e20.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PageView(val url: String, val status: Int, val zipCode: Int, val userID: Int)
- class FlumeEventServer(receiver: FlumeReceiver) extends AvroSourceProtocol
- class UnsafeCartesianRDD(left: RDD[UnsafeRow], right: RDD[UnsafeRow], numFieldsOfRight: Int)
- case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)
- class ObjectInputStreamWithLoader(_inputStream: InputStream, loader: ClassLoader)
- class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
- abstract class InputDStream[T: ClassTag] (_ssc: StreamingContext)
- abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)

chenghao-intel · 2016-01-13T07:43:41Z

sql/core/src/test/scala/org/apache/spark/sql/execution/joins/SemiJoinSuite.scala

This is actually what I am struggling, if the join key is null, then will we let's go into the anti join result?

SparkQA · 2016-01-13T07:43:59Z

Test build #2374 has finished for PR 10706 at commit aa33df0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PageView(val url: String, val status: Int, val zipCode: Int, val userID: Int)
- class FlumeEventServer(receiver: FlumeReceiver) extends AvroSourceProtocol
- class UnsafeCartesianRDD(left: RDD[UnsafeRow], right: RDD[UnsafeRow], numFieldsOfRight: Int)
- case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)
- class ObjectInputStreamWithLoader(_inputStream: InputStream, loader: ClassLoader)
- class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
- abstract class InputDStream[T: ClassTag] (_ssc: StreamingContext)
- abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)

SparkQA · 2016-01-13T07:45:29Z

Test build #49302 has finished for PR 10706 at commit aa33df0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PageView(val url: String, val status: Int, val zipCode: Int, val userID: Int)
- class FlumeEventServer(receiver: FlumeReceiver) extends AvroSourceProtocol
- class UnsafeCartesianRDD(left: RDD[UnsafeRow], right: RDD[UnsafeRow], numFieldsOfRight: Int)
- case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)
- class ObjectInputStreamWithLoader(_inputStream: InputStream, loader: ClassLoader)
- class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
- abstract class InputDStream[T: ClassTag] (_ssc: StreamingContext)
- abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)

SparkQA · 2016-01-13T10:11:09Z

Test build #49306 has finished for PR 10706 at commit d1feebd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-01-13T20:13:38Z

@chenghao-intel Since you also worked on this topic, could you take a look a this?

It's good to turn on those Hive compatibility tests, but I'm stucked by how to generated the golden results. I tried to pull in yours in #9055, but unfortunately some of those queries are not in good state (missing spaces?).

chenghao-intel · 2016-01-14T01:02:27Z

Yes, I can build a PR against your branch. :), but for anti join, I have another PR #10563, some of my concerns are being discussed, one key issue is about the null join key, I think we probably need to filter out the null in predicate expression, and it will be great if we can figure out that issue first.

davies · 2016-01-14T01:32:00Z

@chenghao-intel This PR does not have a null-aware anti join, it will fail some query when analyzing, we could add that later.

davies · 2016-02-11T19:09:39Z

@hvanhovell Can you help to look at this one? I'd like to split this out as small PRs, hopefully we can merge part of them into 2.0.

hvanhovell · 2016-02-11T19:47:31Z

@davies sure, no problem.

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala

SparkQA · 2016-02-12T22:03:48Z

Test build #51206 has finished for PR 10706 at commit a3fccab.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeansModel(JavaModel, MLWritable, MLReadable):
- class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol, HasSeed,
- class BisectingKMeansModel(JavaModel):
- class BisectingKMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasSeed):
- class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, HasRegParam, HasSeed,
- class ALSModel(JavaModel, MLWritable, MLReadable):

rxin · 2016-02-20T06:12:14Z

Can we close this first, and create a new one when we get to correlated subqueries?

davies · 2016-02-20T06:24:42Z

I'd like to keep this open so I can easily find this branch.

### What changes were proposed in this pull request? This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys. We currently add support for the following SQL join syntax: SELECT * FROM tbl1 A LEFT ANTI JOIN tbl2 B ON A.Id = B.Id Or using a dataframe: tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti) This PR provides serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as good basis for implementing an more efficient `EXCEPT` operator. The PR has been (losely) based on PR's by both davies (#10706) and chenghao-intel (#10563); credit should be given where credit is due. This PR adds supports for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including codegeneration), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`. ### How was this patch tested? Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from #10563. cc davies chenghao-intel rxin Author: Herman van Hovell <[email protected]> Closes #12214 from hvanhovell/SPARK-12610.

### What changes were proposed in this pull request? This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms: - `[NOT] EXISTS(subquery)` - `[NOT] IN (subquery)` This PR is (loosely) based on the work of davies (#10706) and chenghao-intel (#9055). They should be credited for the work they did. ### How was this patch tested? Modified parsing unit tests. Added tests to `org.apache.spark.sql.SQLQuerySuite` cc rxin, davies & chenghao-intel Author: Herman van Hovell <[email protected]> Closes #12306 from hvanhovell/SPARK-4226.

imperio-wxm · 2016-04-25T01:43:22Z

Hi,I encountered a similar problem.(spark:1.5.2)

Subquery like this:

SELECT * FROM his_data_zadd WHERE value=(SELECT MAX(his_t.value) FROM his_data_zadd AS his_t)

Error code:

py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql.
: java.lang.RuntimeException: [1.49] failure: ``)'' expected but identifier MAX found

SELECT * FROM his_data_zadd WHERE value=(SELECT MAX(his_t.value) FROM his_data_zadd AS his_t)
                                                ^

How should I write correctly subquery?

davies · 2016-04-25T01:53:36Z

Spark 1.5 does not support subquery.

imperio-wxm · 2016-04-25T02:07:30Z

Thanks.

kamleshkarath · 2016-06-08T04:18:18Z

Hi Davies,

Could you please shed more light on the status of correlated but non-scalar subquery in Spark 2.0 release. Appreciate if you can summarize any other restrictions, if any.

Query:

Select
runon as runon,
case
when key in (Select key from sqltesttable where group = 'vowels') then 'vowels'
else 'consonants'
end as group,
key as key,
someint as someint
from sqltesttable;

Error:

Error in SQL statement: AnalysisException: Predicate sub-queries can only be used in a Filter: Project [runon#4031 AS runon#4026,CASE WHEN predicate-subquery#4027 [(key#4033 = key#4037)] THEN vowels ELSE consonants END AS group#4028,key#4033 AS key#4029,someint#4034 AS someint#4030]
: +- SubqueryAlias predicate-subquery#4027 [(key#4033 = key#4037)]
: +- Project [key#4037]
: +- Filter (group#4036 = vowels)
: +- MetastoreRelation default, sqltesttable, None
+- MetastoreRelation default, sqltesttable, None
;

davies · 2016-06-08T06:52:13Z

predicate subquery (IN, EXISTS) in SELECT is not supported in 2.0, only supported in WHERE/HAVING.

kamleshkarath · 2016-06-13T02:00:40Z

Thank you ! Any alternative options to use instead of predicate subquery ?
Is it in plan for amendment in 2.0?

Select
runon as runon,
case
when key in (Select key from sqltesttable where group = 'vowels') then 'vowels'
else 'consonants'
end as group,
key as key,
someint as someint
from sqltesttable;

hvanhovell · 2016-06-13T03:20:46Z

@kamalcoursera we are very close to a Spark 2.0 release. This will not be added.

However you could use a predicate scalar subquery here, i.e.:

select   runon as runon
         case
          when (select max(true) from sqltesttable b where b.key = a.key and group = 'vowels') then 'vowels'
          else 'consonants'
         end as group,
         key as key,
         someint as someint
from     sqltesttable a;

Davies Liu added 6 commits January 10, 2016 00:57

parse subquery

dfe501a

add left anti join

b855816

support exists / not exists

9918c0c

support in subquery

37f22e2

support correlated scalar subquery

fde2461

support scalar subquery

7ae30e1

improve corner cases

2d88bb3

update style

6809ad5

davies changed the title ~~[SPARK-12543] [SQL] [WIP] Subquery in expression~~ [SPARK-12543] [SPARK-4226] [SQL] [WIP] Subquery in expression Jan 13, 2016

Merge branch 'master' of github.com:apache/spark into parse_subquery

aa33df0

davies force-pushed the parse_subquery branch from cd14e20 to aa33df0 Compare January 13, 2016 07:13

chenghao-intel reviewed Jan 13, 2016
View reviewed changes

add more tests

d1feebd

davies changed the title ~~[SPARK-12543] [SPARK-4226] [SQL] [WIP] Subquery in expression~~ [SPARK-12543] [SPARK-4226] [SQL] Subquery in expression Jan 13, 2016

add Hive tests

56d140f

chenghao-intel mentioned this pull request Jan 14, 2016

[SPARK-12610][SQL] Add Anti join operators #10563

Closed

Davies Liu added 2 commits February 12, 2016 13:32

Merge branch 'master' of github.com:apache/spark into parse_subquery

a3fccab

davies closed this Feb 20, 2016

hvanhovell mentioned this pull request Apr 6, 2016

[SPARK-12610][SQL] Left Anti Join #12214

Closed

hvanhovell mentioned this pull request Apr 11, 2016

[SPARK-4226][SQL] Support IN/EXISTS Subqueries #12306

Closed

[SPARK-12543] [SPARK-4226] [SQL] Subquery in expression #10706

[SPARK-12543] [SPARK-4226] [SQL] Subquery in expression #10706

Uh oh!

Conversation

davies commented Jan 11, 2016

Uh oh!

SparkQA commented Jan 11, 2016

Uh oh!

SparkQA commented Jan 12, 2016

Uh oh!

SparkQA commented Jan 12, 2016

Uh oh!

SparkQA commented Jan 12, 2016

Uh oh!

chenghao-intel Jan 13, 2016

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jan 13, 2016

Uh oh!

SparkQA commented Jan 13, 2016

Uh oh!

SparkQA commented Jan 13, 2016

Uh oh!

davies commented Jan 13, 2016

Uh oh!

chenghao-intel commented Jan 14, 2016

Uh oh!

davies commented Jan 14, 2016

Uh oh!

davies commented Feb 11, 2016

Uh oh!

hvanhovell commented Feb 11, 2016

Uh oh!

SparkQA commented Feb 12, 2016

Uh oh!

rxin commented Feb 20, 2016

Uh oh!

davies commented Feb 20, 2016

Uh oh!

imperio-wxm commented Apr 25, 2016

Uh oh!

davies commented Apr 25, 2016

Uh oh!

imperio-wxm commented Apr 25, 2016

Uh oh!

kamleshkarath commented Jun 8, 2016

Uh oh!

davies commented Jun 8, 2016

Uh oh!

kamleshkarath commented Jun 13, 2016

Uh oh!

hvanhovell commented Jun 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

hvanhovell commented Jun 13, 2016 •

edited

Loading