Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Jan 11, 2016

This PR brings subquery for expression (use subquery as an expression inside SELECT/WHERE/HAVING), for example:

  select (select 1) as a
  select * from t where a = (select max(b) from t2)
  select * from t where exists (select * from t2 where t2.b = t.b)
  select max(a) as ma from t having ma = (select max(b) from t2)

A subquery could be uncorrelated or correlated, it could be scalar subquery (returns single row and single column) or not (returns multiple rows, for EXISTS or IN).

A correlated subquery or uncorrelated subquery that returns multiple rows will be rewritten as JOIN. Scalar subquery will be executed separately, result will be filled into the expression.

Restrictions:

  1. EXISTS/ NOT EXISTS/IN/NOT IN can only be used as top level condition of WHERE/HAVING, can't be used in OR or others.
  2. a NOT IN subquery is not supported, if subquery is nullable, which requires a null-aware semi join.
  3. Correlated subquery can only be used inside WHERE/HAVING, uncorrelated scalar subquery could be used anywhere.

@SparkQA
Copy link

SparkQA commented Jan 11, 2016

Test build #49172 has finished for PR 10706 at commit 7ae30e1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 12, 2016

Test build #2367 has finished for PR 10706 at commit 7ae30e1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 12, 2016

Test build #49232 has finished for PR 10706 at commit 2d88bb3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 12, 2016

Test build #49247 has finished for PR 10706 at commit cd14e20.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PageView(val url: String, val status: Int, val zipCode: Int, val userID: Int)
    • class FlumeEventServer(receiver: FlumeReceiver) extends AvroSourceProtocol
    • class UnsafeCartesianRDD(left: RDD[UnsafeRow], right: RDD[UnsafeRow], numFieldsOfRight: Int)
    • case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)
    • class ObjectInputStreamWithLoader(_inputStream: InputStream, loader: ClassLoader)
    • class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
    • abstract class InputDStream[T: ClassTag] (_ssc: StreamingContext)
    • abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)

@davies davies changed the title [SPARK-12543] [SQL] [WIP] Subquery in expression [SPARK-12543] [SPARK-4226] [SQL] [WIP] Subquery in expression Jan 13, 2016
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually what I am struggling, if the join key is null, then will we let's go into the anti join result?

@SparkQA
Copy link

SparkQA commented Jan 13, 2016

Test build #2374 has finished for PR 10706 at commit aa33df0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PageView(val url: String, val status: Int, val zipCode: Int, val userID: Int)
    • class FlumeEventServer(receiver: FlumeReceiver) extends AvroSourceProtocol
    • class UnsafeCartesianRDD(left: RDD[UnsafeRow], right: RDD[UnsafeRow], numFieldsOfRight: Int)
    • case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)
    • class ObjectInputStreamWithLoader(_inputStream: InputStream, loader: ClassLoader)
    • class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
    • abstract class InputDStream[T: ClassTag] (_ssc: StreamingContext)
    • abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)

@SparkQA
Copy link

SparkQA commented Jan 13, 2016

Test build #49302 has finished for PR 10706 at commit aa33df0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PageView(val url: String, val status: Int, val zipCode: Int, val userID: Int)
    • class FlumeEventServer(receiver: FlumeReceiver) extends AvroSourceProtocol
    • class UnsafeCartesianRDD(left: RDD[UnsafeRow], right: RDD[UnsafeRow], numFieldsOfRight: Int)
    • case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)
    • class ObjectInputStreamWithLoader(_inputStream: InputStream, loader: ClassLoader)
    • class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
    • abstract class InputDStream[T: ClassTag] (_ssc: StreamingContext)
    • abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)

@davies davies changed the title [SPARK-12543] [SPARK-4226] [SQL] [WIP] Subquery in expression [SPARK-12543] [SPARK-4226] [SQL] Subquery in expression Jan 13, 2016
@SparkQA
Copy link

SparkQA commented Jan 13, 2016

Test build #49306 has finished for PR 10706 at commit d1feebd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor Author

davies commented Jan 13, 2016

@chenghao-intel Since you also worked on this topic, could you take a look a this?

It's good to turn on those Hive compatibility tests, but I'm stucked by how to generated the golden results. I tried to pull in yours in #9055, but unfortunately some of those queries are not in good state (missing spaces?).

@chenghao-intel
Copy link
Contributor

Yes, I can build a PR against your branch. :), but for anti join, I have another PR #10563, some of my concerns are being discussed, one key issue is about the null join key, I think we probably need to filter out the null in predicate expression, and it will be great if we can figure out that issue first.

@davies
Copy link
Contributor Author

davies commented Jan 14, 2016

@chenghao-intel This PR does not have a null-aware anti join, it will fail some query when analyzing, we could add that later.

@davies
Copy link
Contributor Author

davies commented Feb 11, 2016

@hvanhovell Can you help to look at this one? I'd like to split this out as small PRs, hopefully we can merge part of them into 2.0.

@hvanhovell
Copy link
Contributor

@davies sure, no problem.

Davies Liu added 2 commits February 12, 2016 13:32
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala
@SparkQA
Copy link

SparkQA commented Feb 12, 2016

Test build #51206 has finished for PR 10706 at commit a3fccab.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansModel(JavaModel, MLWritable, MLReadable):
    • class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol, HasSeed,
    • class BisectingKMeansModel(JavaModel):
    • class BisectingKMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasSeed):
    • class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, HasRegParam, HasSeed,
    • class ALSModel(JavaModel, MLWritable, MLReadable):

@rxin
Copy link
Contributor

rxin commented Feb 20, 2016

Can we close this first, and create a new one when we get to correlated subqueries?

@davies
Copy link
Contributor Author

davies commented Feb 20, 2016

I'd like to keep this open so I can easily find this branch.

@davies davies closed this Feb 20, 2016
asfgit pushed a commit that referenced this pull request Apr 7, 2016
### What changes were proposed in this pull request?

This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys.

We currently add support for the following SQL join syntax:

    SELECT   *
    FROM      tbl1 A
              LEFT ANTI JOIN tbl2 B
               ON A.Id = B.Id

Or using a dataframe:

    tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti)

This PR provides serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as good basis for implementing an more efficient `EXCEPT` operator.

The PR has been (losely) based on PR's by both davies (#10706) and chenghao-intel (#10563); credit should be given where credit is due.

This PR adds supports for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including codegeneration), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`.

### How was this patch tested?

Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from #10563.

cc davies chenghao-intel rxin

Author: Herman van Hovell <[email protected]>

Closes #12214 from hvanhovell/SPARK-12610.
asfgit pushed a commit that referenced this pull request Apr 19, 2016
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (#10706) and chenghao-intel (#9055). They should be credited for the work they did.

### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <[email protected]>

Closes #12306 from hvanhovell/SPARK-4226.
@imperio-wxm
Copy link

Hi,I encountered a similar problem.(spark:1.5.2)

Subquery like this:

SELECT * FROM his_data_zadd WHERE value=(SELECT MAX(his_t.value) FROM his_data_zadd AS his_t)

Error code:

py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql.
: java.lang.RuntimeException: [1.49] failure: ``)'' expected but identifier MAX found

SELECT * FROM his_data_zadd WHERE value=(SELECT MAX(his_t.value) FROM his_data_zadd AS his_t)
                                                ^

How should I write correctly subquery?

@davies
Copy link
Contributor Author

davies commented Apr 25, 2016

Spark 1.5 does not support subquery.

@imperio-wxm
Copy link

Thanks.

@kamleshkarath
Copy link

Hi Davies,

Could you please shed more light on the status of correlated but non-scalar subquery in Spark 2.0 release. Appreciate if you can summarize any other restrictions, if any.

Query:

Select
runon as runon,
case
when key in (Select key from sqltesttable where group = 'vowels') then 'vowels'
else 'consonants'
end as group,
key as key,
someint as someint
from sqltesttable;

Error:

Error in SQL statement: AnalysisException: Predicate sub-queries can only be used in a Filter: Project [runon#4031 AS runon#4026,CASE WHEN predicate-subquery#4027 [(key#4033 = key#4037)] THEN vowels ELSE consonants END AS group#4028,key#4033 AS key#4029,someint#4034 AS someint#4030]
: +- SubqueryAlias predicate-subquery#4027 [(key#4033 = key#4037)]
: +- Project [key#4037]
: +- Filter (group#4036 = vowels)
: +- MetastoreRelation default, sqltesttable, None
+- MetastoreRelation default, sqltesttable, None
;

@davies
Copy link
Contributor Author

davies commented Jun 8, 2016

predicate subquery (IN, EXISTS) in SELECT is not supported in 2.0, only supported in WHERE/HAVING.

@kamleshkarath
Copy link

Thank you ! Any alternative options to use instead of predicate subquery ?
Is it in plan for amendment in 2.0?

Select
runon as runon,
case
when key in (Select key from sqltesttable where group = 'vowels') then 'vowels'
else 'consonants'
end as group,
key as key,
someint as someint
from sqltesttable;

@hvanhovell
Copy link
Contributor

hvanhovell commented Jun 13, 2016

@kamalcoursera we are very close to a Spark 2.0 release. This will not be added.

However you could use a predicate scalar subquery here, i.e.:

select   runon as runon
         case
          when (select max(true) from sqltesttable b where b.key = a.key and group = 'vowels') then 'vowels'
          else 'consonants'
         end as group,
         key as key,
         someint as someint
from     sqltesttable a;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants