@allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR allows the Project node to host outer references in scalar subqueries when decorrelateInnerQuery is enabled. It is already supported by the new decorrelation framework and the RewriteCorrelatedScalarSubquery rule.

Note that currently, by default, all correlated subqueries are decorrelated, which is not necessarily optimal. Consider SELECT (SELECT c1) FROM t. This should be optimized to SELECT c1 FROM t instead of being rewritten as a left outer join. That optimization will be done in a separate PR that optimizes correlated scalar/lateral subqueries with OneRowRelation.
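
To make the equivalence above concrete, here is a minimal sketch that compares the three query shapes on a tiny table. It is an illustration under stated assumptions, not Spark's implementation: SQLite (via Python's standard `sqlite3` module) stands in for Spark, the two rows in `t` are chosen to match the example output in this description, and the left outer join query is a hand-written approximation of a decorrelated form, not the plan Spark actually produces.

```python
import sqlite3

# Assumption: SQLite stands in for Spark here, only to compare query results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(0,), (1,)])


def run(sql):
    """Return the single output column of `sql`, sorted for stable comparison."""
    return sorted(row[0] for row in conn.execute(sql))


# Correlated form: the subquery's SELECT list references the outer column c1.
correlated = run("SELECT (SELECT c1) FROM t")

# Simplified form that the description says the query should be optimized to.
simplified = run("SELECT c1 FROM t")

# Hand-written approximation of a join-based rewrite (hypothetical shape):
# join the outer rows to the distinct outer values the subquery depends on.
# `IS` is SQLite's null-safe equality operator.
joined = run(
    """
    SELECT d.c1
    FROM t
    LEFT OUTER JOIN (SELECT DISTINCT c1 FROM t) AS d
      ON d.c1 IS t.c1
    """
)

print(correlated, simplified, joined)
```

All three queries return the same rows on this data, which is why the join is unnecessary work for this particular query shape.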

Why are the changes needed?

To allow more types of correlated scalar subqueries.

Does this PR introduce any user-facing change?

Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:

SELECT (SELECT c1) FROM t;

Before this change:

org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported 
outside of WHERE/HAVING clauses

After this change:

+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+

How was this patch tested?

Added unit tests and SQL tests.

@github-actions github-actions bot added the SQL label Jul 6, 2021
@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45228/

@SparkQA

SparkQA commented Jul 7, 2021

Test build #140717 has finished for PR 33235 at commit 9577d90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@allisonwang-db
Contributor Author

cc @cloud-fan

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ca348e5 Jul 7, 2021
allisonwang-db added a commit to allisonwang-db/spark that referenced this pull request Jul 27, 2021
…ubqueries

This PR allows the `Project` node to host outer references in scalar subqueries when `decorrelateInnerQuery` is enabled. It is already supported by the new decorrelation framework and the `RewriteCorrelatedScalarSubquery` rule.

Note that currently, by default, all correlated subqueries are decorrelated, which is not necessarily optimal. Consider `SELECT (SELECT c1) FROM t`. This should be optimized to `SELECT c1 FROM t` instead of being rewritten as a left outer join. That optimization will be done in a separate PR that optimizes correlated scalar/lateral subqueries with OneRowRelation.

To allow more types of correlated scalar subqueries.

Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:
```sql
SELECT (SELECT c1) FROM t;
```
Before this change:
```
org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported
outside of WHERE/HAVING clauses
```

After this change:
```
+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+
```

Added unit tests and SQL tests.

Closes apache#33235 from allisonwang-db/spark-36028-outer-in-project.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ca348e5)
Signed-off-by: allisonwang-db <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 28, 2021
…lar subqueries

This PR cherry picks #33235 to branch-3.2 to fix test failures introduced by #33284.

### What changes were proposed in this pull request?
This PR allows the `Project` node to host outer references in scalar subqueries when `decorrelateInnerQuery` is enabled. It is already supported by the new decorrelation framework and the `RewriteCorrelatedScalarSubquery` rule.

Note that currently, by default, all correlated subqueries are decorrelated, which is not necessarily optimal. Consider `SELECT (SELECT c1) FROM t`. This should be optimized to `SELECT c1 FROM t` instead of being rewritten as a left outer join. That optimization will be done in a separate PR that optimizes correlated scalar/lateral subqueries with OneRowRelation.

### Why are the changes needed?
To allow more types of correlated scalar subqueries.

### Does this PR introduce _any_ user-facing change?
Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:
```sql
SELECT (SELECT c1) FROM t;
```
Before this change:
```
org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported
outside of WHERE/HAVING clauses
```

After this change:
```
+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+
```

### How was this patch tested?
Added unit tests and SQL tests.

(cherry picked from commit ca348e5)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>

Closes #33527 from allisonwang-db/spark-36028-3.2.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@allisonwang-db allisonwang-db deleted the spark-36028-outer-in-project branch January 19, 2024 01:22