@cloud-fan
Contributor

This PR is based on #42022 to fix tests, as the PR author is on vacation.

What changes were proposed in this pull request?

This PR adds a new trait, CTEInChildren, and mixes it into the commands that should have WithCTE on top of their children (queries) rather than on top of the main query. It also modifies the CTESubstitution rule and removes the special handling of Commands and similar nodes. After these changes, Command, ParsedStatement and InsertIntoDir are handled the same way as other queries, by referring to CTE definitions. The only difference is that WithCTE is placed on top of the command's queries, not on top of the main query.
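To illustrate the plan shapes being described, here is a minimal, self-contained sketch. It is a toy model, not Catalyst code: `Plan`, `Query`, `CTEDef`, `InsertCommand` and `PlaceCTE.attach` are hypothetical names invented for this example, and only the names `CTEInChildren`, `WithCTE` and `withCTEDefs` are borrowed from the PR.

```scala
// Toy model only: these classes mimic the shape of logical plans
// but are NOT Spark's actual API.
sealed trait Plan
case class Query(sql: String) extends Plan
case class CTEDef(name: String, body: Plan)
case class WithCTE(plan: Plan, defs: Seq[CTEDef]) extends Plan

// Stand-in for the trait described above: a node that wants the
// CTE definitions attached to its children instead of to itself.
trait CTEInChildren extends Plan {
  def withCTEDefs(defs: Seq[CTEDef]): Plan
}

// Hypothetical command with a single query child (think INSERT ... SELECT).
case class InsertCommand(table: String, query: Plan) extends CTEInChildren {
  // The command stays the root node; only its query is wrapped.
  def withCTEDefs(defs: Seq[CTEDef]): Plan = copy(query = WithCTE(query, defs))
}

object PlaceCTE {
  // Commands get WithCTE on their queries, plain queries get it on top.
  def attach(plan: Plan, defs: Seq[CTEDef]): Plan = plan match {
    case c: CTEInChildren => c.withCTEDefs(defs)
    case other            => WithCTE(other, defs)
  }
}
```

Under these assumptions, `PlaceCTE.attach(InsertCommand("target", q), defs)` keeps `InsertCommand` as the root and yields `InsertCommand("target", WithCTE(q, defs))`, while a plain `Query` is simply wrapped in `WithCTE`.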

Closes #41922

Why are the changes needed?

To improve code maintainability. Right now the CTE resolution code path diverges: queries with commands go through the CTE inline code path, whereas non-command queries go through the CTEDef code path.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the new test:

$ build/sbt "test:testOnly *InsertSuite"

@github-actions github-actions bot added the SQL label Jul 17, 2023
// CTE normally and don't need to force inline.
!commands.head.isInstanceOf[CTEInChildren]
} else if (commands.length > 1) {
// This can happen with the multi-insert statement. We should fall back to
Member

Nit: There is duplicated logic here.
To make the code more readable, we can always collect the commands first. If there is exactly one command, the behavior depends on the legacy conf. Otherwise the outcome is already determined.
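One possible reading of this suggestion, as a sketch rather than the code that was merged: collect the commands once, then branch on how many there are. `ForceInlineDecision`, its parameter names, and the choice to model each command as a boolean ("does it mix in CTEInChildren?") are assumptions made to keep the example self-contained. The no-command branch returning `false` is likewise an assumption.

```scala
object ForceInlineDecision {
  // supportsCTEInChildren: one flag per collected command, standing in for
  //   commands.map(_.isInstanceOf[CTEInChildren])
  // legacyInline: stands in for the legacy conf being enabled.
  def forceInline(supportsCTEInChildren: Seq[Boolean], legacyInline: Boolean): Boolean =
    supportsCTEInChildren match {
      // No command at all: an ordinary query, nothing to force (assumed).
      case Seq() => false
      // Exactly one command: the only case where the legacy conf matters.
      case Seq(supports) => legacyInline || !supports
      // More than one command (multi-insert): always fall back to inlining.
      case _ => true
    }
}
```

With this shape the legacy conf is consulted in exactly one branch, which is the readability point the comment makes.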

* children. There are two reasons:
* 1. Some rules will pattern match the root command nodes, and we should keep command
* as the root node to not break them.
* 2. `Dataset` eagerly executes the commands inside a query plan. However, the CTE
Member

Nit: Shall we have an example for the eager execution?

@@ -0,0 +1,33 @@
-- WITH inside CTE
Member

QQ: Is there a case that would fail before the code change?

Contributor Author

No, but the analyzed plan is different, because we always inlined CTEs before.

@gengliangwang
Member

LGTM over

Contributor

@peter-toth left a comment

Looks good to me, just minor comments.

copy(child = newLeft, query = newRight)

override def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
withNewChildren(Seq(child, WithCTE(query, cteDefs)))
Contributor

Why not just copy(query = WithCTE(... like at other places?

Contributor Author

withNewChildren can copy over the tree node tags.

"legacy behavior which may produce incorrect results because Spark may evaluate a CTE " +
"relation more than once, even if it's nondeterministic.")
.booleanConf
.createWithDefault(false)
Contributor

Do we need version here?

Contributor Author

good catch!

*/
trait CTEInChildren extends LogicalPlan {
def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
withNewChildren(children.map(WithCTE(_, cteDefs)))
Contributor

I wonder if it makes sense to assert that we have only 1 child.

Contributor Author

It seems fine to have multiple children; we just duplicate the CTE relations. The current code does not allow it, though, and falls back to inlining the CTE.

Contributor

@peter-toth Jul 25, 2023

I'm not sure that it is always fine to duplicate CTE relations into multiple children.
For example, suppose we have a non-deterministic relation definition with one reference to it in each of 2 children of a CTEInChildren node. If we then duplicate the relation into both children, I think the InlineCTE rule will decide to inline the relation twice, which is not correct.
But I agree with you; I don't see how this could happen now.
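The concern can be shown with a toy model that does not use Spark at all. Everything here is hypothetical: `NonDeterministicDef` stands in for a non-deterministic CTE definition (something like `SELECT rand()`), modelled as a counter so that each evaluation returns a different row in a repeatable way.

```scala
object DuplicateCTEDemo {
  // Stand-in for a non-deterministic relation definition:
  // every evaluation produces a different row.
  final class NonDeterministicDef {
    private var evaluations = 0
    def evaluate(): Int = { evaluations += 1; evaluations }
    def timesEvaluated: Int = evaluations
  }

  // One shared definition: evaluated once, both references read the same row.
  def sharedReferences(d: NonDeterministicDef): (Int, Int) = {
    val row = d.evaluate()
    (row, row)
  }

  // Definition duplicated into two children and inlined in each of them:
  // it is evaluated twice, so the two references can disagree.
  def inlinedPerChild(d: NonDeterministicDef): (Int, Int) =
    (d.evaluate(), d.evaluate())
}
```

In this model `sharedReferences` always returns two equal values, while `inlinedPerChild` returns two different ones. That is the incorrect result described above.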

@cloud-fan
Contributor Author

the test failure is unrelated, I'm merging it to master, thanks for the review!

@cloud-fan cloud-fan closed this in da84f81 Jul 25, 2023
cloud-fan added a commit that referenced this pull request Jun 4, 2024
… original WithCTE node

### What changes were proposed in this pull request?

I noticed an outdated comment in the rule `InlineCTE`
```
      // CTEs in SQL Commands have been inlined by `CTESubstitution` already, so it is safe to add
      // WithCTE as top node here.
```

This is not true anymore after #42036. It's not a big deal as we replace not-inlined CTE relations with `Repartition` during optimization, so it doesn't matter where we put the `WithCTE` node with not-inlined CTE relations, as it will disappear eventually. But it's still better to keep it at its original place, as third-party rules may be sensitive to the plan shape.

### Why are the changes needed?

To preserve the plan shape as much as possible after inlining CTE relations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46617 from cloud-fan/cte.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
This PR is based on apache#42022 to fix
tests, as the PR author is on vacation.

### What changes were proposed in this pull request?
This PR adds a new trait, `CTEInChildren`, and mixes it into the
commands that should have `WithCTE` on top of their children (queries)
rather than on top of the main query. It also modifies the
`CTESubstitution` rule and removes the special handling of `Command`s
and similar nodes. After these changes, `Command`, `ParsedStatement` and
`InsertIntoDir` are handled the same way as other queries, by referring
to CTE definitions. The only difference is that `WithCTE` is placed on
top of the command's queries, not on top of the main query.

Closes apache#41922

### Why are the changes needed?
To improve code maintainability. Right now the CTE resolution code path
diverges: queries with commands go through the CTE inline code path,
whereas non-command queries go through the CTEDef code path.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the new test:
```
$ build/sbt "test:testOnly *InsertSuite"
```

Closes apache#42036 from cloud-fan/help.

Lead-authored-by: Max Gekk <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

(cherry picked from commit da84f81)

Co-authored-by: Max Gekk <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
### What changes were proposed in this pull request?
This PR adds the recursive query feature to Spark SQL.

A recursive query is defined using the `WITH RECURSIVE` keywords and by referring to the name of the common table expression within the query.
The implementation complies with the SQL standard and follows rules similar to those of other relational databases:
- A query is made of an anchor term followed by a recursive term.
- The anchor term doesn't contain a self-reference and is used to initialize the query.
- The recursive term contains a self-reference and is used to expand the current set of rows with new ones.
- The anchor and recursive terms must be combined with each other by `UNION` or `UNION ALL` operators.
- New rows can only be derived from the newly added rows of the previous iteration (or from the initial set of rows of the anchor term). This limitation implies that recursive references can't be used with some joins, aggregations or subqueries.
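The evaluation rules above can be sketched on plain Scala collections. This is only an illustration of the iteration scheme under `UNION ALL` semantics, not the Spark implementation: `RecursiveCTESketch`, `evaluate` and the `maxIterations` safety limit are hypothetical names and parameters chosen for this example.

```scala
object RecursiveCTESketch {
  // anchor:        rows produced by the anchor term (no self-reference).
  // recursiveTerm: maps the rows added in the previous iteration to new rows.
  // maxIterations: assumed safety limit so a non-terminating query stops.
  def evaluate[A](anchor: Seq[A], maxIterations: Int = 100)(
      recursiveTerm: Seq[A] => Seq[A]): Seq[A] = {
    var result = anchor // everything produced so far (UNION ALL, no dedup)
    var delta = anchor  // only the rows added by the previous iteration
    var i = 0
    while (delta.nonEmpty && i < maxIterations) {
      // New rows are derived only from the newly added rows.
      delta = recursiveTerm(delta)
      result = result ++ delta
      i += 1
    }
    result
  }
}
```

For example, `RecursiveCTESketch.evaluate(Seq(1))(rows => rows.map(_ + 1).filter(_ <= 5))` starts from the anchor row `1` and stops once the recursive term yields no rows, producing `Seq(1, 2, 3, 4, 5)`.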

Please see `cte-recursion.sql` for some examples.

The implementation has the same limitation that [SPARK-36447](https://issues.apache.org/jira/browse/SPARK-36447) / apache#33671 has:

> With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.

which means that recursive queries are not supported in SQL commands and DMLs.
With apache#42036 this restriction is lifted, and a recursive CTE fails only when the CTE is force-inlined (`spark.sql.legacy.inlineCTEInCommands=true` or the command is a multi-insert statement).

### Why are the changes needed?
Recursive queries are an ANSI SQL feature that is useful for processing hierarchical data.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the recursive query feature.

### How was this patch tested?
Added new UTs and tests in `cte-recursion.sql`.