Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

We found two analyzer rule execution order issues in our internal workloads:

  • CreateStruct.apply creates NamePlaceholder for unresolved NamedExpression. However, with certain rule execution order, the NamedExpression may be removed (e.g. remove unnecessary Alias) before NamePlaceholder is resolved, then NamePlaceholder can't be resolved anymore.
  • UNPIVOT uses UnresolvedAlias to wrap UnresolvedAttribute. There is a conflict about how to determine the final alias name. If ResolveAliases runs first, then UnresolvedAlias will be removed and eventually the alias will be b for nested column a.b. If ResolveReferences runs first, then we resolve a.b first and then UnresolvedAlias will determine the alias as a.b not b.

This PR fixes the two issues

  • CreateStruct.apply should determine the field name immediately if the input is Alias
  • The parser rule for UNPIVOT should follow how we parse SELECT and return UnresolvedAttribute directly without the UnresolvedAlias wrapper. It's a bit risky to fix the order issue between ResolveAliases and ResolveReferences as it can change the final query schema, we will save it for later.

Why are the changes needed?

fix unstable analyzer behavior with different rule execution orders.

Does this PR introduce any user-facing change?

Yes, some failed queries can run now. The issue for UNPIVOT only affects the error message.

How was this patch tested?

verified by our internal workloads. The repro query is quite complicated to trigger a certain rule execution order so we won't add tests for it. The fix is quite obvious.

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Mar 26, 2024
@cloud-fan
Copy link
Contributor Author

cc @viirya @gengliangwang

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @cloud-fan .

@dongjoon-hyun
Copy link
Member

According to the JIRA Affected Version (4.0.0), I merged this to master branch only. Please make backporting PRs if we need this in release branches.

@viirya
Copy link
Member

viirya commented Mar 26, 2024

Looks good to me.

cloud-fan added a commit that referenced this pull request Mar 26, 2024
### What changes were proposed in this pull request?

We found two analyzer rule execution order issues in our internal workloads:
- `CreateStruct.apply` creates `NamePlaceholder` for unresolved `NamedExpression`. However, with certain rule execution order, the `NamedExpression` may be removed (e.g. remove unnecessary `Alias`) before `NamePlaceholder` is resolved, then `NamePlaceholder` can't be resolved anymore.
- UNPIVOT uses `UnresolvedAlias` to wrap `UnresolvedAttribute`. There is a conflict about how to determine the final alias name. If `ResolveAliases` runs first, then `UnresolvedAlias` will be removed and eventually the alias will be `b` for nested column `a.b`. If `ResolveReferences` runs first, then we resolve `a.b` first and then `UnresolvedAlias` will determine the alias as `a.b` not `b`.

This PR fixes the two issues
- `CreateStruct.apply` should determine the field name immediately if the input is `Alias`
- The parser rule for UNPIVOT should follow how we parse SELECT and return `UnresolvedAttribute` directly without the `UnresolvedAlias` wrapper. It's a bit risky to fix the order issue between `ResolveAliases` and `ResolveReferences` as it can change the final query schema, we will save it for later.

### Why are the changes needed?

fix unstable analyzer behavior with different rule execution orders.

### Does this PR introduce _any_ user-facing change?

Yes, some failed queries can run now. The issue for UNPIVOT only affects the error message.

### How was this patch tested?

verified by our internal workloads. The repro query is quite complicated to trigger a certain rule execution order so we won't add tests for it. The fix is quite obvious.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45718 from cloud-fan/rule.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@cloud-fan
Copy link
Contributor Author

I backported it to 3.5 directly since there is no merge conflicts. Since it's very hard to hit this bug, I didn't backport it further. I've updated the JIRA ticket as well.

sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
### What changes were proposed in this pull request?

We found two analyzer rule execution order issues in our internal workloads:
- `CreateStruct.apply` creates `NamePlaceholder` for unresolved `NamedExpression`. However, with certain rule execution order, the `NamedExpression` may be removed (e.g. remove unnecessary `Alias`) before `NamePlaceholder` is resolved, then `NamePlaceholder` can't be resolved anymore.
- UNPIVOT uses `UnresolvedAlias` to wrap `UnresolvedAttribute`. There is a conflict about how to determine the final alias name. If `ResolveAliases` runs first, then `UnresolvedAlias` will be removed and eventually the alias will be `b` for nested column `a.b`. If `ResolveReferences` runs first, then we resolve `a.b` first and then `UnresolvedAlias` will determine the alias as `a.b` not `b`.

This PR fixes the two issues
- `CreateStruct.apply` should determine the field name immediately if the input is `Alias`
- The parser rule for UNPIVOT should follow how we parse SELECT and return `UnresolvedAttribute` directly without the `UnresolvedAlias` wrapper. It's a bit risky to fix the order issue between `ResolveAliases` and `ResolveReferences` as it can change the final query schema, we will save it for later.

### Why are the changes needed?

fix unstable analyzer behavior with different rule execution orders.

### Does this PR introduce _any_ user-facing change?

Yes, some failed queries can run now. The issue for UNPIVOT only affects the error message.

### How was this patch tested?

verified by our internal workloads. The repro query is quite complicated to trigger a certain rule execution order so we won't add tests for it. The fix is quite obvious.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45718 from cloud-fan/rule.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…#377)

### What changes were proposed in this pull request?

We found two analyzer rule execution order issues in our internal workloads:
- `CreateStruct.apply` creates `NamePlaceholder` for unresolved `NamedExpression`. However, with certain rule execution order, the `NamedExpression` may be removed (e.g. remove unnecessary `Alias`) before `NamePlaceholder` is resolved, then `NamePlaceholder` can't be resolved anymore.
- UNPIVOT uses `UnresolvedAlias` to wrap `UnresolvedAttribute`. There is a conflict about how to determine the final alias name. If `ResolveAliases` runs first, then `UnresolvedAlias` will be removed and eventually the alias will be `b` for nested column `a.b`. If `ResolveReferences` runs first, then we resolve `a.b` first and then `UnresolvedAlias` will determine the alias as `a.b` not `b`.

This PR fixes the two issues
- `CreateStruct.apply` should determine the field name immediately if the input is `Alias`
- The parser rule for UNPIVOT should follow how we parse SELECT and return `UnresolvedAttribute` directly without the `UnresolvedAlias` wrapper. It's a bit risky to fix the order issue between `ResolveAliases` and `ResolveReferences` as it can change the final query schema, we will save it for later.

### Why are the changes needed?

fix unstable analyzer behavior with different rule execution orders.

### Does this PR introduce _any_ user-facing change?

Yes, some failed queries can run now. The issue for UNPIVOT only affects the error message.

### How was this patch tested?

verified by our internal workloads. The repro query is quite complicated to trigger a certain rule execution order so we won't add tests for it. The fix is quite obvious.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45718 from cloud-fan/rule.

Authored-by: Wenchen Fan <[email protected]>

Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants