[SPARK-49004][CONNECT] Use separate registry for Column API internal functions #47572
}
}

private object PartitionTransform {
We could remove the actual expressions, they serve no real purpose.
if (name.length == 1) {
u: UnresolvedFunction): Option[Expression] = {
if (name.size == 1 && u.isInternal) {
  Option(FunctionRegistry.internal.lookupFunction(FunctionIdentifier(name.head), arguments))
When a function is looked up in the internal namespace, we use only that namespace; we do not fall back to anything else.
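To make the no-fallback behavior concrete, here is a minimal sketch in plain Scala. It is not the Spark implementation: `SimpleRegistry`, `resolve`, and the two sample functions are hypothetical stand-ins, and a "function" is reduced to `Seq[Int] => Int` instead of a Catalyst expression builder.

```scala
// Simplified model, not Spark code. All names here are hypothetical.
object StrictLookupSketch {
  type Builder = Seq[Int] => Int

  final class SimpleRegistry(functions: Map[String, Builder]) {
    // Returns None when the name is unknown to this registry.
    def lookup(name: String, args: Seq[Int]): Option[Int] =
      functions.get(name).map(build => build(args))
  }

  val builtin = new SimpleRegistry(Map("sum" -> ((args: Seq[Int]) => args.sum)))
  val internal = new SimpleRegistry(Map("product" -> ((args: Seq[Int]) => args.product)))

  // An internal call consults only the internal registry; a miss stays a miss
  // and is never retried against the other registry.
  def resolve(name: String, args: Seq[Int], isInternal: Boolean): Option[Int] =
    if (isInternal) internal.lookup(name, args) else builtin.lookup(name, args)
}
```

In this model, `resolve("sum", Seq(2, 3), isInternal = true)` yields `None` even though `sum` exists in the other registry, which mirrors the intent stated in the comment above.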
// registered in the internal function registry, and we reroute the lookup to the internal
// registry.
val name = fun.getFunctionName
val internal = FunctionRegistry.internal.functionExists(FunctionIdentifier(name))
I want to move Connect's internal method invocation to a special namespace. That is a bit cheaper than doing this lookup. I will do this in a follow-up.
// registered in the internal function registry, and we reroute the lookup to the internal
// registry.
val name = fun.getFunctionName
val internal = FunctionRegistry.internal.functionExists(FunctionIdentifier(name))
I'm curious about which option is cleaner: 1) do an existence check up front here, or 2) have the internal function lookup fall back to the normal function lookup.
@hvanhovell thoughts?
I don't want to fall back because I don't want these internal functions to clash with any UDF the user specifies. If you want an internal function, you will get an internal function. In a follow-up I want to use a special prefix for Connect internal functions, and probably add a conf controlling this lookup behavior (disabling it for newer clients).
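The existence-check option discussed in this thread can be sketched as follows. This is a toy model under stated assumptions, not the Connect planner: `RerouteSketch`, `Call`, `plan`, `resolve`, and the function name `internal_fn` are all hypothetical, and each registry is reduced to a map from a name to a label.

```scala
// Toy model of "check the internal registry first, then flag the call".
// Not Spark code; every name below is hypothetical.
object RerouteSketch {
  import scala.collection.mutable

  // Stand-ins for the two registries: function name -> label of the implementation.
  val internal: Map[String, String] = Map("internal_fn" -> "internal implementation")
  val global: mutable.Map[String, String] = mutable.Map("upper" -> "built-in implementation")

  final case class Call(name: String, isInternal: Boolean)

  // Planner side: flag the call as internal when the name exists in the internal registry.
  def plan(name: String): Call = Call(name, isInternal = internal.contains(name))

  // Analyzer side: each namespace is searched on its own, with no fallback.
  def resolve(call: Call): Option[String] =
    if (call.isInternal) internal.get(call.name) else global.get(call.name)
}
```

In this model, registering a user function under the name `internal_fn` in `global` does not change what `resolve(plan("internal_fn"))` returns: the flagged call still gets the internal entry, while a call with `isInternal = false` gets the user's entry.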
ignoreNulls: Boolean = false,
orderingWithinGroup: Seq[SortOrder] = Seq.empty)
orderingWithinGroup: Seq[SortOrder] = Seq.empty,
isInternal: Boolean = false)
UnresolvedFunction is an internal class, but I'm wondering if it's safer to add ExtendedUnresolvedFunction so that we can keep backward compatibility for UnresolvedFunction which might be used by custom catalyst rules?
I think it is fine for two reasons:
- It should be very rare to add an internal function. As a 3p extension developer you generally want the SQL and DataFrame surfaces to match.
- A 3p developer should not really be creating these UnresolvedFunctions. This should be done using the Column API, and indirectly through the Spark API.
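For readers weighing the compatibility question raised above, here is a small sketch of what adding a defaulted field to a Scala case class does and does not preserve. `Fn` is a hypothetical, trimmed-down stand-in and not the real `UnresolvedFunction`; only the general Scala behavior is being illustrated.

```scala
// Hypothetical stand-in for a case class that gains a defaulted field.
object DefaultParamSketch {
  final case class Fn(
      name: String,
      isDistinct: Boolean = false,
      isInternal: Boolean = false) // newly added field with a default value

  // Constructor calls written before the field existed still compile
  // once the calling code is recompiled.
  val oldStyle: Fn = Fn("count", isDistinct = true)

  // Positional patterns must list every field, so an older two-field
  // pattern `case Fn(n, d)` would no longer compile; this is the updated form.
  def describe(f: Fn): String = f match {
    case Fn(n, _, true)  => s"internal:$n"
    case Fn(n, _, false) => s"public:$n"
  }
}
```

So code that only constructs the class is largely unaffected at the source level, whereas code that pattern matches positionally on it (as custom rules might) needs a one-argument update.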
Merging.
What changes were proposed in this pull request?
This PR introduces a separate FunctionRegistry for functions used by the Column API that should not be exposed in the global function namespace. This internal registry is only used when the UnresolvedFunction has the isInternal flag set to true.
Why are the changes needed?
We want to create a Column API shared by the Classic and Connect Scala Clients. This requires that we fully decouple the Column API from Catalyst. A part of this work is decoupling function resolution.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.