Skip to content

Conversation

@hvanhovell
Copy link
Contributor

@hvanhovell hvanhovell commented Aug 1, 2024

What changes were proposed in this pull request?

This PR introduces a separate FunctionRegistry for functions used by the Column API that should not be exposed in the global function namespace. This internal registry is only used when then the UnresolvedFunction has the isInternal flag set to true.

Why are the changes needed?

We want to create a Column API shared by the Classic and Connect Scala Clients. This requires that we fully decouple the Column API from Catalyst. A part of this work is decoupling function resolution.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Aug 1, 2024
@hvanhovell hvanhovell requested a review from cloud-fan August 1, 2024 20:52
@hvanhovell hvanhovell requested a review from HyukjinKwon August 1, 2024 20:52
@HyukjinKwon
Copy link
Member

cc @zhengruifeng

@github-actions github-actions bot removed the CONNECT label Aug 5, 2024
}
}

private object PartitionTransform {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove the actual expressions, they serve no real purpose.

if (name.length == 1) {
u: UnresolvedFunction): Option[Expression] = {
if (name.size == 1 && u.isInternal) {
Option(FunctionRegistry.internal.lookupFunction(FunctionIdentifier(name.head), arguments))
Copy link
Contributor Author

@hvanhovell hvanhovell Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use the internal namespace we use that, and we do not fallback to something else.

// registered in the internal function registry, and we reroute the lookup to the internal
// registry.
val name = fun.getFunctionName
val internal = FunctionRegistry.internal.functionExists(FunctionIdentifier(name))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to move connect's internal method invocation to a special namespace. That is a bit cheaper than doing this lookup. I will do this in a follow-up.

@hvanhovell hvanhovell changed the title [SPARK-49004][CONNECT] Register Column API internal functions in separate namespace [SPARK-49004][CONNECT] Use separate registry for Column API internal functions Aug 5, 2024
// registered in the internal function registry, and we reroute the lookup to the internal
// registry.
val name = fun.getFunctionName
val internal = FunctionRegistry.internal.functionExists(FunctionIdentifier(name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about which option is cleaner: 1) do an existence check ahead here. 2) fallback internal function lookup to normal function lookup

@hvanhovell thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to fallback because I don't want these internal functions to clash with any UDF the user specifies. If you want an internal function you will get an internal function. In a follow-up I want to use a special prefix for connect internal functions, and probably add a conf controlling this lookup behavior (disabling it for newer clients).

ignoreNulls: Boolean = false,
orderingWithinGroup: Seq[SortOrder] = Seq.empty)
orderingWithinGroup: Seq[SortOrder] = Seq.empty,
isInternal: Boolean = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnresolvedFunction is an internal class, but I'm wondering if it's safer to add ExtendedUnresolvedFunction so that we can keep backward compatibility for UnresolvedFunction which might be used by custom catalyst rules?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine for two reasons:

  1. It should be very rare to add an internal function. As a 3p extension developer you generally want the SQL Dataframe surface to match.
  2. A 3p developer should not really be creating these UnresolvedFunctions. This should be done using the Column API, and indirectly through the Spark API.

@hvanhovell
Copy link
Contributor Author

Merging.

@asfgit asfgit closed this in 0346b18 Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants