-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
Is your feature request related to a problem or challenge?
For many of the same reasons as listed on #8045, having two types of aggregate functions ("built in" -- datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF is problematic for two reasons:
- There are some features not available to User Defined Aggregate Functions (such as the faster
GroupsAccumulator
interface) - Users can not easily choose which aggregate functions to include (for example, do they want to allow (and pay the code size / compile time) for the Statistical and Approximate functions
The second also ends up causing pushback on adding new aggregates like ARRAY_SUM
in #8325 and geospatial support #7859.
Describe the solution you'd like
I propose moving DataFusion to only use AggregateUDF
s and remove the built in list of AggregateFunctions for the same reasons as #8045
We will keep the existing AggregateUDF
interface as much as possible, while also potentially providing an easier way to define them.
New AggregateUDF is in functions-aggregate
crate
Old Aggregate functions are in datafusion/physical-expr/src/aggregate
Describe alternatives you've considered
Additional context
Proposed implementation steps:
- Use lowercase
name
for aggregate function #10695 - Implement trait based API for defining
AggregateUDF
#8710 - Support
GroupsAccumulator
accumulator for User Defined Functions #8793 - Support ORDER BY in AggregateUDF #8984
- Move
AggregateExpr
,PhysicalExpr
andPhysicalSortExpr
to physical-expr-core #9926 - Simplify API Add a
AggregateUDFImpl::simplfy()
API #9526 - Covariance Sample Move
Covariance
(Sample)covar
/covar_samp
to be a User Defined Aggregate Function #10372 - Covariance Population Move Covariance (Population) covar_pop to be a User Defined Aggregate Function #10418
- Count Move
Count
tofunctions-aggregate
, update MSRV to rust 1.75 #10484 - Median Move Median to
functions-aggregate
and Introduce Numeric signature #10644 - First and Last Make FirstValue an UDAF, Change
AggregateUDFImpl::accumulator
signature, support ORDER BY for UDAFs #9874 Convert first, last aggregate function to UDAF #10648 - Sum Introduce Sum UDAF #10651
- Expression API Introduce expr builder for aggregate function #10560
- Variance Sample Convert
Variance Sample
to UDAF #10667 - Variance Population Convert Variance Population to UDAF #10668
- Stddev Convert
stddev
to udaf #10827 - Approx Distinct Convert approx_distinct to UDAF #10851
- approx_median Convert
approx_median
to UDAF #10838 - approx_percentile_cont approx_percentile_cont_with_weight Convert
ApproxPercentileCont
and `ApproxPercentileContWithWeight
to UDAF #10870 - regr Convert
Regr
to UDAF #10883 - correlation Convert
Correlation
to UDAF #10884 - BitAnd / Or / Xor Convert
BitAnd
,BitOr
,BitXor
to UDAF #10907 - Average Convert
Average
to UDAF #10942 - Min Max Convert
Min
andMax
to UDAF #10943 - String agg Convert
StringAgg
to UDAF #10945 - ArrayAgg Convert
ArrayAgg
to UDAF #10999 - BoolAndOR Convert
BoolAndOr
to UDAF #11008 - Convert
Min
andMax
to UDAF #10943 - Remove special casting of
Min
/Max
built inAggregateFunctions
#11151
Move rust test to sqllogictest if possible #10384
Good first issue list
- grouping Convert
Grouping
to UDAF #10906
Pending
- nth_value
- array_agg_distinct
- array_agg_ordered
impl FromStr for AggregateFunction {
type Err = DataFusionError;
fn from_str(name: &str) -> Result<AggregateFunction> {
Ok(match name {
// general
"avg" => AggregateFunction::Avg,
"bit_and" => AggregateFunction::BitAnd,
"bit_or" => AggregateFunction::BitOr,
"bit_xor" => AggregateFunction::BitXor,
"bool_and" => AggregateFunction::BoolAnd,
"bool_or" => AggregateFunction::BoolOr,
"max" => AggregateFunction::Max,
"mean" => AggregateFunction::Avg,
"min" => AggregateFunction::Min,
"array_agg" => AggregateFunction::ArrayAgg,
"nth_value" => AggregateFunction::NthValue,
"string_agg" => AggregateFunction::StringAgg,
// statistical
"corr" => AggregateFunction::Correlation,
// other
"grouping" => AggregateFunction::Grouping,
_ => {
return plan_err!("There is no built-in function named {name}");
}
})
}
}
Feel free to file an issue if you are interested in working on any of the above in the pending list.