[Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunctions), improve the system #8708

@alamb

Is your feature request related to a problem or challenge?

As with the reasons listed in #8045, having two kinds of aggregate functions, the "built in" ones (datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF, is problematic in two ways:

  1. There are some features not available to User Defined Aggregate Functions (such as the faster GroupsAccumulator interface)
  2. Users cannot easily choose which aggregate functions to include (for example, whether they want to allow, and pay the code size / compile time cost for, the Statistical and Approximate functions)

The second point also causes pushback against adding new aggregates such as ARRAY_SUM in #8325 and geospatial support in #7859.
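
To illustrate the first point, here is a minimal std-only sketch of the difference between a per-group accumulator and a GroupsAccumulator-style interface. The traits below are simplified stand-ins written for this example (plain `f64` slices instead of Arrow arrays, no filters or null handling); they are not the actual DataFusion definitions, and the function names are hypothetical.

```rust
// Simplified model for illustration only; NOT the DataFusion traits.

/// Row-oriented style: one accumulator instance holds the state of ONE group.
trait Accumulator {
    fn update(&mut self, value: f64);
    fn evaluate(&self) -> f64;
}

#[derive(Default)]
struct SumAccumulator {
    sum: f64,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value;
    }
    fn evaluate(&self) -> f64 {
        self.sum
    }
}

/// Vectorized style: one instance holds the state of EVERY group.
trait GroupsAccumulator {
    /// `group_indices[i]` is the group that `values[i]` belongs to.
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize);
    fn evaluate(&self) -> Vec<f64>;
}

#[derive(Default)]
struct SumGroupsAccumulator {
    sums: Vec<f64>,
}

impl GroupsAccumulator for SumGroupsAccumulator {
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize) {
        assert_eq!(values.len(), group_indices.len());
        // Grow the flat state vector only when new groups appear.
        if self.sums.len() < total_num_groups {
            self.sums.resize(total_num_groups, 0.0);
        }
        for (&value, &group) in values.iter().zip(group_indices) {
            self.sums[group] += value;
        }
    }
    fn evaluate(&self) -> Vec<f64> {
        self.sums.clone()
    }
}

/// Per-group sums using one boxed accumulator (and one dynamic call per row) per group.
fn sum_per_group_rowwise(values: &[f64], group_indices: &[usize], num_groups: usize) -> Vec<f64> {
    let mut accumulators: Vec<Box<dyn Accumulator>> = (0..num_groups)
        .map(|_| Box::new(SumAccumulator::default()) as Box<dyn Accumulator>)
        .collect();
    for (&value, &group) in values.iter().zip(group_indices) {
        accumulators[group].update(value);
    }
    accumulators.iter().map(|acc| acc.evaluate()).collect()
}

/// Per-group sums using a single accumulator that sees the whole batch at once.
fn sum_per_group_vectorized(values: &[f64], group_indices: &[usize], num_groups: usize) -> Vec<f64> {
    let mut accumulator = SumGroupsAccumulator::default();
    accumulator.update_batch(values, group_indices, num_groups);
    accumulator.evaluate()
}
```

Both functions compute the same result. The vectorized form avoids one allocation per group and processes a batch in a tight loop, which is the kind of benefit a user-defined aggregate misses if it can only implement the per-group interface.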

Describe the solution you'd like

I propose moving DataFusion to use only AggregateUDFs and removing the built-in list of AggregateFunctions, for the same reasons as #8045.

We will keep the existing AggregateUDF interface as much as possible, while also potentially providing an easier way to define them.

New AggregateUDFs live in the functions-aggregate crate.
Old aggregate functions are in datafusion/physical-expr/src/aggregate.
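
As a rough sketch of what "only AggregateUDFs" could look like from a user's point of view: every aggregate, whether shipped with the engine or written by a user, is a value registered through the same path, and users opt in to the packages they want. Everything below (`AggregateUdf`, `Registry`, `register_general`, `SumSquares`) is a hypothetical std-only model made up for this example, not the DataFusion API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical per-group state for an aggregate.
trait Accumulator {
    fn update(&mut self, value: f64);
    fn evaluate(&self) -> f64;
}

/// Hypothetical stand-in for an aggregate UDF: a name plus an accumulator factory.
trait AggregateUdf: Send + Sync {
    fn name(&self) -> &str;
    fn create_accumulator(&self) -> Box<dyn Accumulator>;
}

// An aggregate that would previously have been an enum variant.
struct MaxAccumulator(f64);
impl Accumulator for MaxAccumulator {
    fn update(&mut self, value: f64) {
        self.0 = self.0.max(value);
    }
    fn evaluate(&self) -> f64 {
        self.0
    }
}
struct Max;
impl AggregateUdf for Max {
    fn name(&self) -> &str {
        "max"
    }
    fn create_accumulator(&self) -> Box<dyn Accumulator> {
        Box::new(MaxAccumulator(f64::NEG_INFINITY))
    }
}

// An aggregate defined by a user, implemented exactly the same way.
struct SumSquaresAccumulator(f64);
impl Accumulator for SumSquaresAccumulator {
    fn update(&mut self, value: f64) {
        self.0 += value * value;
    }
    fn evaluate(&self) -> f64 {
        self.0
    }
}
struct SumSquares;
impl AggregateUdf for SumSquares {
    fn name(&self) -> &str {
        "sum_squares"
    }
    fn create_accumulator(&self) -> Box<dyn Accumulator> {
        Box::new(SumSquaresAccumulator(0.0))
    }
}

/// The registry starts empty: nothing is "built in".
#[derive(Default)]
struct Registry {
    udafs: HashMap<String, Arc<dyn AggregateUdf>>,
}

impl Registry {
    fn register(&mut self, udaf: Arc<dyn AggregateUdf>) {
        let name = udaf.name().to_string();
        self.udafs.insert(name, udaf);
    }

    /// Run the named aggregate over `values`; `None` if it was never registered.
    fn aggregate(&self, name: &str, values: &[f64]) -> Option<f64> {
        let udaf = self.udafs.get(name)?;
        let mut accumulator = udaf.create_accumulator();
        for &value in values {
            accumulator.update(value);
        }
        Some(accumulator.evaluate())
    }
}

/// Opt-in "package" of general-purpose aggregates (hypothetical grouping).
fn register_general(registry: &mut Registry) {
    registry.register(Arc::new(Max));
}
```

In this model a statistical or approximate package would be another `register_*` function, possibly in a separate crate, so users who never call it do not pay for it.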

Describe alternatives you've considered

Additional context

Proposed implementation steps:

Move Rust tests to sqllogictest if possible #10384

Good first issue list

Pending

  • nth_value
  • array_agg_distinct
  • array_agg_ordered

impl FromStr for AggregateFunction {
    type Err = DataFusionError;
    fn from_str(name: &str) -> Result<AggregateFunction> {
        Ok(match name {
            // general
            "avg" => AggregateFunction::Avg,
            "bit_and" => AggregateFunction::BitAnd,
            "bit_or" => AggregateFunction::BitOr,
            "bit_xor" => AggregateFunction::BitXor,
            "bool_and" => AggregateFunction::BoolAnd,
            "bool_or" => AggregateFunction::BoolOr,
            "max" => AggregateFunction::Max,
            "mean" => AggregateFunction::Avg,
            "min" => AggregateFunction::Min,
            "array_agg" => AggregateFunction::ArrayAgg,
            "nth_value" => AggregateFunction::NthValue,
            "string_agg" => AggregateFunction::StringAgg,
            // statistical
            "corr" => AggregateFunction::Correlation,
            // other
            "grouping" => AggregateFunction::Grouping,
            _ => {
                return plan_err!("There is no built-in function named {name}");
            }
        })
    }
}
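
For comparison, here is one way the hard-coded match above could be replaced by a data-driven lookup once each function carries its own name and aliases (note that "mean" above maps to the same function as "avg"). This is a std-only sketch; `AggregateEntry`, `build_lookup`, `resolve` and `example_entries` are hypothetical names for this example, and the error text is only modeled on the one in the match.

```rust
use std::collections::HashMap;

/// Hypothetical description of a registered aggregate:
/// a canonical name plus any alternative spellings.
struct AggregateEntry {
    name: &'static str,
    aliases: &'static [&'static str],
}

/// Build a map from every accepted spelling to the canonical function name.
fn build_lookup(entries: &[AggregateEntry]) -> HashMap<&'static str, &'static str> {
    let mut lookup = HashMap::new();
    for entry in entries {
        lookup.insert(entry.name, entry.name);
        for &alias in entry.aliases {
            lookup.insert(alias, entry.name);
        }
    }
    lookup
}

/// Resolve a name typed by the user to a canonical function name,
/// or return an error message if nothing was registered under it.
fn resolve(
    lookup: &HashMap<&'static str, &'static str>,
    name: &str,
) -> Result<&'static str, String> {
    lookup
        .get(name)
        .copied()
        .ok_or_else(|| format!("There is no aggregate function named {name}"))
}

/// A small example set; a real registry would be filled by whatever
/// functions the user chose to register.
fn example_entries() -> Vec<AggregateEntry> {
    vec![
        AggregateEntry { name: "avg", aliases: &["mean"] },
        AggregateEntry { name: "max", aliases: &[] },
    ]
}
```

With this shape, adding or removing a function no longer means editing a central match statement: the set of accepted names is whatever was registered.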

Feel free to file an issue if you are interested in working on any of the above in the pending list.
