Skip to content

Conversation

@beliefer
Copy link
Contributor

@beliefer beliefer commented Jul 20, 2022

What changes were proposed in this pull request?

Currently, DS V2 aggregate push-down cannot work with DS V2 Top N push-down (order by ... limit ...) or DS V2 Paging push-down (order by ... limit ... offset ...).
If we can push down aggregate with Top N or Paging, it will be better performance.

This PR only let aggregate pushed down with ORDER BY column which must be GROUP BY column.

The idea of this PR are:

  1. If the pushedAggregate is defined, it tell me we may push down aggregate with Top N or Paging.
  2. Confirm that all the columns of ORDER BY are column of GROUP BY. We need consider user have defined an alias for column of GROUP BY.
  3. If all the columns of ORDER BY are column of GROUP BY, then we can push down aggregate with Top N or Paging.4.

This PR have a key part which is how to get the original group by column name.
For lazily build the Scan, the code show below give an expectation output of ScanBuilderHolder`.

Then the aggregate pushdown will construct an Alias for the group by columns show below.

case ne: NamedExpression => Alias(groupOutput(ordinal), ne.name)(ne.exprId)

or
case ne: NamedExpression => Alias(groupOutput(ordinal), ne.name)(ne.exprId)

So, if we want find out the original group by column, need two steps.
First step, use findGroupColumn to find out the Alias used for attribute starts with group_col_. As you know, the name of Alias may be the origin column name.
Second step, check the attribute looked from first step if it is the origin column by sHolder.relation.output.exists(out => out.semanticEquals(groupCol) .
Third step, recreate the SortOrder with the origin column.

Why are the changes needed?

Let DS V2 aggregate push down can work with Top N or Paging (Sort with group column), then users can get the better performance.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

@beliefer
Copy link
Contributor Author

ping @huaxingao cc @cloud-fan

@beliefer beliefer force-pushed the SPARK-39819 branch 3 times, most recently from 679e705 to f620009 Compare July 23, 2022 12:27
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per SortOrder(attr: AttributeReference..., it's always AttributeReference. Should it address Alias?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If users specify an Alias for group column. It will be Alias(alias: Alias, _).

@beliefer
Copy link
Contributor Author

ping @cloud-fan

@cloud-fan
Copy link
Contributor

I think the key part here is how to get the original group by column name. If we simply allow top n pushdown, it actually works fine, but the column name is incorrect like group_col_0. We need to restore to the original column name. Can you explain more about this part?

@beliefer
Copy link
Contributor Author

I think the key part here is how to get the original group by column name. If we simply allow top n pushdown, it actually works fine, but the column name is incorrect like group_col_0. We need to restore to the original column name. Can you explain more about this part?

PR description updated.

private def findGroupColumn(alias: Alias): Option[AttributeReference] = alias match {
case alias @ Alias(attr: AttributeReference, name) if attr.name.startsWith("group_col_") =>
Some(AttributeReference(name, attr.dataType)(alias.exprId))
case Alias(alias: Alias, _) => findGroupColumn(alias)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it's a bit hacky to assume the Alias contains the actual grouping columns. How about we generate the name mapping (grouping attribute to actual group column name) during agg pushdown, put the name mapping in ScanBuilderHolder, and use the mapping to rewrite order by expression during limit pushdown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@beliefer
Copy link
Contributor Author

Because #37320 merged, I will close this PR.

@beliefer beliefer closed this Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants