@cloud-fan
Contributor

What changes were proposed in this pull request?

backport #24144 and #24459 to 2.3.

How was this patch tested?

existing tests

pgandhi and others added 2 commits May 7, 2019 00:35
## What changes were proposed in this pull request?

Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107).

However, the Hive UDAF adapter in Spark always creates the buffer with partial1 mode, which can only deal with one input: the original data. This PR fixes it.
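
As a rough illustration of why the mode matters, here is a minimal, self-contained toy model. It is not the Hive API: `ModeAwareBufferSketch`, `RawInputBuffer`, `PartialResultBuffer`, `newBuffer`, and `canMerge` are hypothetical names, and the `Mode` enum only borrows the names of Hive's aggregation modes. The sketch assumes a UDAF that picks its buffer type from the mode, so a buffer created for PARTIAL1 has no way to absorb partial results.

```java
/** Toy model of a mode-aware aggregate. Hypothetical names; not the Hive API. */
final class ModeAwareBufferSketch {
    /** Borrows the names of Hive's aggregation modes for illustration only. */
    enum Mode { PARTIAL1, PARTIAL2, FINAL, COMPLETE }

    interface Buffer {}

    /** Buffer for the original data: it only knows how to absorb raw values. */
    static final class RawInputBuffer implements Buffer {
        long sum;
        void update(long value) { sum += value; }
    }

    /** Buffer for partial results: it only knows how to absorb partial sums. */
    static final class PartialResultBuffer implements Buffer {
        long sum;
        void merge(long partialSum) { sum += partialSum; }
    }

    /** The buffer type depends on the mode, i.e. on what the input will be. */
    static Buffer newBuffer(Mode mode) {
        switch (mode) {
            case PARTIAL1:
            case COMPLETE:
                return new RawInputBuffer();      // input is the original data
            default:
                return new PartialResultBuffer(); // input is an aggregation buffer
        }
    }

    static boolean canMerge(Buffer buffer) {
        return buffer instanceof PartialResultBuffer;
    }

    public static void main(String[] args) {
        // An adapter that always asks for PARTIAL1 never gets a mergeable buffer.
        System.out.println(canMerge(newBuffer(Mode.PARTIAL1)));
        System.out.println(canMerge(newBuffer(Mode.FINAL)));
    }
}
```

In this model, an adapter that hard-codes `Mode.PARTIAL1` can only ever obtain a `RawInputBuffer`, which mirrors the limitation described above.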

All credit goes to pgandhi999, who investigated the problem, studied the Hive UDAF behavior, and wrote the tests.

close apache#23778

## How was this patch tested?

a new test

Closes apache#24144 from cloud-fan/hive.

Lead-authored-by: pgandhi <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…H in Hive UDAF adapter

## What changes were proposed in this pull request?

This is a follow-up to apache#24144, which missed one case: when hash aggregation falls back to sort aggregation, the life cycle of the UDAF is INIT -> UPDATE -> MERGE -> FINISH.

However, not all Hive UDAFs support this life cycle. A Hive UDAF knows the aggregation mode when creating the aggregation buffer, so it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). The buffer created for UPDATE may not support MERGE.

This PR updates the Hive UDAF adapter in Spark to support INIT -> UPDATE -> MERGE -> FINISH by turning it into INIT -> UPDATE -> FINISH + INIT -> MERGE -> FINISH.
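
The split can be sketched with a toy model. This is not Spark's actual adapter code: `LifecycleSplitSketch`, `TaggedBuffer`, `initForUpdate`, `initForMerge`, and `run` are hypothetical names, and a plain sum stands in for a real UDAF. The sketch assumes each buffer carries a flag saying whether it may merge. When MERGE arrives on an update-only buffer, the update phase is finished, a merge-capable buffer is initialized, and the finished partial result is merged into it first.

```java
/** Toy model of splitting INIT -> UPDATE -> MERGE -> FINISH. Not Spark's code. */
final class LifecycleSplitSketch {
    /** Hypothetical buffer tagged with whether it supports MERGE. */
    static final class TaggedBuffer {
        long sum;
        final boolean canDoMerge;
        TaggedBuffer(boolean canDoMerge) { this.canDoMerge = canDoMerge; }
    }

    /** INIT for the update phase: the buffer accepts raw input only. */
    static TaggedBuffer initForUpdate() { return new TaggedBuffer(false); }

    /** INIT for the merge phase: the buffer accepts partial results only. */
    static TaggedBuffer initForMerge() { return new TaggedBuffer(true); }

    /** UPDATE: absorb one raw input value. */
    static void update(TaggedBuffer buffer, long value) {
        if (buffer.canDoMerge) {
            throw new IllegalStateException("merge buffer cannot take raw input");
        }
        buffer.sum += value;
    }

    /** FINISH: produce the (partial or final) result of a buffer. */
    static long finish(TaggedBuffer buffer) { return buffer.sum; }

    /** MERGE: absorb a partial result, switching buffers first if needed. */
    static TaggedBuffer merge(TaggedBuffer buffer, long partialResult) {
        TaggedBuffer target = buffer;
        if (!buffer.canDoMerge) {
            // FINISH the update phase, then INIT a fresh merge-capable buffer
            // and feed it the finished partial result.
            target = initForMerge();
            target.sum += finish(buffer);
        }
        target.sum += partialResult;
        return target;
    }

    /** Runs INIT -> UPDATE -> MERGE -> FINISH through the split life cycle. */
    static long run() {
        TaggedBuffer buffer = initForUpdate();
        update(buffer, 1);
        update(buffer, 2);
        buffer = merge(buffer, 10); // switches to a merge-capable buffer here
        buffer = merge(buffer, 20); // already merge-capable, no switch
        return finish(buffer);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 33
    }
}
```

The caller still sees a single INIT -> UPDATE -> MERGE -> FINISH sequence; only the buffer behind it changes when the first MERGE arrives.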

## How was this patch tested?

a new test case

Closes apache#24459 from cloud-fan/hive-udaf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor Author

cc @gatorsmile

@pgandhi999

+1

@SparkQA

SparkQA commented May 6, 2019

Test build #105159 has finished for PR 24539 at commit f16ff36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean)

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. This is a clean cherry-pick of both patches.
Merged to branch-2.3

dongjoon-hyun pushed a commit that referenced this pull request May 6, 2019
## What changes were proposed in this pull request?

backport #24144 and #24459 to 2.3.

## How was this patch tested?

existing tests

Closes #24539 from cloud-fan/backport.

Lead-authored-by: pgandhi <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>