[SPARK-24935][SQL] : Problem with Executing Hive UDF's from Spark 2.2 Onwards #23778
Conversation
Created a new abstract class HiveTypedImperativeAggregate, which is a framework for Hive-related aggregation functions. Also, there seems to be a bug in SortBasedAggregator where it was calling merge on the aggregate buffer without initializing it. Have fixed it in this PR.
ok to test
Test build #102315 has finished for PR 23778 at commit
could you add tests in this pr?
@maropu Yes, I am definitely going to add unit tests here. For now I am working on fixing the unit tests, as this PR is failing a bunch of Hive compatibility tests (trying to figure out why :) ).
Test build #102373 has finished for PR 23778 at commit
Test build #102396 has finished for PR 23778 at commit
can you briefly explain how the hive UDAF works? Then we can have more people looking at it and see how to map it to Spark's
Sure @cloud-fan. Thank you for your response. As far as my understanding of Hive UDAF is concerned, I can roughly classify them into two types: those that support partial aggregation (modes PARTIAL and FINAL) and those that do not (mode COMPLETE). For the Hive UDAFs that support partial aggregation, there are five phases:
For the Hive UDAFs that do not support partial aggregation, I have seen the following three phases:
For more information, you may find this link helpful: https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy. This information is based on what I have found out during my tests and reading through the docs, and it is based on this information that I have modeled the behaviour of the class.
Test build #102685 has finished for PR 23778 at commit
Test build #102686 has finished for PR 23778 at commit
Test build #102687 has finished for PR 23778 at commit
@maropu @cloud-fan I have added unit tests to the PR for a more detailed understanding of the issue as well as the fix. Thank you.
final override def initialize(buffer: InternalRow): Unit = {
  partial2ModeBuffer = buffer.copy()
  partial2ModeBuffer(mutableAggBufferOffset) = createPartial2ModeAggregationBuffer()
I'm a little lost here. So this HiveTypedImperativeAggregate has 2 buffers? What's the difference between partial2ModeBuffer and buffer?
So Spark Catalyst has designed UDAF execution such that it uses one aggregation buffer for performing the aggregations for all UDAF operators (sort-based, object-hash-based, etc.), which makes sense from Spark's point of view. However, from Hive's point of view, two aggregate buffers are expected to be used: one for PARTIAL1/COMPLETE modes and the other for PARTIAL2/FINAL modes. Since I did not wish to redesign the Catalyst UDAF structure only for Hive, I have left the original calls and buffer as they are for PARTIAL1/COMPLETE mode and have created the partial2ModeBuffer exclusively for PARTIAL2/FINAL mode operations. Thus, to answer your question, buffer here is used for PARTIAL1 mode operations and partial2ModeBuffer is used for PARTIAL2 mode operations. I hope that answers your question. Thank you once again for reviewing, @cloud-fan.
But I don't quite understand how you make the Hive UDAF work with Spark's two phase aggregate?
I see. Can you please elaborate more on the two phase aggregate functionality by Spark? That will help me understand and answer your question better. Thank you.
Let's start with the 5 phases of a UDAF:
- Initialize: The aggregation buffers for PARTIAL1 Mode and PARTIAL2 Mode are created in this phase.
- Iterate (Update): Processes a new row of data into the aggregation buffer created for PARTIAL1.
- TerminatePartial: Returns the contents of the aggregation buffer.
- Merge: Merges a partial aggregation returned by calling terminatePartial() on PARTIAL1 aggregation buffer into the current aggregation happening on PARTIAL2 aggregation buffer.
- Terminate: Returns the final result of the aggregation stored in PARTIAL2 buffer to Hive.
In Spark, a UDAF will be run twice in two adjacent aggregate operators, called partial aggregate and final aggregate. In the partial aggregate, there are 3 steps:
- initialize the UDAF
- update UDAF with input data (so-called Iterate)
- return the UDAF buffer (so-called TerminatePartial)
In the final aggregate, also 3 steps:
- initialize the UDAF
- update UDAF with buffer data from the partial aggregate (so-called Merge)
- return final result (so-called Terminate)
But this doesn't work for the 3-phase UDAF which doesn't support partial aggregate.
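The partial/final flow described above can be sketched with a toy sum aggregator. This is illustrative Java only, not Spark or Hive code; all class and method names are made up, and the aggregation state is modeled as a plain `long[]` buffer:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a Hive-style UDAF (NOT the actual
// GenericUDAFEvaluator API): a sum whose partial state is a long,
// driven through the two-phase partial/final flow described above.
public class TwoPhaseSketch {
    // --- the five Hive-style operations, modeled on a long[] buffer ---
    static long[] initialize()               { return new long[]{0L}; } // Initialize
    static void iterate(long[] buf, long v)  { buf[0] += v; }           // Iterate
    static long terminatePartial(long[] buf) { return buf[0]; }         // TerminatePartial
    static void merge(long[] buf, long p)    { buf[0] += p; }           // Merge
    static long terminate(long[] buf)        { return buf[0]; }         // Terminate

    // Partial aggregate: init + iterate + terminatePartial (one per task)
    static long partialAggregate(List<Long> rows) {
        long[] buf = initialize();
        for (long row : rows) iterate(buf, row);
        return terminatePartial(buf);
    }

    // Final aggregate: init + merge + terminate (combines all partials)
    static long finalAggregate(List<Long> partials) {
        long[] buf = initialize();
        for (long p : partials) merge(buf, p);
        return terminate(buf);
    }

    public static void main(String[] args) {
        long p1 = partialAggregate(Arrays.asList(1L, 2L, 3L));
        long p2 = partialAggregate(Arrays.asList(4L, 5L));
        System.out.println(finalAggregate(Arrays.asList(p1, p2))); // 15
    }
}
```

A UDAF without partial-aggregation support never runs `terminatePartial`/`merge`; it can only run the single-phase init + iterate + terminate sequence, which is exactly why the 3-phase case does not fit this two-operator plan.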
In Hive UDAF, when to use which agg buffer? I think this is the most important information to justify your patch. It will be better if you can point to some Hive doc/code comments.
So I went through the Hive docs and asked a couple of people around; officially, Hive does not mention anything about using two different aggregation buffers. The main point is to have some kind of distinction between the different phases of Hive.
Consider a classic map-reduce process. There are two phases: map and reduce (sometimes an optional combine phase in between). The phases can run on different nodes. The state lives within a phase and does not cross the boundaries. The map phase corresponds to the "partial1" mode (init + iterate + terminate partial). The reduce phase corresponds to the "final" mode (init + merge + terminate). The combine phase corresponds to the "partial2" mode (init + merge + terminate partial). The "complete" mode is a special shortcut to run the whole thing as a single phase (init + iterate + terminate). The bug here is about a state crossing the boundaries between the phases: initialized for one phase (mode), but then passed to a different phase. So by using different aggregation buffers, I am trying to encapsulate the corresponding state within a particular phase. The solution can also be modified to have a single aggregation buffer supporting states of different phases.
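The mode-to-phase correspondence above can be summarized as a lookup. This is an illustrative sketch: the mode names follow the four modes discussed in this thread, and the operation strings are purely descriptive, not any actual Hive or Spark API:

```java
// Illustrative mapping (not Hive code) from aggregation mode to the
// operation sequence that mode runs on its buffer, per the map/
// combine/reduce correspondence described above.
public class ModePlan {
    enum Mode { PARTIAL1, PARTIAL2, FINAL, COMPLETE }

    static String[] plan(Mode m) {
        switch (m) {
            case PARTIAL1: return new String[]{"init", "iterate", "terminatePartial"}; // map phase
            case PARTIAL2: return new String[]{"init", "merge",   "terminatePartial"}; // combine phase
            case FINAL:    return new String[]{"init", "merge",   "terminate"};        // reduce phase
            case COMPLETE: return new String[]{"init", "iterate", "terminate"};        // single phase
            default: throw new IllegalArgumentException("unknown mode");
        }
    }

    public static void main(String[] args) {
        // The bug is a buffer initialized for one row of this table
        // being driven through the operations of a different row.
        System.out.println(String.join(" + ", plan(Mode.FINAL))); // init + merge + terminate
    }
}
```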
In my PR above, the assumption is that the Partial1 aggregation buffer supports phases PARTIAL1/COMPLETE and the Partial2 aggregation buffer supports phases PARTIAL2/FINAL.
I shall also paste a link to a good blog that explains the usage of aggregation buffers in a generic Hive UDAF : https://blog.dataiku.com/2013/05/01/a-complete-guide-to-writing-hive-udf
As this is also a kind of design-change problem, it is completely open to further discussion and improvement. My solution is just one of multiple possible solutions that achieve the same thing. However, as far as I can say, it is relatively clean and easy to understand, and it does not change the way existing aggregation functions work with Spark SQL (does not break compatibility).
So there are 4 ways to execute a UDAF
- init + iterate + terminate partial
- init + merge + terminate final
- init + merge + terminate partial
- init + iterate + terminate final
Spark doesn't really have terminate partial. The agg buffer needs to fit the spark schema so Spark can get agg buffer directly. Spark UDAF is flexible: after initialized, the buffer can be updated via either iterate or merge, the buffer can be terminated always.
IIUC init + merge + terminate final is pretty common in GROUP BY queries, and Hive UDAF works in this case. Do you know why?
And your test case is init + iterate + terminate final, what's the correct steps to do it? Is it
1. create partial1 buffer
2. iterate
3. turn partial1 buffer to partial2 buffer
4. terminate final
@cloud-fan Sorry for the delayed response, I agree with your point above. However, I did not understand the question correctly.
IIUC init + merge + terminate final is pretty common in GROUP BY queries, and Hive UDAF works in this case. Do you know why?
And your test case is init + iterate + terminate final
Actually my test case is init + iterate + terminate partial, along with init + iterate + merge + terminate partial, and finally ending with init + merge + terminate final. So according to me, the correct steps here would roughly be:
1. create partial1 buffer
2. iterate
3. merge partial1 buffer into partial2 buffer
4. terminate final
Apologies if I have misread your above comment and have not answered it appropriately, please let me know. Thank you.
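The four steps above can be sketched with plain types. This is illustrative only (not Spark/Hive code, and the buffer types are made up): the point is that the partial1 buffer consumes raw rows via iterate, while the partial2 buffer consumes partial results via merge:

```java
import java.util.Arrays;

// Illustrative two-buffer sketch of the corrected sequence:
// 1. create partial1 buffer, 2. iterate, 3. merge partial1's
// partial result into the partial2 buffer, 4. terminate final.
public class TwoBufferSketch {
    static long[] createPartial1Buffer() { return new long[]{0L}; }
    static long[] createPartial2Buffer() { return new long[]{0L}; }

    static void iterate(long[] p1, long row)   { p1[0] += row; }     // raw input rows only
    static long terminatePartial(long[] p1)    { return p1[0]; }
    static void merge(long[] p2, long partial) { p2[0] += partial; } // partial results only
    static long terminate(long[] p2)           { return p2[0]; }

    public static void main(String[] args) {
        long[] p1 = createPartial1Buffer();                      // step 1
        for (long row : Arrays.asList(7L, 8L)) iterate(p1, row); // step 2
        long[] p2 = createPartial2Buffer();
        merge(p2, terminatePartial(p1));                         // step 3
        System.out.println(terminate(p2));                       // step 4 -> 15
    }
}
```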
to make it clear, the partial1 buffer can only be used in iterate to consume records, and the partial2 buffer can only be used in merge to consume buffers, do I understand it correctly?
We're now going ahead for Spark 3.0. Do we still need to support this Hive UDAF case? Sorry if I am asking a question that's already answered somewhere.
@cloud-fan Your other PR fixes the Hive UDAF bug, but the issue with SortBasedAggregator, where it was calling merge on the aggregate buffer without initializing it, still remains, and this PR addresses it. I will go ahead and clean up this PR to put in only the fix for the above bug. Thank you.
## What changes were proposed in this pull request?
Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). However, the Hive UDAF adapter in Spark always creates the buffer with partial1 mode, which can only deal with one input: the original data. This PR fixes it. All credits go to pgandhi999, who investigated the problem, studied the Hive UDAF behaviors, and wrote the tests. close #23778
## How was this patch tested?
a new test
Closes #24144 from cloud-fan/hive. Lead-authored-by: pgandhi <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: gatorsmile <[email protected]> (cherry picked from commit a6c207c) Signed-off-by: gatorsmile <[email protected]>
…for SortBasedAggregate
Normally, the aggregate operations invoked on an aggregation buffer for User Defined Aggregate Functions (UDAF) follow an order like initialize(), update(), eval() or initialize(), merge(), eval(). However, after a certain threshold configurable by spark.sql.objectHashAggregate.sortBased.fallbackThreshold is reached, ObjectHashAggregate falls back to SortBasedAggregator, which invokes the merge or update operation without calling initialize() on the aggregate buffer.
## What changes were proposed in this pull request?
The fix here is to initialize aggregate buffers again when the fallback to the SortBasedAggregate operator happens.
## How was this patch tested?
The patch was tested as part of [SPARK-24935](https://issues.apache.org/jira/browse/SPARK-24935) as documented in PR #23778.
Closes #24149 from pgandhi999/SPARK-27207. Authored-by: pgandhi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
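The fallback bug described in that commit can be sketched in miniature. This is illustrative only (class and method names are invented, not Spark's internals): the destination buffer must be initialized before merge is first called on it, which is what the guard below ensures:

```java
// Illustrative sketch (not Spark code) of the fallback fix: merging
// into a buffer that was never initialized would dereference null,
// so the buffer is initialized lazily before the first merge.
public class FallbackSketch {
    static long[] buffer;              // null until initialize() runs
    static boolean initialized = false;

    static void initialize() { buffer = new long[]{0L}; initialized = true; }

    static void merge(long partial) {
        // Without this guard, the first merge after fallback would
        // hit an uninitialized buffer (NullPointerException here).
        if (!initialized) initialize();
        buffer[0] += partial;
    }

    public static void main(String[] args) {
        merge(5L);   // would fail without the lazy-initialization guard
        merge(10L);
        System.out.println(buffer[0]); // 15
    }
}
```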
A user of sketches library reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards.
For more details on the issue, you can refer to the discussion in the sketches-user list
On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF provides support for partial aggregation, and has removed the functionality that supported complete mode aggregation (see SPARK-19060 and SPARK-18186).
Thus, instead of expecting update method to be called, merge method is called here which throws the exception as described in the forums above.
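The failure mode above can be sketched in isolation. This is illustrative only (not Spark/Hive code): a buffer built to consume raw input rows is handed the serialized partial result that merge passes, and the cast fails:

```java
// Illustrative sketch (not Spark/Hive code) of the mismatch: an
// update path that expects a raw Long row receives a serialized
// partial result instead, and the cast throws.
public class ModeMismatchSketch {
    // A partial1-style buffer update: expects a raw Long input row.
    static void update(long[] buf, Object input) {
        buf[0] += (Long) input; // ClassCastException if input is a partial blob
    }

    public static void main(String[] args) {
        long[] buf = new long[]{0L};
        Object partialBlob = new byte[]{1, 2, 3}; // what merge actually passes
        try {
            update(buf, partialBlob);
        } catch (ClassCastException e) {
            System.out.println("mode mismatch: buffer was built for raw rows");
        }
    }
}
```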
What changes were proposed in this pull request?
Created new abstract class HiveTypedImperativeAggregate, which is a framework for Hive-related aggregation functions.
Also, there seems to be a bug in SortBasedAggregator where it was calling merge on the aggregate buffer without initializing it. Have fixed it in this PR.
How was this patch tested?
The steps to reproduce the above issue have been stated in the google group link posted above but will repeat them here for convenience:
1. Download the following three jars from the maven repository in here.
2. Launch spark-shell by adding the above jars in the driver as well as executor classpath and run the following commands:
3. You will see the following exception below:
4. After the code changes in this PR, run the same test as above and it should work.