[SPARK-24935][SQL] : Problem with Executing Hive UDF's from Spark 2.2 Onwards #23778
Closed
Commits (6):
- 6b90328 [SPARK-24935] : Problem with Executing Hive UDF's from Spark 2.2 Onwards
- 4cbdf27 [SPARK-24935] : Fixing Unit tests
- 21371d3 [SPARK-24935] : Removing redundant code line that was failing unit tests
- 7253983 [SPARK-24935] : Adding Unit Tests
- fdfd67c [SPARK-24935] : Fixing Scalastyle Tests
- bd57d0f [SPARK-24935] : Removing Empty Line
I'm a little lost here. So this `HiveTypedImperativeAggregate` has 2 buffers? What's the difference between `partial2ModeBuffer` and `buffer`?
Spark Catalyst has designed UDAF execution such that it uses a single aggregation buffer for performing the aggregations in all UDAF operators (sort-based, object-hash-based, etc.), which makes sense from Spark's point of view. From Hive's point of view, however, two aggregation buffers are expected: one for the PARTIAL1/COMPLETE modes and the other for the PARTIAL2/FINAL modes. Since I did not want to redesign the Catalyst UDAF structure only for Hive, I have left the original calls and buffer as they are for PARTIAL1/COMPLETE mode and have created `partial2ModeBuffer` exclusively for PARTIAL2/FINAL mode operations. So, to answer your question, `buffer` is used for PARTIAL1-mode operations and `partial2ModeBuffer` is used for PARTIAL2-mode operations. I hope that answers your question. Thank you once again for reviewing, @cloud-fan.
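To illustrate the split described above, here is a toy Python model of the two-buffer idea. This is purely illustrative: the real implementation is Scala inside Spark's Hive UDAF support, and only the two buffer names are borrowed from the PR; everything else (the class, the method names) is made up for the sketch.

```python
# Toy model of the two-buffer scheme: one buffer consumes raw input rows
# (PARTIAL1/COMPLETE), the other consumes upstream partial results
# (PARTIAL2/FINAL). Names are illustrative, not Spark's real API.
class TwoBufferAgg:
    def __init__(self):
        self.buffer = 0                # PARTIAL1 / COMPLETE: fed by iterate()
        self.partial2_mode_buffer = 0  # PARTIAL2 / FINAL: fed by merge()

    def iterate(self, row_value):
        # map side: consume a raw input row
        self.buffer += row_value

    def terminate_partial(self):
        # ship the map-side state downstream
        return self.buffer

    def merge(self, partial_result):
        # reduce side: consume a partial result, never a raw row
        self.partial2_mode_buffer += partial_result

    def terminate(self):
        return self.partial2_mode_buffer

# Map side:
m = TwoBufferAgg()
for v in [1, 2, 3]:
    m.iterate(v)
partial = m.terminate_partial()

# Reduce side (a fresh evaluator, as in a separate Hive phase):
r = TwoBufferAgg()
r.merge(partial)
print(r.terminate())  # 6
```

The point of the model is that each buffer's state stays within one phase: `buffer` never sees merged partials, and `partial2_mode_buffer` never sees raw rows.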
But I don't quite understand: how do you make the Hive UDAF work with Spark's two-phase aggregate?
I see. Can you please elaborate on Spark's two-phase aggregate functionality? That will help me understand and answer your question better. Thank you.
Let's start with the 5 phases of a UDAF: init, iterate, terminate partial, merge, and terminate.

In Spark, a UDAF will be run twice, in two adjacent aggregate operators, called partial aggregate and final aggregate. In the partial aggregate, there are 3 steps: initialize the buffer, update it with input rows (iterate), and output the buffer. In the final aggregate, also 3 steps: initialize the buffer, merge the incoming buffers into it, and output the final result (terminate).

But this doesn't work for the 3-phase UDAF which doesn't support partial aggregate.
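The two-phase flow described above can be sketched in a few lines of Python. This is a model of the dataflow only, not Spark's actual API; the function names are invented for the sketch.

```python
# Illustrative model of Spark's two adjacent aggregate operators
# (not Spark's real API).

def partial_aggregate(rows):
    buf = 0          # step 1: initialize the buffer
    for r in rows:
        buf += r     # step 2: iterate / update with input rows
    return buf       # step 3: output the buffer itself

def final_aggregate(partial_buffers):
    buf = 0          # step 1: initialize the buffer
    for p in partial_buffers:
        buf += p     # step 2: merge incoming buffers
    return buf       # step 3: terminate, producing the final result

# Two map-side partitions, then one reduce-side merge:
p1 = partial_aggregate([1, 2])
p2 = partial_aggregate([3, 4])
result = final_aggregate([p1, p2])
print(result)  # 10
```

A Hive-style evaluator, by contrast, is told its mode up front, so a buffer initialized for one mode must not be fed to the steps of another mode; that mismatch is what the thread is discussing.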
For a Hive UDAF, when do we use which agg buffer? I think this is the most important information needed to justify your patch. It would be better if you can point to some Hive docs/code comments.
So I went through the Hive docs and asked a couple of people around. Officially, Hive does not say anything about using two different aggregation buffers; the main point is to have some kind of distinction between Hive's different phases.

Consider a classic map-reduce process. There are two phases, map and reduce (sometimes with an optional combine phase in between). The phases can run on different nodes, and state lives within a phase and does not cross the boundaries.
- The map phase corresponds to the PARTIAL1 mode (init + iterate + terminate partial).
- The reduce phase corresponds to the FINAL mode (init + merge + terminate).
- The combine phase corresponds to the PARTIAL2 mode (init + merge + terminate partial).
- The COMPLETE mode is a special shortcut that runs the whole thing as a single phase (init + iterate + terminate).

The bug here is about state crossing the boundary between phases: a buffer is initialized for one phase (mode) but then passed to a different phase. By using different aggregation buffers, I am trying to encapsulate the corresponding state within a particular phase. The solution could also be modified to use a single aggregation buffer that supports the states of different phases.
In my PR above, the assumption is that the PARTIAL1 aggregation buffer serves the PARTIAL1/COMPLETE modes and the PARTIAL2 aggregation buffer serves the PARTIAL2/FINAL modes.
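The mode-to-steps mapping and the buffer assignment just described can be written out as a small table. This restates the comment above in code form; the mode names follow Hive's terminology, but the mapping here is transcribed from the discussion, not from Hive source.

```python
# Hive UDAF modes and the steps each one runs, as described above.
HIVE_UDAF_MODES = {
    "PARTIAL1": ["init", "iterate", "terminatePartial"],  # map phase
    "PARTIAL2": ["init", "merge", "terminatePartial"],    # combine phase
    "FINAL":    ["init", "merge", "terminate"],           # reduce phase
    "COMPLETE": ["init", "iterate", "terminate"],         # single-phase shortcut
}

# The PR's assumption: which buffer serves which mode.
BUFFER_FOR_MODE = {
    "PARTIAL1": "buffer",
    "COMPLETE": "buffer",
    "PARTIAL2": "partial2ModeBuffer",
    "FINAL":    "partial2ModeBuffer",
}

# Sanity check: modes that consume raw rows (they run `iterate`) use
# `buffer`; modes that consume upstream partials (they run `merge`)
# use `partial2ModeBuffer`.
for mode, steps in HIVE_UDAF_MODES.items():
    expected = "buffer" if "iterate" in steps else "partial2ModeBuffer"
    assert BUFFER_FOR_MODE[mode] == expected
```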
I shall also paste a link to a good blog that explains the usage of aggregation buffers in a generic Hive UDAF: https://blog.dataiku.com/2013/05/01/a-complete-guide-to-writing-hive-udf
As this is also something of a design-change problem, it is completely open to further discussion and improvement. My solution is just one of multiple possible ways to achieve the same thing. However, as far as I can tell, it is relatively clean and easy to understand, and it does not change the way existing aggregation functions work with Spark SQL in any manner (it does not break compatibility).
So there are 4 ways to execute a UDAF:
- init + iterate + terminate partial
- init + merge + terminate partial
- init + merge + terminate final
- init + iterate + terminate final
Spark doesn't really have terminate partial. The agg buffer needs to fit the Spark schema so that Spark can get the agg buffer directly. A Spark UDAF is flexible: after it is initialized, the buffer can be updated via either iterate or merge, and the buffer can always be terminated.
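To contrast with the mode-locked Hive buffers discussed earlier, here is a toy model of the "flexible" Spark-style buffer just described: once initialized, it accepts rows and other buffers in any order and can be terminated at any time. Again purely illustrative, not Spark's real API.

```python
# Illustrative model of a Spark-style agg buffer (not Spark's real API):
# no mode is fixed at init time, so update and merge can be interleaved.
class FlexibleBuffer:
    def __init__(self):
        self.state = 0             # init

    def update(self, row):
        self.state += row          # consume a raw input row

    def merge(self, other):
        self.state += other.state  # consume another buffer

    def terminate(self):
        return self.state          # always allowed, at any point

a = FlexibleBuffer()
a.update(1)

b = FlexibleBuffer()
b.update(2)
b.merge(a)   # mixing iterate-style and merge-style input is fine here
b.update(3)
print(b.terminate())  # 6
```

A Hive evaluator initialized for, say, PARTIAL1 would not allow the `merge` call in the middle of that sequence, which is the gap the PR's two buffers are papering over.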
IIUC, `init + merge + terminate final` is pretty common in GROUP BY queries, and Hive UDAF works in this case. Do you know why?

And your test case is `init + iterate + terminate final`; what are the correct steps to do it?
@cloud-fan Sorry for the delayed response; I agree with your point above. However, I did not understand the question correctly. Apologies if I have misread your comment and have not answered it appropriately; please let me know. Thank you.
To make it clear: the PARTIAL1 buffer can only be used in `iterate` to consume records, and the PARTIAL2 buffer can only be used in `merge` to consume buffers. Do I understand it correctly?