[SPARK-20396][SQL][PySpark] groupby().apply() with pandas udf #18732
Conversation
Force-pushed from f3bbb86 to 76a7ce6.
Force-pushed from ebc2c67 to 34e1dd4.
cool - this is a bit understated but potentially huge (to me anyway)
Force-pushed from 32ad7b2 to 3237cd0.
Should we maybe consider SPIP?
Thanks all for the comments. This is part of SPIP https://issues.apache.org/jira/browse/SPARK-21190. I don't expect we will start to merge this until we have a solid design in SPARK-21190. It would be great if Spark committers can help move the discussion on SPARK-21190 forward.
There's actually a number of key people participating in the discussion in JIRA/SPIP, so I think we are good. I think perhaps SPARK-20396 should be a subtask instead, for tracking the overall discussion/design.
Force-pushed from c8a20cc to 8630028.
Force-pushed from 8630028 to 07bccca.
Hi, thanks to the vectorized udf change, this PR is much smaller than the original. I think this is a useful feature and would love to get some feedback on this. Thoughts? cc @BryanCutler @HyukjinKwon @ueshin
ok to test
Test build #82288 has finished for PR 18732 at commit
BryanCutler left a comment:
This is looking really good @icexelloss! I'll have to look at this more in depth later as it touches a lot of code I'm not familiar with. Hopefully, someone better versed in this area can help guide you with what needs to be done to get this merged, like additional tests to add.
One question from the previous discussion in the JIRA: is the length of the Pandas DataFrame passed to apply() determined by maxRecordsPerBatch? So if you wanted to work with an entire groupby() key, you would need to set this conf big enough?
python/pyspark/sql/functions.py (Outdated)

    """
    import pandas as pd
    if isinstance(returnType, pd.Series):
        returnType = from_pandas_dtypes(returnType)
Is this necessary?
I think we don't need this. Using a consistent way to express the return type would be better.
I agree having a consistent way to express the return type is good.
The reason I added this is to enable this usage:

    sample_df = df.filter(df.id == 1).toPandas()

    def foo(df):
        ret = ...  # some transformation on the input pd.DataFrame
        return ret

    foo_udf = pandas_udf(foo, foo(sample_df).dtypes)

    df.groupBy('id').apply(foo_udf)

This pattern is quite useful in interactive usage. Here the user no longer needs to specify the return schema of foo manually. And if the user changes the return columns of foo, they don't need to change the return type of the pandas_udf.
I am leaning towards keeping this, but I am willing to be convinced.
    [StructField('id', LongType()),
     StructField('v', IntegerType()),
     StructField('v1', DoubleType()),
     StructField('v2', LongType())]))
The return type is a little different than current pandas_udfs. Are the resulting column names determined here? Does it have to be a StructType to do groupby().apply()?
Yes, the column names are specified in the returnType, and the returnType must be a StructType.
The rationale is that apply() is a mapping from pd.DataFrame -> pd.DataFrame, therefore the returnType must be a StructType.
This is the best way I can think of to specify the column names and returnType. It makes sense to me because there should be a one-to-one mapping between the return value of the function (a pd.DataFrame) and its schema (a StructType containing column names and data types).
Also, because pd.DataFrame doesn't support nested types, there is no ambiguity about whether a StructType indicates a pd.DataFrame or a nested type.
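To make this concrete, here is a minimal sketch of the usage being described, assuming a Spark DataFrame df with a long column id and a double column v (the column names and the transformation are illustrative, not taken from the PR's tests):

    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StructType, StructField, LongType, DoubleType

    # The StructType supplies both the output column names and their types,
    # mirroring the pd.DataFrame returned by the function one-to-one.
    out_schema = StructType([
        StructField('id', LongType()),
        StructField('v', DoubleType()),
        StructField('v_centered', DoubleType())])

    def center(pdf):
        # one group in, one pd.DataFrame out, with columns matching out_schema
        return pdf.assign(v_centered=pdf.v - pdf.v.mean())

    center_udf = pandas_udf(center, out_schema)
    df.groupBy('id').apply(center_udf)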
    val batchedIter: Iterator[Iterator[InternalRow]] =
      iter.grouped(conf.arrowMaxRecordsPerBatch).map(_.iterator)
Is it necessary to group the iterator like this?
The grouped iterator looks unnecessary. Actually you still write out the rows individually.
I guess this is for making ArrowPythonRunner reusable between the current pandas udf and apply(), by taking Iterator[Iterator[InternalRow]] instead of Iterator[InternalRow] as its input. The rows in each inner iterator will become one RecordBatch.
I'm not sure whether it's good or not, though.
I actually find this code doesn't work now; I will fix it.
@ueshin is right, this is to reuse ArrowEvalPython for both the current pandas udf and apply(). I basically want to lift the batching logic out of ArrowEvalPython so the caller can decide how rows are batched into a RecordBatch.
In the current pandas udf case, it batches by conf.arrowMaxRecordsPerBatch, and in apply() it batches one group per batch.
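As a conceptual illustration of the two batching strategies in plain Python (this is only a sketch, not Spark's Scala internals; the function names here are made up):

    from itertools import islice

    def batch_by_size(rows, max_records):
        # existing vectorized-udf path: fixed-size batches of at most max_records rows
        it = iter(rows)
        while True:
            batch = list(islice(it, max_records))
            if not batch:
                return
            yield batch

    def batch_by_group(grouped_rows):
        # groupby().apply() path: each group becomes exactly one batch,
        # however many rows it contains
        for _, rows in grouped_rows:
            yield list(rows)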
I believe I should cc @cloud-fan and @viirya too. Will take a closer look soon, too.
    import inspect
    if len(inspect.getargspec(f).args) == 0:
    argspec = inspect.getargspec(f)
    if len(argspec.args) == 0 and argspec.varargs is None:
BTW, let's address this comment while we are here.
Fixed
Based on the doc of pandas_udf, conceptually I'd see them as two types of udfs. You can't exchange them in usage. Maybe we can define another pandas_df_udf for this kind of udf?
python/pyspark/sql/group.py (Outdated)

    jgd = self._jgd.pivot(pivot_col, values)
    return GroupedData(jgd, self.sql_ctx)

    def apply(self, udf_obj):
We may possibly pass in a non-vectorized udf. Add a check for it?
Done
ueshin left a comment:
@icexelloss I think this is a useful feature, too!
I left some comments for now, could you please check them as well?
    grouping: Seq[Expression],
    func: Expression,
    override val output: Seq[Attribute],
    override val child: SparkPlan
We don't need override val for output and child.
Done
    func: Expression,
    override val output: Seq[Attribute],
    override val child: SparkPlan
  ) extends UnaryExecNode {
nit: style

    ...)
  extends UnaryExecNode {
Done
    .map { case (attr, i) => attr.withName(s"_$i") })

    val batchedIter: Iterator[Iterator[InternalRow]] =
      iter.grouped(conf.arrowMaxRecordsPerBatch).map(_.iterator)
What if conf.arrowMaxRecordsPerBatch <= 0?
    def apply(a: Attribute): AttributeSet = new AttributeSet(Set(new AttributeEquals(a)))

    def apply(as: Attribute*): AttributeSet =
      new AttributeSet(Set(as.map(new AttributeEquals(_)): _*))
Do we need this? It seems this isn't used.
Good catch! Removed.
    import org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer
    import org.apache.spark.sql.catalyst.encoders._
    import org.apache.spark.sql.catalyst.expressions._
    import org.apache.spark.sql.catalyst.expressions.{AttributeSet, _}
We can revert this.
Done
    case class FlatMapGroupsInPandas(
      groupingExprs: Seq[Expression],
      functionExpr: Expression,
      override val output: Seq[Attribute],
We don't need override val here.
Done
    df: DataFrame,
    groupingExprs: Seq[Expression],
    val df: DataFrame,
    val groupingExprs: Seq[Expression],
What are these val for?
val df is used for accessing the jdf object from Python.
Alternatively, I can store a reference to the Python DataFrame in the Python GroupedData object. It doesn't seem to be much different, though.
I removed val from groupingExprs
    private[sql] def flatMapGroupsInPandas(
        expr: PythonUDF
    ): DataFrame = {
nit: We can make this one line.
Done
    val argOffsets = Array((0 until child.schema.length).toArray)

    inputRDD.mapPartitionsInternal { iter =>
      val grouped = GroupedIterator(iter, groupingAttributes, child.output)
We should use grouping instead of groupingAttributes here?
Other places pass groupingAttributes to GroupedIterator. What's the difference between the two?
I was thinking that the implementation at that time didn't support grouping like:

    df.groupby(col('id') % 2 == 0).apply(...)

but the change I proposed doesn't work either.
The current implementation seems not to support the grouping above, though.
I sent a pr to your repository to support these cases icexelloss#4.
Could you take a look at it?
Yes thanks much! I will take a look now.
    private[sql] def flatMapGroupsInPandas(
        expr: PythonUDF
    ): DataFrame = {
The passed PythonUDF can possibly be a non-vectorized UDF too. Add a check for it?
Done
python/pyspark/sql/group.py (Outdated)

    df = DataFrame(self._jgd.df(), self.sql_ctx)
    func = udf_obj.func
    returnType = udf_obj.returnType
Check if the return type is struct type?
Done
python/pyspark/sql/group.py (Outdated)

    jgd = self._jgd.pivot(pivot_col, values)
    return GroupedData(jgd, self.sql_ctx)

    def apply(self, udf_obj):
Maybe we can define another pandas_df_udf for this kind of pandas udf? We can also check for this kind of pandas udf, e.g. I think it should have just one parameter?
I am not sure if that is necessary. We can check that the function has just one parameter in apply() without introducing a new pandas_df_udf.
I'm basically concerned that there is no distinct difference between the current pandas udf and the new one for apply. But it seems we can distinguish them by looking at the return type? If so, we may have no need of pandas_df_udf.
But we should update the doc of pandas_udf for this kind of (apply) pandas udf.
I see. It's a totally valid concern. Yeah I think we can distinguish them by returnType.
I will update the doc.
Doc updated
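For context, the checks discussed above amount to roughly the following sketch (the helper name and error messages are made up for this illustration; the PR discussion places such checks inside GroupedData.apply()):

    import inspect
    from pyspark.sql.types import StructType

    def _validate_grouped_udf(udf_obj):
        # apply() expects a vectorized (pandas) udf whose returnType is a StructType
        # describing the output columns...
        if not isinstance(udf_obj.returnType, StructType):
            raise ValueError("returnType of the udf passed to groupby().apply() "
                             "must be a StructType")
        # ...and whose wrapped function takes exactly one argument, a pd.DataFrame.
        if len(inspect.signature(udf_obj.func).parameters) != 1:
            raise ValueError("the function passed to groupby().apply() must take "
                             "exactly one pandas.DataFrame argument")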
Thanks all for the initial review! I will address some comments and upload a new version today.
    /**
     * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas]]
     *
     * Rows in each group are passed to the python worker as a Arrow record batch.
a Arrow -> an Arrow
minor nits: capitalize Python and Java, and change to Pandas.DataFrame in these paragraphs
Fixed "a Arrow -> an Arrow"
Fixed "Python and Java capitalization"
I am actually leaning toward keeping pandas.DataFrame . The preference to pandas is usually lower case:
https://pandas.pydata.org/pandas-docs/stable/
I had some minor comments on the docs, otherwise LGTM!
Test build #82587 has finished for PR 18732 at commit
Test build #82599 has finished for PR 18732 at commit
Merged to master.
Nice work 👍
@HyukjinKwon Thanks! Thanks to everyone for reviewing this tirelessly.
A late question: shall we create another API for it instead of reusing pandas_udf?
I think @viirya raised this question too - #18732 (comment) - and I think I also left a few worries about this here and there. To me, +0.
@cloud-fan, it's a good question. I thought quite a bit about it and discussed with @viirya - #18732 (review). Just to recap, I think from an API perspective, having just one pandas_udf decorator is simpler for users. Another thought is that even if we were to introduce something like pandas_df_udf, the two kinds of udfs can still be distinguished by their return type.
@ueshin is working on pandas UDAF, let's wait for his feedback.
I'm +0 for now. As for adding pandas UDAF, I think we need another decorator or something to specify whether it supports partial aggregation or not, and the related parameters if needed.
How to name the UDF defined in this PR?
Grouped UDFs, or Grouped Vectorized UDFs.
I submitted a pr #19505 to introduce a separate decorator for this kind of grouped udf.
I am still not crazy about introducing a separate decorator for this.
@icexelloss I think as an API it's a little confusing that the same pandas_udf serves two different kinds of UDFs. From my experience of Java/Scala API design, I think it's a bad idea to have a method with many parameters as flags; we'd better have more methods. For this case, separate APIs for the different scenarios would be clearer.
@cloud-fan Thanks for your feedback. I think it makes sense to define different UDF types. I also agree we shouldn't add many parameters as flags. However, there are a few things I am not sure about.
Yea it's a bad idea as there are many combinations, and I just wanna use different APIs for different scenarios. Different scenarios usually have different requirements, and having different APIs can help us satisfy these requirements individually.
Let's discuss more on the new PR. At least we should create different UDF types in the implementation; the user-facing API can remain pandas_udf.
@cloud-fan Sounds good. Thanks!
What changes were proposed in this pull request? This is a follow-up of #18732. This pr modifies the `GroupedData.apply()` method to convert a pandas udf to a grouped udf implicitly. How was this patch tested? Existing tests. Author: Takuya UESHIN <[email protected]> Closes #19517 from ueshin/issues/SPARK-20396/fup2.
What changes were proposed in this pull request?
This PR adds an apply() function on df.groupby(). apply() takes a pandas udf that is a transformation from pandas.DataFrame to pandas.DataFrame.
Static schema
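A minimal sketch of the static-schema usage, assuming a Spark DataFrame df with a long column id and a double column v (the names and the transformation are illustrative):

    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StructType, StructField, LongType, DoubleType

    schema = StructType([
        StructField('id', LongType()),
        StructField('v', DoubleType())])

    def normalize(pdf):
        # pdf holds all rows of one group; the returned pd.DataFrame must match
        # the declared schema (column names and types)
        v = pdf.v
        return pdf.assign(v=(v - v.mean()) / v.std())

    normalize_udf = pandas_udf(normalize, schema)
    df.groupby('id').apply(normalize_udf)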
Dynamic schema
This use case is removed from the PR and we will discuss it as a follow-up. See discussion #18732 (review).
Another example uses pd.DataFrame dtypes as the output schema of the udf:
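A sketch consistent with the pattern discussed earlier in this thread (foo's body is a placeholder for some pd.DataFrame transformation):

    sample_df = df.filter(df.id == 1).toPandas()

    def foo(pdf):
        ret = ...  # some transformation on the input pd.DataFrame
        return ret

    foo_udf = pandas_udf(foo, foo(sample_df).dtypes)
    df.groupby('id').apply(foo_udf)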
In the interactive use case, the user usually has a sample pd.DataFrame to test the function foo in their notebook. Being able to use foo(sample_df).dtypes frees the user from specifying the output schema of foo manually.
Design doc: https://github.com/icexelloss/spark/blob/pandas-udf-doc/docs/pyspark-pandas-udf.md
How was this patch tested?