[SPARK-43084] [SS] Add applyInPandasWithState support for spark connect #40736
Conversation
Not sure why it generates this. Will look into it.
rangadi left a comment
LGTM.
These changes might be due to a difference in the Python code generator.
Cc: @HyukjinKwon (are we planning to generate these at build time to avoid issues like this?).
That's exactly what I have been asking around about, but it seems that's a bit difficult.
cc @grundprinzip and @zhengruifeng FYI
I think I saw a similar annotation generated in another PR, but it was later removed somehow in that PR.
python/pyspark/sql/connect/group.py
Outdated
I see other parts of the code use quoted types, i.e. Union["StructType", str]. Maybe we should also do that for code consistency?
Doing that also seems to help with forward references.
The current function uses Union[StructType, str]. Also, the majority of places in other files use this form.
Using quotes is only necessary when the type cannot be imported properly because of cyclic references, or when the type is not defined yet.
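For what it's worth, a minimal sketch of the case where quoting matters (the function and module layout here are hypothetical, not from this PR):

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Imported only for type checking, e.g. to break a cyclic import.
    from pyspark.sql.types import StructType


def with_schema(schema: Union["StructType", str]) -> None:
    # The quotes are required here: at runtime the name StructType is not
    # bound, because the import above only runs under TYPE_CHECKING.
    ...
```

If the type can be imported normally at runtime, the unquoted Union[StructType, str] works just as well.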
python/pyspark/sql/connect/group.py
Outdated
I'm not very sure we can do it like this. Above, UserDefinedFunction takes the same outputStructType but handles it differently:
spark/python/pyspark/sql/connect/udf.py, line 101 in 007a42b:
    self.returnType: DataType = (
and then
spark/python/pyspark/sql/connect/expressions.py, lines 595 to 598 in 007a42b:
    if isinstance(self._output_type, UnparsedDataType):
        parsed = session._analyze(
            method="ddl_parse", ddl_string=self._output_type.data_type_string
        ).parsed
Should we also follow this pattern?
Maybe @ueshin has more context?
Good idea to port the tests as well. We need to add a connect version of test_pandas_grouped_map_with_state.py.
@WweiL The return type of a UDF is a bit different. Here I am following the way DataStreamReader handles the schema: the server will parse it.
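To make the contrast concrete, here is a rough sketch of the DataStreamReader-style handling described above (serialize_schema is an illustrative helper, not the actual connect plan code):

```python
from typing import Union

from pyspark.sql.types import StructType


def serialize_schema(schema: Union[StructType, str]) -> str:
    # Illustrative client-side handling: no local DDL parsing.
    if isinstance(schema, StructType):
        # A concrete StructType can be serialized directly.
        return schema.json()
    # A DDL string is passed through unchanged; the server parses it while
    # building the plan. The UDF path differs: it wraps the string in
    # UnparsedDataType and later resolves it via the ddl_parse analysis call.
    return schema
```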
@rangadi Added the versionchanged directive.
Also, I think we need to add a unit test for this by reusing the tests in #37894. You can follow my PR to
@WweiL I added a test file, test_parity_pandas_grouped_map_with_state.py. However, I had to skip all its tests because spark.streams is not supported in connect for now.
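A skipped parity test along these lines might look roughly like the following (the class name is illustrative; only ReusedConnectTestCase from pyspark.testing.connectutils is assumed):

```python
import unittest

from pyspark.testing.connectutils import ReusedConnectTestCase


@unittest.skip("spark.streams is not supported in Spark Connect yet")
class GroupedMapInPandasWithStateParityTests(ReusedConnectTestCase):
    # Once streaming support lands in connect, the cases from
    # test_pandas_grouped_map_with_state.py can be reused here through a
    # shared mixin instead of being skipped wholesale.
    pass


if __name__ == "__main__":
    unittest.main()
```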
Force-pushed 3360690 to abd49b3 (compare)
Force-pushed abd49b3 to e4fe03d (compare)
Seems that there is a lint error; you can run
@HyukjinKwon can you help merge this?
Merged to master. |
What changes were proposed in this pull request?
This change adds applyInPandasWithState support for Spark Connect.
Example (try with local mode: ./bin/pyspark --remote "local[*]"):
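The original example block did not survive extraction; a rough sketch of the kind of usage this enables (schema and logic are illustrative, not the author's snippet) would be:

```python
import pandas as pd

from pyspark.sql.streaming.state import GroupStateTimeout

# Toy stateful count per key; `spark` is the remote session from the shell.
df = spark.readStream.format("rate").load().selectExpr("value % 10 AS key")


def count_fn(key, pdf_iter, state):
    # Accumulate a running count for this key across micro-batches.
    total = sum(len(pdf) for pdf in pdf_iter)
    (old,) = state.get if state.exists else (0,)
    state.update((old + total,))
    yield pd.DataFrame({"key": [key[0]], "count": [old + total]})


query = (
    df.groupBy("key")
    .applyInPandasWithState(
        count_fn,
        outputStructType="key LONG, count LONG",
        stateStructType="count LONG",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
    .writeStream.format("console")
    .outputMode("update")
    .start()
)
```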
Why are the changes needed?
This change adds applyInPandasWithState API support for Spark Connect.
Does this PR introduce any user-facing change?
Yes; this change adds a new API, applyInPandasWithState, for Spark Connect.
How was this patch tested?
Manually tested.