[SPARK-43084] [SS] Add applyInPandasWithState support for spark connect #40736

pengzhon-db · 2023-04-11T05:39:57Z

What changes were proposed in this pull request?

This change adds applyInPandasWithState support for Spark Connect.
Example (try it in local mode with ./bin/pyspark --remote "local[*]"):

>>> from pyspark.sql.streaming.state import GroupStateTimeout, GroupState
>>> from pyspark.sql.types import (
...     LongType,
...     StringType,
...     StructType,
...     StructField,
...     Row,
... )
>>> import pandas as pd
>>> output_type = StructType(
...     [StructField("key", StringType()), StructField("countAsString", StringType())]
... )
>>> state_type = StructType([StructField("c", LongType())])
>>> def func(key, pdf_iter, state):
...     total_len = 0
...     for pdf in pdf_iter:
...         total_len += len(pdf)
...     state.update((total_len,))
...     yield pd.DataFrame({"key": [key[0]], "countAsString": [str(total_len)]})
...
>>>
>>> input_path = "/Users/peng.zhong/tmp/applyInPandasWithState"
>>> df = spark.readStream.format("text").load(input_path)
>>> q = (
...       df.groupBy(df["value"])
...       .applyInPandasWithState(
...           func, output_type, state_type, "Update", GroupStateTimeout.NoTimeout
...       )
...       .writeStream.queryName("this_query")
...       .format("memory")
...       .outputMode("update")
...       .start()
...   )
>>>
>>> q.status
{'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
>>>
>>> spark.sql("select * from this_query").show()
+-----+-------------+
|  key|countAsString|
+-----+-------------+
|hello|            1|
| this|            1|
+-----+-------------+

Why are the changes needed?

This change adds API support for applyInPandasWithState in Spark Connect.

Does this PR introduce any user-facing change?

Yes. This change adds API support for applyInPandasWithState in Spark Connect.

How was this patch tested?

Manually tested.

pengzhon-db · 2023-04-11T16:39:47Z

python/pyspark/sql/connect/proto/base_pb2.pyi

Not sure why it generates this. Will look into it

rangadi

LGTM.

python/pyspark/sql/connect/group.py

rangadi · 2023-04-11T17:10:26Z

python/pyspark/sql/connect/proto/types_pb2.pyi

These changes might be due to a difference in the Python code generator.
Cc: @HyukjinKwon (are we planning to generate these at build time to avoid issues like this?).

That's exactly what I have been asking around about, but it seems that's a bit difficult.

cc @grundprinzip and @zhengruifeng FYI

I think I saw a similar annotation generated in another PR, but it was later removed in that PR.

python/pyspark/sql/connect/proto/catalog_pb2.pyi

WweiL · 2023-04-11T18:14:43Z

python/pyspark/sql/connect/group.py

I see other parts of the code use quoted types, i.e. Union["StructType", str]. Maybe we should also do that for consistency?
Quoting also seems to help with forward references.

The current function uses Union[StructType, str]. Also, the majority of places in other files use this form.

Using quotes is only necessary when the type cannot be imported properly because of cyclic references, or when the type is not defined yet.
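To illustrate the two cases mentioned above, here is a minimal sketch that does not use PySpark. The names Node, scale, and plain are hypothetical, and Decimal merely stands in for a type that is only importable by the type checker.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Imported only for static type checkers. At runtime the name "Decimal"
    # is not bound in this module, which mimics a cyclic-import workaround.
    from decimal import Decimal


class Node:
    # "Node" is not fully defined while its own body is being executed,
    # so the annotation has to be quoted.
    def link(self, other: "Node") -> "Node":
        self.next = other
        return other


def scale(value: Union["Decimal", str]) -> str:
    # "Decimal" is quoted because the name does not exist at runtime here.
    return str(value)


def plain(value: Union[int, str]) -> str:
    # Both names are available at definition time, so no quotes are needed.
    return str(value)
```

When the type is imported normally at module level, as StructType is in the function under review, the unquoted form works the same way as plain does here.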

WweiL · 2023-04-11T18:55:54Z

python/pyspark/sql/connect/group.py

I'm not sure we can do it like this. Above, UserDefinedFunction takes the same outputStructType but handles it differently:

spark/python/pyspark/sql/connect/udf.py

Line 101 in 007a42b

self.returnType: DataType = (

and then

spark/python/pyspark/sql/connect/expressions.py

Lines 595 to 598 in 007a42b

    if isinstance(self._output_type, UnparsedDataType):
        parsed = session._analyze(
            method="ddl_parse", ddl_string=self._output_type.data_type_string
        ).parsed

Should we also follow this pattern?

Maybe @ueshin has more context?

Good idea to port the tests as well. We need to add a Connect version of test_pandas_grouped_map_with_state.py.

@WweiL The return type of a UDF is a bit different. Here I am following the way DataStreamReader handles the schema: the client sends it as a string and the server parses it.
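As a rough sketch of the "send a string, let the server parse it" idea, the snippet below normalizes a schema argument to a string on the client side. It is not the code from this PR: SchemaStandIn and FieldStandIn are hypothetical stand-ins for StructType and StructField so that the example runs without PySpark, and schema_to_wire_string is an illustrative helper name.

```python
import json
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FieldStandIn:
    """Hypothetical stand-in for a struct field (name plus type name)."""
    name: str
    type_name: str


@dataclass
class SchemaStandIn:
    """Hypothetical stand-in for a struct schema with a json() method."""
    fields: List[FieldStandIn]

    def json(self) -> str:
        # Serialize the schema so it can be shipped to the server as text.
        return json.dumps(
            {
                "type": "struct",
                "fields": [
                    {"name": f.name, "type": f.type_name} for f in self.fields
                ],
            }
        )


def schema_to_wire_string(schema: Union[SchemaStandIn, str]) -> str:
    """Return the string a client would send; parsing is left to the server."""
    if isinstance(schema, SchemaStandIn):
        return schema.json()
    # Already a string (for example a DDL string): pass it through unchanged.
    return schema
```

With this shape the client never has to parse a DDL string itself, which is the difference from the UDF return-type path discussed above.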

@rangadi Added versionchanged

WweiL · 2023-04-11T19:10:35Z

Also, I think we need to add unit tests for this by reusing the tests from #37894.

You can follow my PR to:

create a new mixin class that contains all test cases from the original one but does not extend ReusedSQLTestCase,
create a new class below it with the original name that extends this mixin class and ReusedSQLTestCase, and
create a parity test class that extends this mixin class and ReusedConnectTestCase
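
The three steps above can be sketched with plain unittest. This is only an illustration of the pattern: SQLTestBaseStandIn and ConnectTestBaseStandIn are hypothetical stand-ins for ReusedSQLTestCase and ReusedConnectTestCase, and the other class names and the single test are made up for the example.

```python
import unittest


class SQLTestBaseStandIn(unittest.TestCase):
    """Hypothetical stand-in for ReusedSQLTestCase."""
    backend = "classic"


class ConnectTestBaseStandIn(unittest.TestCase):
    """Hypothetical stand-in for ReusedConnectTestCase."""
    backend = "connect"


class GroupedMapWithStateTestsMixin:
    """Step 1: all test cases live here; the mixin extends no test base."""

    def test_backend_is_configured(self):
        # self.backend and assertIn come from whichever base class the
        # concrete test class combines this mixin with.
        self.assertIn(self.backend, ("classic", "connect"))


class GroupedMapWithStateTests(GroupedMapWithStateTestsMixin, SQLTestBaseStandIn):
    """Step 2: the original class name, now mixin plus the SQL base."""


class GroupedMapWithStateParityTests(
    GroupedMapWithStateTestsMixin, ConnectTestBaseStandIn
):
    """Step 3: the parity class, mixin plus the Connect base."""
```

Because the mixin itself is not a TestCase, test runners do not collect it on its own; each test runs once per concrete class.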

pengzhon-db · 2023-04-12T17:56:56Z

@WweiL I added a test file, test_parity_pandas_grouped_map_with_state.py. However, I had to skip all of the tests because spark.streams is not supported in Connect for now.

WweiL · 2023-04-14T20:56:21Z

It seems there is a lint error. You can run PYTHON_EXECUTABLE=python3.9 ./dev/lint-python, or just ./dev/lint-python, before committing to make sure it passes.

pengzhon-db · 2023-04-17T04:16:47Z

@HyukjinKwon can u help merge this?

HyukjinKwon · 2023-04-18T00:21:24Z

Merged to master.

github-actions bot added CONNECT CORE PYTHON SQL labels Apr 11, 2023

rangadi approved these changes Apr 11, 2023

WweiL reviewed Apr 11, 2023

HyukjinKwon approved these changes Apr 12, 2023

github-actions bot added the BUILD label Apr 12, 2023

rangadi approved these changes Apr 12, 2023

github-actions bot added DOCS INFRA ML PANDAS API ON SPARK R STRUCTURED STREAMING YARN labels Apr 12, 2023

pengzhon-db force-pushed the connect_applyInPandasWithState branch from 3360690 to abd49b3 Compare April 12, 2023 19:25

github-actions bot removed ML DOCS INFRA PANDAS API ON SPARK R YARN STRUCTURED STREAMING labels Apr 12, 2023

pengzhon-db added 3 commits April 13, 2023 11:57

Add spark connect for applyInPandasWithState

514ca81

change schema type to string

4d9429e

remove unused code

b1d6471

pengzhon-db added 4 commits April 13, 2023 11:57

format

c0b32a3

proto change

2099668

add test file

85be6c1

add versionchanged

e4fe03d

pengzhon-db force-pushed the connect_applyInPandasWithState branch from abd49b3 to e4fe03d Compare April 13, 2023 18:59

pengzhon-db added 3 commits April 13, 2023 12:03

remove unsupported group function test

4878b5b

formatting

3453ce2

python format

2190975

pengzhon-db added 2 commits April 16, 2023 13:26

format

393d2de

revert change

e202424

HyukjinKwon closed this in cbe94a1 Apr 18, 2023


6 participants
