[SPARK-42412][WIP] Initial PR of Spark connect ML #40297
Conversation
connector/connect/common/src/main/protobuf/spark/connect/ml.proto (review thread resolved; outdated)
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto (review thread resolved; outdated)
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala (review thread resolved; outdated)
connector/connect/common/src/main/protobuf/spark/connect/relations.proto (review thread resolved)
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala (review thread resolved; outdated)
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/Serializer.scala (review thread resolved; outdated)
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala (review thread resolved; outdated)
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala (three more review threads resolved; outdated)
grundprinzip left a comment:
First round of reviews on the protos.
  MlEvaluator evaluator = 1;
}

message LoadModel {
Would this work with an arbitrary model, for example one provided by Spark NLP?
The current PR does not support third-party estimators. To support a third-party algorithm, we need to register its class with the AlgorithmRegistry class.
If we want to support third-party algorithms without a registry, then we inevitably have to use Java reflection to invoke methods (e.g. we need to invoke XXXModel.load to load a model), which is unsafe.
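To make the reflection concern concrete, here is a minimal Scala sketch (illustrative only, not code from this PR; the helper name is invented) of the kind of call this would require. The client supplies the class name, so the server ends up invoking arbitrary classes, which is the unsafe part:

import org.apache.spark.ml.Model

// Hypothetical helper, for illustration only: load a model via reflection.
def loadModelByReflection(className: String, path: String): Model[_] = {
  // A Scala companion object compiles to a class named "<className>$"
  // whose singleton instance is held in the static MODULE$ field.
  val companionClass = Class.forName(className + "$")
  val companion = companionClass.getField("MODULE$").get(null)
  // Invoke the companion's load(path) method, i.e. XXXModel.load(path).
  val loadMethod = companionClass.getMethod("load", classOf[String])
  loadMethod.invoke(companion, path).asInstanceOf[Model[_]]
}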
By the way, supporting third-party estimators is risky, because on a shared cluster we will bin-pack the Spark workers across different customers (according to @mengxr's explanation). A third-party estimator implementation might invoke RDD transformations (e.g. RDD.map) that we cannot isolate by container, so it is risky to allow users to run third-party estimators on a shared cluster.
  MlParams params = 2;
  string uid = 3;
  StageType type = 4;
  enum StageType {
Is this knowledge actually required on the client?
Yes.
Alternatively, we could make the server side infer the stage type from the stage name, but having the client fill in the stage type keeps the code simpler.
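For what server-side inference could look like, a hedged Scala sketch (illustrative only, not part of this PR): resolve the stage's class and check it against the ML base types at runtime.

import org.apache.spark.ml.{Estimator, Transformer}

// Illustrative only: infer a stage's type from its class name on the server,
// instead of having the client send it in the proto message.
def inferStageType(stageClassName: String): String = {
  val cls = Class.forName(stageClassName)
  if (classOf[Estimator[_]].isAssignableFrom(cls)) "ESTIMATOR"
  else if (classOf[Transformer].isAssignableFrom(cls)) "TRANSFORMER"
  else "UNSPECIFIED"
}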
connector/connect/common/src/main/protobuf/spark/connect/ml_common.proto (review thread resolved)
}

message ModelTransform {
  Relation input = 1;
  int64 model_ref_id = 2;
My suggestion here is to wrap the model_ref_id in an extra message, which makes it easier to extend:

message ModelRef {
  int64 id = 1;
}

That said, is there a reason the ID is numeric rather than a string?
The ID is generated from an incrementing counter, so I think the int64 type should be fine.
message ModelRef {
  int64 id = 1;
}

This sounds good.
> The ID is generated from an incrementing counter.

Using a random UUID might be a better idea if we want to support server failover in the future: we would need to persist the status and restore it, and a random UUID helps avoid reusing an ID that was generated before.
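A small sketch contrasting the two schemes (illustrative; neither object exists in this PR, and the AtomicLong is an assumption about how the counter is implemented). The counter is simple but restarts from zero after a failover, while UUIDs never collide with previously persisted IDs, at the cost of changing the proto field from int64 to string:

import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

// Counter scheme: compact int64 IDs, but a restarted server reissues
// values that may already exist in persisted state.
object CounterModelRefIds {
  private val counter = new AtomicLong(0L)
  def next(): Long = counter.incrementAndGet()
}

// UUID scheme: random string IDs, safe to issue after restoring state.
object UuidModelRefIds {
  def next(): String = UUID.randomUUID().toString
}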
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala (review thread resolved; outdated)
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLHandler.scala (review thread resolved; outdated)
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLHandler.scala (review thread resolved)
    return remote_cls


def try_remote_ml_class(x):
I feel we can also simplify the pyspark.sql side by applying this annotation to only a few key classes.
cc @HyukjinKwon
private[spark] def _setDefault(paramPairs: ParamPair[_]*): this.type = {
  setDefault(paramPairs: _*)
}
I think we can simply change setDefault to protected[spark]?
> I think we can simply change setDefault to protected[spark]?

That would be a breaking change. Some third-party estimators might override this method, and if they are not under the org.apache package, compilation will fail.
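A toy Scala example of the breakage being described, using stand-in types rather than Spark's actual Params trait: an override may not have weaker access than the member it overrides, and code outside the spark package cannot write protected[spark], so widening the base method's visibility would strand any existing third-party override.

package org.apache.spark {
  abstract class Base {
    // Today's visibility. Changing this to protected[spark] would make the
    // third-party override below fail with "weaker access privileges".
    protected def setDefault(): Unit
  }
}

package com.example {
  class ThirdParty extends org.apache.spark.Base {
    override protected def setDefault(): Unit = ()
  }
}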
from abc import ABCMeta, abstractmethod

import os
Is this import needed?
@classmethod
def getActiveSession(cls) -> Any:
    raise NotImplementedError("getActiveSession() is not implemented.")
Do we need this change? I thought we could use the newly added getOrCreate.
Oh, I will revert this.
  UNSPECIFIED = 0;
  ESTIMATOR = 1;
  TRANSFORMER = 2;
We normally name enum values like this:

  STAGE_TYPE_UNSPECIFIED = 0;
  STAGE_TYPE_ESTIMATOR = 1;
  STAGE_TYPE_TRANSFORMER = 2;
globs = pyspark.sql.connect.dataframe.__dict__.copy()

globs["spark"] = (
    PySparkSession.builder.appName("sql.connect.ml.classification tests")
Suggested change:
- PySparkSession.builder.appName("sql.connect.ml.classification tests")
+ PySparkSession.builder.appName("ml.connect.classification tests")
The doctest should be added in sparktestsupport/modules.py.
@@ -0,0 +1,61 @@
/*
Will we move these ML files to connector/connect/server/src/main/scala/org/apache/spark/ml/connect?
  }
}

class LogisticRegressionAlgorithm extends Algorithm {
If we can use Java reflection to invoke methods, we don't need the registry class; we just need some configuration data for the registry. If we plan to make Spark Connect mode mandatory for DBR starting with Spark 4, then we had better use Java reflection invocation; otherwise it is hard to support a huge number of third-party estimators.
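One hedged way to picture "configuration data instead of a registry class" (illustrative only; the object and config shape are invented, though the estimator class name is real): keep an allowlisted set of class names and instantiate them reflectively, so supporting a new algorithm means adding a configuration entry rather than writing another Algorithm subclass.

import org.apache.spark.ml.Estimator

object AllowlistedEstimators {
  // Hypothetical configuration: the estimator classes the server may
  // instantiate on behalf of a client.
  private val allowed: Set[String] = Set(
    "org.apache.spark.ml.classification.LogisticRegression"
  )

  def create(className: String): Estimator[_] = {
    require(allowed.contains(className), s"Estimator not allowlisted: $className")
    // Reflective no-arg construction replaces a hand-written per-estimator
    // registry class.
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[Estimator[_]]
  }
}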
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Design doc:
https://docs.google.com/document/d/1V5rOgksmOnA8AsJFZ_rasSYDQuP06_vrcfp3RY_22o8/edit#
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?
Testing code: run the command bin/pyspark --remote local, then in the Python REPL run the following code: