Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Mar 6, 2023

What changes were proposed in this pull request?

Design doc:
https://docs.google.com/document/d/1V5rOgksmOnA8AsJFZ_rasSYDQuP06_vrcfp3RY_22o8/edit#

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Testing code:

run command bin/pyspark --remote local , in python REPL, run following code:

from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lor = LogisticRegression()
# Set params
lor.setMaxIter(2)

df0 = spark.read.format("libsvm").load("data/mllib/sample_binary_classification_data.txt")
# Train model
lor_model = lor.fit(df0)
infer_df = df0.sample(0.5)

# Prediction
prediction_df = lor_model.transform(infer_df)
prediction_df.show()

# Test model attributes
print(lor_model.coefficients)
print(lor_model.intercept)
print(lor_model.coefficientMatrix)
print(lor_model.interceptVector)

# Test model summary methods
print(lor_model.summary.featuresCol)
lor_model.summary.roc.show()
print(lor_model.summary.areaUnderROC)
lor_model.summary.pr.show()
lor_model.summary.fMeasureByThreshold.show()
lor_model.summary.precisionByThreshold.show()

print(lor_model.summary.weightedFalsePositiveRate)
print(lor_model.summary.precisionByLabel)
print(lor_model.summary.objectiveHistory)

summary2 = lor_model.evaluate(infer_df)
summary2.roc.show()
print(summary2.precisionByLabel)

# save estimator
lor.write().overwrite().save("/tmp/lore_001")
loaded_lor = LogisticRegression.load("/tmp/lore_001")

# save model
lor_model.write().overwrite().save("/tmp/lor_001")
# load model
loaded_model = LogisticRegressionModel.read().load("/tmp/lor_001")
# Test loaded model transformation
loaded_model.transform(infer_df).show()

Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Copy link
Contributor

@grundprinzip grundprinzip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round of reviews on the protos.

MlEvaluator evaluator = 1;
}

message LoadModel {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work with arbitrary model for example provided by Spark NLP?

Copy link
Contributor Author

@WeichenXu123 WeichenXu123 Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For current PR, it does not support third-party estimators.
We need to register related class for 3rd-party algorithm to AlgorithmRegistry class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to support 3rd-party algorithm without registry, then inevitably we have to use java reflection to invoke methods (e.g. We need to invoke XXXModel.load to load model, which is unsafe.

Copy link
Contributor Author

@WeichenXu123 WeichenXu123 Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, supporting 3rd-party estimators is risky, because in shared cluster we will binpack the spark workers across different customers (according to @mengxr 's explanation)
But 3rd-party estimators implementation might invoke RDD transformation (e.g. RDD.map) that we cannot isolate them by container. So it is risky if we allow user uses 3rd-party estimators on shared cluster.

MlParams params = 2;
string uid = 3;
StageType type = 4;
enum StageType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this knowledge actually required on the client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can make server side infer the stage type from stage name,
but let client fill the stage type is easier for code.

}
message ModelTransform {
Relation input = 1;
int64 model_ref_id = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion here is to maybe wrap the moddel_ref_id into an extra message object that becomes easier to extend.

message ModelRef {
  int64 id = 1;
}

That said, is there a reason the ID is numeric vs a string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ID is generated from a increamental counter. So I think int64 type should be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message ModelRef {
int64 id = 1;
}

This sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ID is generated from a increamental counter.

Using random UUID might be a better idea , if we want to support server failover in future (we need to persist status and restore it, random UUID can help avoiding reusing ID that is generated before.)

@github-actions github-actions bot removed the BUILD label Mar 14, 2023
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
return remote_cls


def try_remote_ml_class(x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we can also simplify the pyspark.sql side by only using this annotation to the a few key classes

cc @HyukjinKwon

Comment on lines +803 to +805
private[spark] def _setDefault(paramPairs: ParamPair[_]*): this.type = {
setDefault(paramPairs: _*)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simply change setDefault to protected[spark] ?

Copy link
Contributor Author

@WeichenXu123 WeichenXu123 Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simply change setDefault to protected[spark] ?

This should be a breaking change.

Some 3rd-party estimator might override this method, if they are not under "org.apache" package, then compiling will fail.


from abc import ABCMeta, abstractmethod

import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this import needed?


@classmethod
def getActiveSession(cls) -> Any:
raise NotImplementedError("getActiveSession() is not implemented.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this change ? I thought we can use the newly added getOrCreate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh. I will revert this.

Comment on lines 47 to 49
UNSPECIFIED = 0;
ESTIMATOR = 1;
TRANSFORMER = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we normally name enums like this

    STAGE_TYPE_UNSPECIFIED = 0;
    STAGE_TYPE_ESTIMATOR = 1;
    STAGE_TYPE_TRANSFORMER = 2;

globs = pyspark.sql.connect.dataframe.__dict__.copy()

globs["spark"] = (
PySparkSession.builder.appName("sql.connect.ml.classification tests")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PySparkSession.builder.appName("sql.connect.ml.classification tests")
PySparkSession.builder.appName("ml.connect.classification tests")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doctest should be added in sparktestsupport/modules.py

@@ -0,0 +1,61 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we move these ml files to connector/connect/server/src/main/scala/org/apache/spark/ml/connect ?

}
}

class LogisticRegressionAlgorithm extends Algorithm {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can use java reflection to invoke methods, we don't need the registry class, we just need some configuration data for registry.

If we plan to mandatorily enable spark connect mode since spark 4 for DBR, then we'd better use java reflection invocation. Otherwise it is hard to support huge number of 3rd-party estimators.

CC @grundprinzip

Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
@github-actions github-actions bot added the BUILD label Mar 21, 2023
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
@github-actions
Copy link

github-actions bot commented Jul 1, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 1, 2023
@github-actions github-actions bot closed this Jul 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants