[SPARK-7861][ML] PySpark OneVsRest #12124
Conversation
|
One more thing to discuss: shall we use a parallel for-loop in fit() of OneVsRest, just like its Scala counterpart? |
|
Test build #54752 has finished for PR 12124 at commit
|
|
Test build #54816 has finished for PR 12124 at commit
|
|
Using a parallel for loop sounds good to me. |
|
I'll try to figure it out. |
|
OK, thanks. Hopefully there are existing examples of parallel for-loops in the codebase to work from. |
python/pyspark/ml/classification.py
Outdated
duplicatedClassifier = classifier.__class__()
duplicatedClassifier._resetUid(classifier.uid)
classifier._copyValues(duplicatedClassifier)
return duplicatedClassifier.fit(trainingDataset, paramMap)
@jkbradley I've added multi-thread support for OneVsRest. But one thing we should be careful about here: copy() in spark.ml creates a new instance, i.e. a deep copy, while the pyspark.ml one is a shallow copy. The shallow copy causes a multi-thread issue in the fit method because it copies the paramMap onto the current classifier.
I added the duplication here, but we could also change the copy method of pyspark.ml into a deep copy.
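A self-contained sketch of the duplication being described (it mirrors the diff above; `trainingDataset` and `paramMap` are the names used there, and the wrapper function name is made up for illustration):

```python
def _fit_with_fresh_copy(classifier, trainingDataset, paramMap):
    # Create a brand-new estimator instance so concurrent fits never share
    # mutable state, then give it the original uid and param values.
    duplicatedClassifier = classifier.__class__()
    duplicatedClassifier._resetUid(classifier.uid)
    classifier._copyValues(duplicatedClassifier)
    return duplicatedClassifier.fit(trainingDataset, paramMap)
```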
Thanks for doing this. But...I just talked with Josh, who strongly recommended not using multiprocessing for fear of some possible side-effects. Would you mind reverting the change and just training one model at a time? My apologies for the switch!
I'd like us to do multiple jobs at once in the future, but we should do more careful prototyping and testing than we have time for in Spark 2.0. I'll make a new JIRA and link it to this one.
I mean, it's possible that multiprocessing may work depending on how the Py4J socket, locks, etc. are shared with the forked child JVMs... but yeah, there are some questions to answer. Explicit use of Thread within a single Python interpreter would probably be easier to reason about.
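For illustration only, a minimal sketch of that thread-based approach (this is not the PR's final code, which ends up training serially; the helper and variable names are made up):

```python
import threading

def fit_binary_models(classifier, datasets):
    # `datasets` is one relabeled training DataFrame per class. Blocking Py4J
    # calls do not hold the GIL, so plain threads in a single interpreter are
    # enough to submit Spark jobs concurrently.
    models = [None] * len(datasets)

    def fit_one(index, dataset):
        # Each thread works on its own copy of the estimator (see the
        # shallow- vs. deep-copy discussion above).
        models[index] = classifier.copy().fit(dataset)

    threads = [threading.Thread(target=fit_one, args=(i, d))
               for i, d in enumerate(datasets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return models
```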
No problem, let's remove it for now.
|
Test build #55159 has finished for PR 12124 at commit
|
|
Test build #55157 has finished for PR 12124 at commit
|
binaryLabelCol = "mc2b$" + str(index)
trainingDataset = multiclassLabeled.withColumn(
    binaryLabelCol,
    when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
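As a side note for readers, this relabeling can be tried in isolation; a small sketch, assuming a DataFrame `multiclassLabeled` with a `label` column:

```python
from pyspark.sql.functions import when

index = 0
binaryLabelCol = "mc2b$" + str(index)
# Rows belonging to the current class get 1.0; every other class gets 0.0.
relabeled = multiclassLabeled.withColumn(
    binaryLabelCol,
    when(multiclassLabeled["label"] == float(index), 1.0).otherwise(0.0))
```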
Uh oh, I just realized this will only work with LogisticRegression and NaiveBayes. With trees, there is no good way to set the metadata from PySpark. We'll need to document that.
But I'm hoping to fix trees to not need metadata for 2.0, if we have time.
Yeah, that's absolutely a problem since PySpark cannot handle metadata for now. I'll document it.
|
Test build #55178 has finished for PR 12124 at commit
|
|
Test build #55181 has finished for PR 12124 at commit
|
|
@jkbradley Ready for review. I'll try to fix trees if there is still time before 2.0. |
python/pyspark/ml/classification.py
Outdated
| """ | ||
| if extra is None: | ||
| extra = dict() | ||
| return self._copyValues(OneVsRest(self.getClassifier().copy(extra))) |
Is this correct? I think what you had before was better.
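For context, one plausible shape for copy() that keeps the estimator's own params and still deep-copies the embedded classifier might look roughly like this (a sketch, not necessarily the PR's final diff; Params comes from pyspark.ml.param):

```python
def copy(self, extra=None):
    if extra is None:
        extra = dict()
    # Copy OneVsRest's own params (labelCol, featuresCol, ...) first,
    # then give the new instance its own copy of the nested classifier.
    newOvr = Params.copy(self, extra)
    if self.isSet(self.classifier):
        newOvr.setClassifier(self.getClassifier().copy(extra))
    return newOvr
```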
|
Thanks for pinging me! I'll make a final pass after the merge conflicts are fixed. |
|
Test build #55795 has finished for PR 12124 at commit
|
|
@jkbradley Merged and fixed the |
python/pyspark/ml/classification.py
Outdated
... Row(label=1.0, features=Vectors.sparse(2, [], [])),
... Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
>>> lr = LogisticRegression(maxIter=5, regParam=0.01)
>>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
No need to rename the predictionCol
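For comparison, a usage sketch with the default prediction column left alone (plain code rather than doctest; assumes `sc` is an active SparkContext, and on releases before the ml.linalg move the Vectors import would come from pyspark.mllib.linalg instead):

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

df = sc.parallelize([
    Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    Row(label=1.0, features=Vectors.sparse(2, [], [])),
    Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
lr = LogisticRegression(maxIter=5, regParam=0.01)
ovr = OneVsRest(classifier=lr)      # default "prediction" column is fine
model = ovr.fit(df)
predictions = model.transform(df)   # adds the "prediction" column
```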
python/pyspark/ml/classification.py
Outdated
| """ | ||
| Sets the value of :py:attr:`classifier`. | ||
| .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are |
Actually MultilayerPerceptronClassifier is not supported since it does not have a rawPredictionCol.
|
Thanks for the updates! I made a final pass. |
|
One last comment: Since this implementation is fully in Python, could you please port some of the unit tests from OneVsRestSuite.scala to ml/tests.py? Thanks! |
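For example, one of the Scala tests translates fairly directly; a sketch, assuming a base class like SparkSessionTestCase from ml/tests.py that starts a SparkContext (estimator wrappers need an active SparkContext even to be constructed), with illustrative naming:

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest


class OneVsRestTests(SparkSessionTestCase):

    def test_copy(self):
        lr = LogisticRegression(maxIter=5, regParam=0.01)
        ovr = OneVsRest(classifier=lr)
        ovr1 = ovr.copy({lr.maxIter: 10})
        # Copying must not mutate the original, and the extra param map
        # should propagate to the copied classifier.
        self.assertEqual(ovr.getClassifier().getMaxIter(), 5)
        self.assertEqual(ovr1.getClassifier().getMaxIter(), 10)
```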
|
Thanks, I am updating them now. |
python/pyspark/ml/classification.py
Outdated
| .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are | ||
| supported now. | ||
| """ | ||
| self._paramMap[self.classifier] = value |
Use _set instead. See #11939.
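For illustration, the recommended pattern for the setter would look roughly like this (a sketch; only the assignment line differs from the diff above):

```python
def setClassifier(self, value):
    """
    Sets the value of :py:attr:`classifier`.
    """
    # _set stores the value in this instance's param map and is the
    # supported way to assign params in pyspark.ml.
    self._set(classifier=value)
    return self
```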
|
Test build #55867 has finished for PR 12124 at commit
|
|
Test build #55868 has finished for PR 12124 at commit
|
|
@jkbradley Ready for another look |
|
Good catch on the model copy() method. |
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-7861

Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.

## How was this patch tested?

Tested with doctests.

Author: Xusen Yin <[email protected]>

Closes apache#12124 from yinxusen/SPARK-14306-7861.
|
@jkbradley Do you still have plans to solve the metadata problem for tree methods? I see that SPARK-7126 aims to solve the problem via auto-indexing for DataFrames. |
|
I'm working on a simpler fix for now: https://issues.apache.org/jira/browse/SPARK-14862
What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-7861
Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.
How was this patch tested?
Tested with doctests.