[SPARK-7861][ML] PySpark OneVsRest #12124
Conversation
|
One more thing to discuss: shall we use a parallel for-loop in fit() of OneVsRest, just like its Scala counterpart? |
|
Test build #54752 has finished for PR 12124 at commit
|
|
Test build #54816 has finished for PR 12124 at commit
|
|
Using a parallel for loop sounds good to me. |
|
I'll try to figure it out. |
|
OK, thanks. Hopefully there are existing examples of parallel for-loops in the codebase to work from. |
python/pyspark/ml/classification.py
Outdated
duplicatedClassifier = classifier.__class__()
duplicatedClassifier._resetUid(classifier.uid)
classifier._copyValues(duplicatedClassifier)
return duplicatedClassifier.fit(trainingDataset, paramMap)
@jkbradley I've added multi-thread support for OneVsRest. But one thing we should be careful about here: copy() in spark.ml creates a new instance, i.e. a deep copy, while the pyspark.ml one is a shallow copy. The shallow copy causes a multi-thread issue in the fit method because it copies the paramMap onto the current classifier.
I added the duplication here, but we could also change the copy method of pyspark.ml into a deep copy.
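A self-contained sketch of the duplication being described (it mirrors the diff above; `trainingDataset` and `paramMap` are the names used there, and the wrapper function name is made up for illustration):

```python
def _fit_with_fresh_copy(classifier, trainingDataset, paramMap):
    # Create a brand-new estimator instance so concurrent fits never share
    # mutable state, then give it the original uid and param values.
    duplicatedClassifier = classifier.__class__()
    duplicatedClassifier._resetUid(classifier.uid)
    classifier._copyValues(duplicatedClassifier)
    return duplicatedClassifier.fit(trainingDataset, paramMap)
```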
Thanks for doing this. But...I just talked with Josh, who strongly recommended not using multiprocessing for fear of some possible side-effects. Would you mind reverting the change and just training one model at a time? My apologies for the switch!
I'd like us to do multiple jobs at once in the future, but we should do more careful prototyping and testing than we have time for in Spark 2.0. I'll make a new JIRA and link it to this one.
I mean, it's possible that multiprocessing may work depending on how the Py4J socket, locks, etc. are shared with the forked child JVMs... but yeah, there are some questions to answer. Explicit use of Thread within a single Python interpreter would probably be easier to reason about.
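For illustration only, a minimal sketch of that thread-based approach (this is not the PR's final code, which ends up training serially; the helper and variable names are made up):

```python
import threading

def fit_binary_models(classifier, datasets):
    # `datasets` is one relabeled training DataFrame per class. Blocking Py4J
    # calls do not hold the GIL, so plain threads in a single interpreter are
    # enough to submit Spark jobs concurrently.
    models = [None] * len(datasets)

    def fit_one(index, dataset):
        # Each thread works on its own copy of the estimator (see the
        # shallow- vs. deep-copy discussion above).
        models[index] = classifier.copy().fit(dataset)

    threads = [threading.Thread(target=fit_one, args=(i, d))
               for i, d in enumerate(datasets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return models
```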
No problem, let's remove it for now.
|
Test build #55159 has finished for PR 12124 at commit
|
|
Test build #55157 has finished for PR 12124 at commit
|
binaryLabelCol = "mc2b$" + str(index)
trainingDataset = multiclassLabeled.withColumn(
    binaryLabelCol,
    when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
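As a side note for readers, this relabeling can be tried in isolation; a small sketch, assuming a DataFrame `multiclassLabeled` with a `label` column:

```python
from pyspark.sql.functions import when

index = 0
binaryLabelCol = "mc2b$" + str(index)
# Rows belonging to the current class get 1.0; every other class gets 0.0.
relabeled = multiclassLabeled.withColumn(
    binaryLabelCol,
    when(multiclassLabeled["label"] == float(index), 1.0).otherwise(0.0))
```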
Uh oh, I just realized this will only work with LogisticRegression and NaiveBayes. With trees, there is no good way to set the metadata from PySpark. We'll need to document that.
But I'm hoping to fix trees to not need metadata for 2.0, if we have time.
Yeah, that's absolutely a problem since PySpark cannot handle metadata for now. I'll document it.
|
Test build #55178 has finished for PR 12124 at commit
|
|
Test build #55181 has finished for PR 12124 at commit
|
|
@jkbradley Ready for review. I'll try to fix trees if there is still time before 2.0. |
python/pyspark/ml/classification.py
Outdated
| """ | ||
| if extra is None: | ||
| extra = dict() | ||
| return self._copyValues(OneVsRest(self.getClassifier().copy(extra))) |
Is this correct? I think what you had before was better.
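For context, one plausible shape for copy() that keeps the estimator's own params and still deep-copies the embedded classifier might look roughly like this (a sketch, not necessarily the PR's final diff; Params comes from pyspark.ml.param):

```python
def copy(self, extra=None):
    if extra is None:
        extra = dict()
    # Copy OneVsRest's own params (labelCol, featuresCol, ...) first,
    # then give the new instance its own copy of the nested classifier.
    newOvr = Params.copy(self, extra)
    if self.isSet(self.classifier):
        newOvr.setClassifier(self.getClassifier().copy(extra))
    return newOvr
```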
|
Thanks for pinging me! I'll make a final pass after the merge conflicts are fixed. |
|
Test build #55795 has finished for PR 12124 at commit
|
|
@jkbradley Merged and fixed the |
python/pyspark/ml/classification.py
Outdated
... Row(label=1.0, features=Vectors.sparse(2, [], [])),
... Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
>>> lr = LogisticRegression(maxIter=5, regParam=0.01)
>>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
No need to rename the predictionCol
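For comparison, a usage sketch with the default prediction column left alone (plain code rather than doctest; assumes `sc` is an active SparkContext, and on releases before the ml.linalg move the Vectors import would come from pyspark.mllib.linalg instead):

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

df = sc.parallelize([
    Row(label=0.0, features=Vectors.dense(1.0, 0.8)),
    Row(label=1.0, features=Vectors.sparse(2, [], [])),
    Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
lr = LogisticRegression(maxIter=5, regParam=0.01)
ovr = OneVsRest(classifier=lr)      # default "prediction" column is fine
model = ovr.fit(df)
predictions = model.transform(df)   # adds the "prediction" column
```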
python/pyspark/ml/classification.py
Outdated
| """ | ||
| Sets the value of :py:attr:`classifier`. | ||
| .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are |
Actually MultilayerPerceptronClassifier is not supported since it does not have a rawPredictionCol.
|
Thanks for the updates! I made a final pass. |
|
One last comment: Since this implementation is fully in Python, could you please port some of the unit tests from OneVsRestSuite.scala to ml/tests.py? Thanks! |
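For example, one of the Scala tests translates fairly directly; a sketch, assuming a base class like SparkSessionTestCase from ml/tests.py that starts a SparkContext (estimator wrappers need an active SparkContext even to be constructed), with illustrative naming:

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest


class OneVsRestTests(SparkSessionTestCase):

    def test_copy(self):
        lr = LogisticRegression(maxIter=5, regParam=0.01)
        ovr = OneVsRest(classifier=lr)
        ovr1 = ovr.copy({lr.maxIter: 10})
        # Copying must not mutate the original, and the extra param map
        # should propagate to the copied classifier.
        self.assertEqual(ovr.getClassifier().getMaxIter(), 5)
        self.assertEqual(ovr1.getClassifier().getMaxIter(), 10)
```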
|
Thanks, I am updating them now. |
python/pyspark/ml/classification.py
Outdated
| .. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are | ||
| supported now. | ||
| """ | ||
| self._paramMap[self.classifier] = value |
Use _set instead. See #11939.
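For illustration, the recommended pattern for the setter would look roughly like this (a sketch; only the assignment line differs from the diff above):

```python
def setClassifier(self, value):
    """
    Sets the value of :py:attr:`classifier`.
    """
    # _set stores the value in this instance's param map and is the
    # supported way to assign params in pyspark.ml.
    self._set(classifier=value)
    return self
```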
|
Test build #55867 has finished for PR 12124 at commit
|
|
Test build #55868 has finished for PR 12124 at commit
|
|
@jkbradley Ready for another look |
|
Good catch on the model copy() method. |
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-7861

Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.

## How was this patch tested?

Tested with doctests.

Author: Xusen Yin <[email protected]>

Closes apache#12124 from yinxusen/SPARK-14306-7861.
|
@jkbradley Do you still have plans to solve the metadata problem for tree methods? I see that SPARK-7126 aims to solve the problem via auto-indexing for DataFrames. |
|
I'm working on a simpler fix for now: https://issues.apache.org/jira/browse/SPARK-14862
What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-7861
Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.
How was this patch tested?
Tested with doctests.