Conversation

@yinxusen
Contributor

@yinxusen yinxusen commented Apr 2, 2016

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-7861

Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.

How was this patch tested?

Tested with doctests.

@yinxusen
Contributor Author

yinxusen commented Apr 2, 2016

@jkbradley @mengxr

One more thing to discuss: shall we use a parallel for-loop in fit() of OneVsRest, just like its Scala counterpart?

@SparkQA

SparkQA commented Apr 2, 2016

Test build #54752 has finished for PR 12124 at commit 6d30d77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2016

Test build #54816 has finished for PR 12124 at commit b17cc7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Using a parallel for loop sounds good to me.

@yinxusen
Contributor Author

yinxusen commented Apr 4, 2016

I'll try to figure it out.

@jkbradley
Member

OK, thanks. Hopefully there are existing examples of parallel for-loops in the codebase to work from.

duplicatedClassifier = classifier.__class__()
duplicatedClassifier._resetUid(classifier.uid)
classifier._copyValues(duplicatedClassifier)
return duplicatedClassifier.fit(trainingDataset, paramMap)
Contributor Author

@jkbradley I've added multi-thread support for OneVsRest. But one thing to be careful about here: copy() in spark.ml creates a new instance, i.e. a deep copy, while the pyspark.ml one is a shallow copy. The shallow copy causes a multi-thread issue in the fit method because it copies the paramMap onto the current classifier.

I added the duplication here, but we could also change the copy method of pyspark.ml into a deep copy.
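The hazard can be illustrated without Spark. Below is a minimal sketch with a hypothetical `FakeEstimator` class standing in for a pyspark.ml Params object (none of these names are the actual pyspark.ml code): a shallow copy shares one param map across threads, while the duplication above gives each binary classifier an independent map.

```python
import copy

class FakeEstimator:
    """Hypothetical stand-in for a pyspark.ml Params object."""
    def __init__(self):
        self._paramMap = {}

def shallow_copy(est):
    # Shallow copy: the new object shares the same underlying dict,
    # so concurrent fits writing per-class params would race.
    dup = FakeEstimator()
    dup._paramMap = est._paramMap
    return dup

def duplicate(est):
    # Mirrors the duplication above: an independent param map per fit.
    dup = FakeEstimator()
    dup._paramMap = copy.deepcopy(est._paramMap)
    return dup

base = FakeEstimator()
shallow = shallow_copy(base)
shallow._paramMap["labelCol"] = "mc2b$0"
# The write through the shallow copy is visible on the shared classifier.

deep = duplicate(base)
deep._paramMap["labelCol"] = "mc2b$1"
# The duplicated classifier's write stays local; base is unaffected.
```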

Member

Thanks for doing this. But...I just talked with Josh, who strongly recommended not using multiprocessing for fear of some possible side-effects. Would you mind reverting the change and just training one model at a time? My apologies for the switch!

I'd like us to do multiple jobs at once in the future, but we should do more careful prototyping and testing than we have time for in Spark 2.0. I'll make a new JIRA and link it to this one.

Contributor

I mean, it's possible that multiprocessing may work depending on how the Py4J socket, locks, etc. are shared with the forked child JVMs... but yeah, there are some questions to answer. Explicit use of Thread within a single Python interpreter would probably be easier to reason about.
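The thread-based alternative can be sketched with a dummy fit function in place of the real `classifier.fit` (hypothetical names throughout; in PySpark the Py4J call would release the GIL while the JVM trains, which is what makes threads useful here):

```python
from concurrent.futures import ThreadPoolExecutor

def fit_binary(index):
    # Placeholder for training one binary classifier; the real code would
    # call classifier.fit(trainingDataset, paramMap) on a duplicated
    # classifier so the threads do not share a param map.
    return ("model", index)

num_classes = 3
with ThreadPoolExecutor(max_workers=num_classes) as pool:
    # pool.map preserves input order, so models[i] is the model for class i.
    models = list(pool.map(fit_binary, range(num_classes)))
```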

Contributor Author

No problem, let's remove it for now.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55159 has finished for PR 12124 at commit ecdc742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 7, 2016

Test build #55157 has finished for PR 12124 at commit 47bd709.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

binaryLabelCol = "mc2b$" + str(index)
trainingDataset = multiclassLabeled.withColumn(
    binaryLabelCol,
    when(multiclassLabeled[labelCol] == float(index), 1.0).otherwise(0.0))
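The when/otherwise relabeling above can be mimicked in plain Python (a sketch of the same logic, not the DataFrame code):

```python
def binarize(labels, index):
    # One-vs-rest relabeling: class `index` becomes the positive class (1.0),
    # every other class becomes the negative class (0.0).
    return [1.0 if label == float(index) else 0.0 for label in labels]

labels = [0.0, 1.0, 2.0, 1.0, 0.0]
binary = binarize(labels, 1)  # positive class is 1
```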
Member

Uh oh, I just realized this will only work with LogisticRegression and NaiveBayes. With trees, there is no good way to set the metadata from PySpark. We'll need to document that.

Member

But I'm hoping to fix trees to not need metadata for 2.0, if we have time.

Contributor Author

Yeah, that's definitely a problem, since PySpark cannot handle metadata for now. I'll document it.

@SparkQA

SparkQA commented Apr 7, 2016

Test build #55178 has finished for PR 12124 at commit cf4df64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 7, 2016

Test build #55181 has finished for PR 12124 at commit fd4fc11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Contributor Author

yinxusen commented Apr 7, 2016

@jkbradley Ready for review. I'll try to fix trees if there's still time before 2.0.

@yinxusen
Contributor Author

@jkbradley

"""
if extra is None:
extra = dict()
return self._copyValues(OneVsRest(self.getClassifier().copy(extra)))
Member

Is this correct? I think what you had before was better.

@jkbradley
Member

Thanks for pinging me! I'll make a final pass after the merge conflicts are fixed.

@SparkQA

SparkQA commented Apr 14, 2016

Test build #55795 has finished for PR 12124 at commit 2fb4e3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Contributor Author

@jkbradley Merged and fixed the copy.

... Row(label=1.0, features=Vectors.sparse(2, [], [])),
... Row(label=2.0, features=Vectors.dense(0.5, 0.5))]).toDF()
>>> lr = LogisticRegression(maxIter=5, regParam=0.01)
>>> ovr = OneVsRest(classifier=lr).setPredictionCol("indexed")
Member

No need to rename the predictionCol.

"""
Sets the value of :py:attr:`classifier`.
.. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
Member

Actually MultilayerPerceptronClassifier is not supported since it does not have a rawPredictionCol.

@jkbradley
Member

Thanks for the updates! I made a final pass.

@jkbradley
Member

One last comment: Since this implementation is fully in Python, could you please port some of the unit tests from OneVsRestSuite.scala to ml/tests.py? Thanks!

@yinxusen
Contributor Author

Thanks, I am updating them now.

.. note:: Only LogisticRegression, NaiveBayes and MultilayerPerceptronClassifier are
    supported now.
"""
self._paramMap[self.classifier] = value
Member

Use _set instead. See apache/spark#11939.
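The motivation can be sketched with a toy Params-like class (hypothetical names, not the actual pyspark.ml implementation): a `_set` method gives one central place for validation and conversion, which direct `_paramMap` mutation bypasses.

```python
class FakeParams:
    """Hypothetical stand-in for pyspark.ml.param.Params."""
    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        # Centralized setter: validation lives here once, instead of being
        # skipped by ad-hoc writes like `self._paramMap[param] = value`.
        for name, value in kwargs.items():
            if value is None:
                raise ValueError("cannot set param %r to None" % name)
            self._paramMap[name] = value
        return self

p = FakeParams()
p._set(classifier="lr")
```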

@SparkQA

SparkQA commented Apr 14, 2016

Test build #55867 has finished for PR 12124 at commit 6002b92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 14, 2016

Test build #55868 has finished for PR 12124 at commit 4e95ecb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Contributor Author

@jkbradley Ready for another look.

@jkbradley
Member

Good catch on the model copy() method.
LGTM
Merging with master
Thanks @yinxusen !

@asfgit asfgit closed this in 90b46e0 Apr 15, 2016
lw-lin pushed a commit to lw-lin/spark that referenced this pull request Apr 20, 2016
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-7861

Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.

## How was this patch tested?

Tested with doctests.

Author: Xusen Yin <[email protected]>

Closes apache#12124 from yinxusen/SPARK-14306-7861.
@yinxusen
Contributor Author

@jkbradley Do you still have plans to solve the metadata problem for tree methods? I see that SPARK-7126 aims to solve it via auto-indexing for DataFrames.

@jkbradley
Member

I'm working on a simpler fix for now: [https://issues.apache.org/jira/browse/SPARK-14862]
