[SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec #2356

Ishiihara · 2014-09-11T09:59:37Z

@mengxr
Added PySpark support for Word2Vec.
Change list:
(1) PySpark support for Word2Vec
(2) SerDe support for string sequences on both the Python side and the JVM side
(3) Test for SerDe of string sequences on the JVM side

SparkQA · 2014-09-11T10:53:21Z

QA tests have started for PR 2356 at commit 48d5e72.

This patch merges cleanly.

SparkQA · 2014-09-11T10:54:29Z

QA tests have finished for PR 2356 at commit 48d5e72.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Word2VecModel(object):
- class Word2Vec(object):

SparkQA · 2014-09-11T11:53:32Z

QA tests have started for PR 2356 at commit 68e7276.

This patch merges cleanly.

SparkQA · 2014-09-11T12:51:09Z

QA tests have finished for PR 2356 at commit 68e7276.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Word2VecModel(object):
- class Word2Vec(object):

SparkQA · 2014-09-11T17:34:35Z

QA tests have started for PR 2356 at commit ca1e5ff.

This patch merges cleanly.

SparkQA · 2014-09-11T18:41:03Z

QA tests have finished for PR 2356 at commit ca1e5ff.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Word2VecModel(object):
- class Word2Vec(object):

mengxr · 2014-09-12T08:30:56Z

@davies Could you take a look at this PR and see whether there is an easier way for SerDe? Thanks!

davies · 2014-09-12T16:17:49Z

@mengxr I'm looking into this. Could we hold this for a few days until we find a scalable way to do serialization?

mengxr · 2014-09-16T01:46:56Z

@davies Thanks for working on MLlib's SerDe! It definitely simplifies future Python API implementations. We will wait for #2378.

JoshRosen · 2014-09-22T18:46:40Z

Now that #2378 has been merged, is this unblocked?

Ishiihara · 2014-09-22T18:48:17Z

We need to modify the implementation to use the new SerDe mechanism. Working on that now.

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala python/pyspark/mllib/_common.py

Ishiihara · 2014-09-25T19:16:31Z

@mengxr PR updated to use the new pickle SerDe. The pickle SerDe is slow: 9 of the 16 minutes are spent in SerDe.

SparkQA · 2014-09-25T19:19:36Z

QA tests have started for PR 2356 at commit 78bbb53.

This patch merges cleanly.

SparkQA · 2014-09-25T20:28:36Z

QA tests have finished for PR 2356 at commit 78bbb53.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-09-25T20:28:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20816/

davies · 2014-09-25T21:34:56Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

Maybe it's better to cache the serialized data from Python; it will reduce GC pressure (and use less memory).

+1 on @davies's suggestion

You don't need any type conversion here. word2vec.fit(dataJRDD) should work.

@mengxr @davies Thank you for pointing this out. I am inclined to cache the words RDD inside word2vec.fit, as I discovered that the words RDD is used twice: first when calling learnVocab(words) and again when creating the newSentences RDD. This approach will not increase memory usage because the computations on the words RDD and the newSentences RDD do not overlap. Thus, we can unpersist the words RDD before caching newSentences.

@mengxr @davies Also, this approach saves roughly 40s on the text8 data in Scala, and it eliminates the need to cache on the Python side.
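To make the reasoning about the two passes concrete, here is a minimal pure-Python sketch. `LazyDataset` is a hypothetical toy stand-in for an RDD, not Spark's API, and `fit` only mimics the shape of the flow described above (one pass for the vocabulary, one for the index-mapped sentences); the real implementation lives in the Scala Word2Vec code.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputes from its source on every pass unless cached."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces a list
        self._cached = None
        self._want_cache = False
        self.computations = 0      # how many times the source was recomputed

    def cache(self):
        self._want_cache = True
        return self

    def unpersist(self):
        self._want_cache = False
        self._cached = None
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._want_cache:
            self._cached = data
        return data


def fit(words):
    """Two passes over `words`, loosely mirroring learnVocab and newSentences."""
    words.cache()
    # Pass 1: build the vocabulary (stands in for learnVocab(words)).
    vocab = {w: i for i, w in enumerate(sorted(set(words.collect())))}
    # Pass 2: map each word to its vocabulary index (stands in for newSentences).
    new_sentences = [vocab[w] for w in words.collect()]
    # The words are no longer needed, so release them before caching anything else.
    words.unpersist()
    return vocab, new_sentences


words = LazyDataset(lambda: "a b a c b a".split())
vocab, new_sentences = fit(words)
print(words.computations)  # the source is computed once instead of twice
```

Because the cached words are released before the second dataset would be cached, the two never need to sit in memory at the same time, which is the point made in the comment above.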

@davies Are we always using batched serialization? The pythonToJava function in PythonRDD returns JavaRDD[Any]. Should I use JavaRDD[java.util.ArrayList[String]] as the return type? Thanks!
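For readers unfamiliar with the term, the sketch below illustrates what batched pickle serialization of string sequences means in general, using only the Python standard library. The helper names `dump_batches` and `load_batches` are hypothetical and are not PySpark's serializer classes; the pickle protocol is chosen only for illustration.

```python
import pickle
from itertools import islice


def dump_batches(sentences, batch_size):
    """Yield one pickled bytes object per batch of up to `batch_size` sentences."""
    it = iter(sentences)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch, protocol=2)


def load_batches(payloads):
    """Flatten pickled batches back into individual sentences."""
    for payload in payloads:
        for sentence in pickle.loads(payload):
            yield sentence


# Each sentence is a sequence of strings, as in the Word2Vec input.
sentences = [["a", "b"], ["c"], ["d", "e", "f"]]
payloads = list(dump_batches(sentences, batch_size=2))
restored = list(load_batches(payloads))
print(len(payloads), restored == sentences)
```

With batching, each serialized record holds a list of items rather than a single item, so the receiving side has to flatten the batches before it sees individual sentences.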

davies · 2014-09-25T21:37:22Z

Could you add some tests?

mengxr · 2014-09-25T22:10:21Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

order imports alphabetically (https://plugins.jetbrains.com/plugin/7350)

mengxr · 2014-10-06T21:23:59Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

unpersist data after training explicitly because the user won't have access to it.

mengxr · 2014-10-07T00:04:48Z

@Ishiihara Another file to update is python/docs/pyspark.mllib.rst. We need a section for the pyspark.mllib.feature module.

Ishiihara · 2014-10-07T00:07:02Z

@mengxr Will take care of that and the other comments.

AmplabJenkins · 2014-10-07T20:07:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21399/

Ishiihara · 2014-10-07T20:10:06Z

test this please

AmplabJenkins · 2014-10-07T20:22:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21400/

SparkQA · 2014-10-07T20:25:58Z

QA tests have started for PR 2356 at commit daf88a6.

This patch does not merge cleanly!

mengxr · 2014-10-07T20:41:23Z

@Ishiihara Could you try to merge master? Maybe the Python doc conf changed.

SparkQA · 2014-10-07T21:37:14Z

QA tests have finished for PR 2356 at commit daf88a6.

This patch fails unit tests.
This patch does not merge cleanly!

Conflicts: python/run-tests

Ishiihara · 2014-10-07T22:02:03Z

test this please

SparkQA · 2014-10-07T22:04:47Z

QA tests have started for PR 2356 at commit b13a0b9.

This patch merges cleanly.

SparkQA · 2014-10-07T22:05:53Z

QA tests have finished for PR 2356 at commit b13a0b9.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Word2VecModel(object):
- class Word2Vec(object):

AmplabJenkins · 2014-10-07T22:05:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21408/

Ishiihara · 2014-10-07T22:07:56Z

test this please

Ishiihara · 2014-10-07T22:13:01Z

retest this please

SparkQA · 2014-10-07T22:14:46Z

QA tests have started for PR 2356 at commit 476ea34.

This patch merges cleanly.

SparkQA · 2014-10-07T22:19:58Z

QA tests have started for PR 2356 at commit 476ea34.

This patch merges cleanly.

SparkQA · 2014-10-07T23:22:11Z

QA tests have finished for PR 2356 at commit 476ea34.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Word2VecModel(object):
- class Word2Vec(object):

AmplabJenkins · 2014-10-07T23:22:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21411/

SparkQA · 2014-10-07T23:22:32Z

QA tests have finished for PR 2356 at commit 476ea34.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Word2VecModel(object):
- class Word2Vec(object):

AmplabJenkins · 2014-10-07T23:22:36Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21412/

mengxr · 2014-10-07T23:45:27Z

LGTM. Merged into master. Thanks! I created a JIRA as a reminder to add a Python code example to the user guide: https://issues.apache.org/jira/browse/SPARK-3838 . It is not a high-priority task; it is there in case we forget it before the 1.2 release.

Ishiihara added 3 commits September 10, 2014 01:51

add Word2Vec to pyspark

c867fdf

minor fix

0ad3ac1

Functionality improvement

48d5e72

minor style fixes

68e7276

fix test

ca1e5ff

Ishiihara added 2 commits September 22, 2014 21:49

Merge remote-tracking branch 'upstream/master' into Word2Vec-python

a264b08

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala python/pyspark/mllib/_common.py

use pickle for seq string SerDe

78bbb53

davies reviewed Sep 25, 2014

mengxr reviewed Sep 25, 2014

mengxr reviewed Oct 6, 2014

modification according to feedback

daf88a6

Ishiihara added 2 commits October 7, 2014 14:37

Merge remote-tracking branch 'upstream/master' into Word2Vec-python

8671eba

Conflicts: python/run-tests

resolve merge conflicts and minor fixes

b13a0b9

style fixes

476ea34

asfgit closed this in 098c734 Oct 7, 2014
