@Ishiihara
Contributor

@mengxr
Added PySpark support for Word2Vec.
Change list:
(1) PySpark support for Word2Vec
(2) SerDe support for string sequences on both the Python side and the JVM side
(3) Test for SerDe of string sequences on the JVM side

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2356 at commit 48d5e72.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2356 at commit 48d5e72.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Word2VecModel(object):
    • class Word2Vec(object):

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2356 at commit 68e7276.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2356 at commit 68e7276.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Word2VecModel(object):
    • class Word2Vec(object):

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2356 at commit ca1e5ff.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2356 at commit ca1e5ff.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Word2VecModel(object):
    • class Word2Vec(object):

@mengxr
Contributor

mengxr commented Sep 12, 2014

@davies Could you take a look at this PR and see whether there is an easier way for SerDe? Thanks!

@davies
Contributor

davies commented Sep 12, 2014

@mengxr I'm looking into this. Could we hold this for a few days until we find a scalable way to do serialization?

@mengxr
Contributor

mengxr commented Sep 16, 2014

@davies Thanks for working on MLlib's SerDe! It definitely simplifies future Python API implementations. We will wait for #2378.

@JoshRosen
Contributor

Now that #2378 has been merged, is this unblocked?

@Ishiihara
Contributor Author

We need to modify the implementation to use the new SerDe mechanism. Working on that now.

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
	python/pyspark/mllib/_common.py
@Ishiihara
Contributor Author

@mengxr PR updated to use the new pickle SerDe. The pickle SerDe is slow: 9 of the 16 minutes are spent in SerDe.

@SparkQA

SparkQA commented Sep 25, 2014

QA tests have started for PR 2356 at commit 78bbb53.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 25, 2014

QA tests have finished for PR 2356 at commit 78bbb53.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20816/

Contributor

Maybe it's better to cache the serialized data from Python; that would reduce GC pressure (and use less memory).

Contributor

+1 on @davies's suggestion.

You don't need any type conversion here. word2vec.fit(dataJRDD) should work.

Contributor Author

@mengxr @davies Thank you for pointing this out. I am inclined to cache the words RDD inside word2vec.fit, since I discovered that the words RDD is used twice: first when calling learnVocab(words) and again when creating the newSentences RDD. This will not increase memory usage because the computations on the words RDD and the newSentences RDD do not overlap. Thus, we can unpersist the words RDD before caching newSentences.

Contributor Author

@mengxr @davies Also, this approach saves roughly 40s on the text8 data in Scala, and it eliminates the need to cache on the Python side.

Contributor Author

@davies Are we always using batched serialization? The pythonToJava function in PythonRDD returns JavaRDD[Any]. Should I use JavaRDD[java.util.ArrayList[String]] as the return type? Thanks!
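
For readers following the batched-serialization question, here is a minimal pure-Python sketch of the idea: records (here, sentences as lists of strings) are grouped into batches, and each batch is pickled as one object. This is not PySpark's actual serializer code; the helper names `dump_batches` and `load_batches` and the batch size are hypothetical and chosen only for illustration.

```python
import pickle
from itertools import islice


def dump_batches(records, batch_size=1024):
    """Yield one pickled bytes object per batch of up to batch_size records.

    Hypothetical helper for illustration, not part of PySpark.
    """
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Each pickle holds a whole list of records, not a single record.
        yield pickle.dumps(batch, protocol=2)


def load_batches(blobs):
    """Unpickle each batch and yield the individual records again."""
    for blob in blobs:
        for record in pickle.loads(blob):
            yield record


# Each record is one sentence, represented as a sequence of strings.
sentences = [["spark", "mllib"], ["word2vec", "in", "python"], ["pickle", "serde"]]
blobs = list(dump_batches(sentences, batch_size=2))
restored = list(load_batches(blobs))
```

With batching, whoever unpickles a blob gets a list of records rather than a single record, so the receiving side has to flatten the batches before it can treat each element as one sentence.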

@davies
Contributor

davies commented Sep 25, 2014

Could you add some tests?

Contributor

order imports alphabetically (https://plugins.jetbrains.com/plugin/7350)

Contributor

unpersist data after training explicitly because the user won't have access to it.

@mengxr
Contributor

mengxr commented Oct 7, 2014

@Ishiihara Another file to update is python/docs/pyspark.mllib.rst. We need a section for pyspark.mllib.feature module.

@Ishiihara
Contributor Author

@mengxr will take care of that and other comments

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21399/

@Ishiihara
Contributor Author

test this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21400/

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have started for PR 2356 at commit daf88a6.

  • This patch does not merge cleanly!

@mengxr
Contributor

mengxr commented Oct 7, 2014

@Ishiihara Could you try to merge master? Maybe the python doc conf changed.

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2356 at commit daf88a6.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

@Ishiihara
Contributor Author

test this please

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have started for PR 2356 at commit b13a0b9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2356 at commit b13a0b9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Word2VecModel(object):
    • class Word2Vec(object):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21408/

@Ishiihara
Contributor Author

test this please

@Ishiihara
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have started for PR 2356 at commit 476ea34.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2356 at commit 476ea34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Word2VecModel(object):
    • class Word2Vec(object):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21411/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21412/

@asfgit closed this in 098c734 on Oct 7, 2014
@mengxr
Contributor

mengxr commented Oct 7, 2014

LGTM. Merged into master. Thanks! I created a JIRA as a reminder to add a Python code example to the user guide: https://issues.apache.org/jira/browse/SPARK-3838 . It is not a high-priority task; it is just in case we forget before the 1.2 release.
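
As a rough idea of what such a Python example might look like, here is a small sketch using the `Word2Vec` and `Word2VecModel` classes this PR adds to `pyspark.mllib.feature`. The helpers `tokenize` and `fit_word2vec`, the toy corpus, and the parameter values are assumptions made for illustration and are not taken from the user guide. The Spark import is deferred so the helper module loads without a Spark installation.

```python
def tokenize(lines):
    """Split each line on single spaces and drop empty tokens.

    Hypothetical helper; a real pipeline would likely use a proper tokenizer.
    """
    return [[w for w in line.split(" ") if w] for line in lines]


def fit_word2vec(sc, lines, vector_size=10, seed=42):
    """Train a Word2Vec model on an RDD of token lists (illustrative wrapper)."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.mllib.feature import Word2Vec

    doc = sc.parallelize(tokenize(lines))
    return Word2Vec().setVectorSize(vector_size).setSeed(seed).fit(doc)


# Toy corpus: every word repeats many times so that none is too rare
# to enter the vocabulary.
corpus = ["a b " * 100 + "a c " * 10] * 2

# Usage, assuming an existing SparkContext named `sc`:
#   model = fit_word2vec(sc, corpus)
#   for word, similarity in model.findSynonyms("a", 2):
#       print(word, similarity)
```

The input to `fit` is an RDD in which each element is one sentence given as a sequence of strings, which is the string-sequence shape the SerDe changes in this PR were added to support.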
