-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec #2356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
QA tests have started for PR 2356 at commit
|
|
QA tests have finished for PR 2356 at commit
|
|
QA tests have started for PR 2356 at commit
|
|
QA tests have finished for PR 2356 at commit
|
|
QA tests have started for PR 2356 at commit
|
|
QA tests have finished for PR 2356 at commit
|
|
@davies Could you take a look at this PR and see whether there is an easier way for SerDe? Thanks! |
|
@mengxr I'm looking into this, could we block this a few days until we find out the scalable way to do serialization? |
|
Now that #2378 has been merged, is this unblocked? |
|
We need to modify the implementation to use the new SerDe mechanism. Working on that now. |
Conflicts: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala python/pyspark/mllib/_common.py
|
@mengxr PR updated to use new pickle SerDe. The pickle SerDe is slow, it spends 9 out of 16 minutes in SerDe. |
|
QA tests have started for PR 2356 at commit
|
|
QA tests have finished for PR 2356 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's better to cache serialized data from Python, it will reduce the GC pressure (also less memory).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on @davies 's suggestion
You don't need any type conversion here. word2vec.fit(dataJRDD) should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mengxr @davies Thank you for pointing this out. I am inclined to cache words RDD inside word2vec.fit as I discovered that words RDD is used twice, the first time is calling learnVocab(words) and the second time is creating newSentences RDD. This method will not increate memory as there is no overlapping between computation on words RDD and newSentences RDD. Thus, we can unpersist words RDD before caching newSentences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies Are we always using batched serialization? The pythonToJava funciton at PythonRDD returns JavaRDD[Any]. Should I use JavaRDD[java.util.ArrayList[String]] as the return type? Thanks!
|
Could you add some tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order imports alphabetically (https://plugins.jetbrains.com/plugin/7350)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unpersist data after training explicitly because the user won't have access to it.
|
@Ishiihara Another file to update is |
|
@mengxr will take care of that and other comments |
|
Test FAILed. |
|
test this please |
|
Test FAILed. |
|
QA tests have started for PR 2356 at commit
|
|
@Ishiihara Could you try to merge master? Maybe the python doc conf changed. |
|
QA tests have finished for PR 2356 at commit
|
Conflicts: python/run-tests
|
test this please |
|
QA tests have started for PR 2356 at commit
|
|
QA tests have finished for PR 2356 at commit
|
|
Test FAILed. |
|
test this please |
|
retest this please |
|
QA tests have started for PR 2356 at commit
|
|
QA tests have started for PR 2356 at commit
|
|
QA tests have finished for PR 2356 at commit
|
|
Test PASSed. |
|
QA tests have finished for PR 2356 at commit
|
|
Test PASSed. |
|
LGTM. Merged into master. Thanks! I created a JIRA to remember add Python code example to the user guide: https://issues.apache.org/jira/browse/SPARK-3838 . Not a high priority task, just in case we forget it before 1.2 release. |
@mengxr
Added PySpark support for Word2Vec
Change list
(1) PySpark support for Word2Vec
(2) SerDe support of string sequence both on python side and JVM side
(3) Test for SerDe of string sequence on JVM side