[SPARK-8467][MLlib][PySpark] Add LDAModel.describeTopics() in Python #8643
Conversation
Test build #42096 has finished for PR 8643 at commit

Test build #42099 has finished for PR 8643 at commit
@yu-iskw Rather than using Java Any types and the old serialization patterns, would it be easier to convert to a local DataFrame? We should be able to take advantage of DataFrame serialization.

@jkbradley Thank you for the comment. Just to be sure, @davies, what do you think?

@yu-iskw I think it's OK to use DataFrame internally in spark.mllib. It already has the dependency, and it would be a private API.

@jkbradley Sorry for the delay in my update. I tried to use DataFrame serialization at yu-iskw@2f70193. Could you review it?

Test build #44182 has finished for PR 8643 at commit

Test build #44799 has finished for PR 8643 at commit
Serializing a DataFrame will trigger a Spark job; we could still use Pickle to serialize them without a DataFrame, via `PythonMLLibAPI.dumps()`.
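The point above can be illustrated with a minimal pure-Python sketch: the topic description is essentially a list of (term indices, term weights) pairs, which the standard `pickle` module can round-trip on the driver without triggering a Spark job. (The actual Spark code path goes through its own SerDe helpers; the shapes here are illustrative only.)

```python
import pickle

# A topic description shaped like describeTopics() output:
# one (term indices, term weights) pair per topic.
topics = [
    ([0, 2, 5], [0.45, 0.30, 0.25]),
    ([1, 3, 4], [0.50, 0.28, 0.22]),
]

# Pickling this nested structure runs entirely on the driver,
# so no Spark job is triggered, unlike collecting a DataFrame.
payload = pickle.dumps(topics)
assert pickle.loads(payload) == topics
```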
@davies Thanks for the comment. Should we rather use `PythonMLlibAPI.dumps()` than Java Any types like below?
yu-iskw@e1c66d0#diff-71f42172be0b5fc14827b7bb31f4e80bR34
Test build #44881 has finished for PR 8643 at commit
I reverted from DataFrame serialization to Java Any types.
nit: this line could be `Array[Any](terms, termWeights)`
Test build #45250 has finished for PR 8643 at commit
Jenkins, test this please

jenkins, test this please

Test build #45258 has finished for PR 8643 at commit
@jkbradley @davies could you review it? I modified the type conversion using
python/pyspark/mllib/clustering.py
no space around `=`
python/pyspark/mllib/clustering.py
we could still call it `model`
LGTM, but a few minor comments.

Jenkins, test this please

Test build #2002 has finished for PR 8643 at commit
@davies Thanks for the review. I fixed them.

LGTM, merging this into master and 1.6 branch, thanks!
Could @jkbradley and @davies review it?

- Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from PySpark directly: `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.

[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)

Author: Yu ISHIKAWA <[email protected]>

Closes #8643 from yu-iskw/SPARK-8467-2.
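The wrapper idea can be sketched in plain Python. This is a hypothetical stand-in only: the real `LDAModelWrapper` lives on the Scala side and is reached through Py4J; the names, shapes, and sample weights below are illustrative, not the actual Spark implementation.

```python
class LDAModelWrapper:
    """Hypothetical Python stand-in for the Scala-side wrapper.

    It holds the topic description as plain lists, so PySpark never has
    to convert Scala's Array[(Array[Int], Array[Double])] directly.
    """

    def __init__(self, topics):
        # topics: list of (term indices, term weights) pairs, one per topic
        self._topics = topics

    def describeTopics(self, maxTermsPerTopic=None):
        if maxTermsPerTopic is None:
            return list(self._topics)
        # Assumes each topic's terms are already sorted by descending weight.
        return [(indices[:maxTermsPerTopic], weights[:maxTermsPerTopic])
                for indices, weights in self._topics]


wrapper = LDAModelWrapper([
    ([0, 2, 5], [0.45, 0.30, 0.25]),
    ([1, 3, 4], [0.50, 0.28, 0.22]),
])
print(wrapper.describeTopics(maxTermsPerTopic=2))
# → [([0, 2], [0.45, 0.3]), ([1, 3], [0.5, 0.28])]
```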
Thank you for merging it and your great support!

@yu-iskw Thanks for this! Quick request: could you please send a little follow-up PR to document (in the Python doc) what is being returned?

@jkbradley Sure!

@jkbradley I sent the PR at #9577.