-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20003] [ML] FPGrowthModel setMinConfidence should affect rules generation and transform #17336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ping @jkbradley and @srowen to be aware of the issue. also cc @MLnick who's working on python API. |
| ).first().getAs[Seq[String]]("prediction") | ||
|
|
||
| assert(prediction === Seq("3")) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't change this one, just move it to keep parameter and save/load check at the bottom.
|
Test build #74752 has finished for PR 17336 at commit
|
|
Test build #74995 has finished for PR 17336 at commit
|
|
ping @jkbradley as this is something we should fix before release. |
|
Thanks for this PR! Do you think it's worth adding the caching logic? I'm now wondering if we should change associationRules into a method which recomputes the DataFrame every time it is called. That will make it easier for the user to understand the semantics, and it would be easy for a user to define a val to hold onto the computed DataFrame if needed. Would you mind updating this accordingly? Thanks! |
|
The major thing I'm concerned is that |
|
Test build #75448 has finished for PR 17336 at commit
|
jkbradley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for mentioning about transform(). I'd been thinking about the user calling associationRules directly. But you're right about it being better to cache for transform(). Sorry to ask this, but would you mind reverting back to the cached version? Thanks a lot.
| model2.freqItemsets.sort("items").collect()) | ||
| assert(model.freqItemsets.collect().toSet.equals( | ||
| model2.freqItemsets.collect().toSet)) | ||
| assert(model.associationRules.collect().toSet.equals( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add these 2 since they are values computed from the model data. Checking freqItemsets is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comment. I added the check since we added some internal cache fields, I'd like to ensure it does not interfere with the model loading. Let me know it is still redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Let's keep it
| FPGrowthSuite.allParamSettings, checkModelData) | ||
| } | ||
|
|
||
| test("FPGrowth prediction should not contain duplicates") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the future, I'd prefer not to move stuff around unless it's necessary since it makes the diff larger. No need to revert this, though, since I already checked it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
|
Test build #75517 has finished for PR 17336 at commit
|
|
Thanks a lot for the second update! This LGTM |
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-20003
I was doing some test and found the issue. ml.fpm.FPGrowthModel
setMinConfidenceshould always affect rules generation and transform.Currently associationRules in FPGrowthModel is a lazy val and
setMinConfidencein FPGrowthModel has no impact once associationRules got computed .I try to cache the associationRules to avoid re-computation if
minConfidenceis not changed, but this makes FPGrowthModel somehow stateful. Let me know if there's any concern.How was this patch tested?
new unit test and I strength the unit test for model save/load to ensure the cache mechanism.