[SPARK-20003] [ML] FPGrowthModel setMinConfidence should affect rules generation and transform #17336

hhbyyh · 2017-03-17T18:46:14Z

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20003
I was running some tests and found this issue: setMinConfidence on ml.fpm.FPGrowthModel should always affect rule generation and transform.
Currently, associationRules in FPGrowthModel is a lazy val, so setMinConfidence has no effect once associationRules has been computed.
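
A minimal sketch of the pitfall described above, in plain Scala without Spark. `Rule`, `LazyRulesModel`, and `LazyValPitfall` are hypothetical stand-ins for illustration, not the real FPGrowthModel code:

```scala
// Hypothetical stand-in for illustration; this is not the real FPGrowthModel.
final case class Rule(antecedent: Set[String], consequent: String, confidence: Double)

class LazyRulesModel(candidates: Seq[Rule]) {
  private var minConfidence: Double = 0.8

  def setMinConfidence(value: Double): this.type = {
    minConfidence = value
    this
  }

  // Evaluated once, on first access; later threshold changes are never seen.
  lazy val associationRules: Seq[Rule] =
    candidates.filter(_.confidence >= minConfidence)
}

object LazyValPitfall {
  private val candidates = Seq(
    Rule(Set("1"), "2", 0.9),
    Rule(Set("2"), "3", 0.5)
  )

  // Returns the rule counts seen before and after lowering the threshold.
  def ruleCountsBeforeAndAfter(): (Int, Int) = {
    val model = new LazyRulesModel(candidates)
    val before = model.associationRules.size // computed with minConfidence = 0.8
    model.setMinConfidence(0.1)
    val after = model.associationRules.size  // stale: still the 0.8 result
    (before, after)
  }
}
```

In this sketch both counts are 1, even though a threshold of 0.1 should admit both candidate rules.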

I tried caching associationRules to avoid recomputation when minConfidence has not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there are any concerns.
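
One way such a cache could look, sketched in plain Scala under the assumption that the cached result is keyed on the threshold it was computed with. `ScoredRule`, `CachedRulesModel`, and the `computations` counter are hypothetical names for illustration and are not taken from the patch:

```scala
// Hypothetical stand-in for illustration; this is not the code from the patch.
final case class ScoredRule(antecedent: Set[String], consequent: String, confidence: Double)

class CachedRulesModel(candidates: Seq[ScoredRule]) {
  private var minConfidence: Double = 0.8
  // Threshold and result of the most recent computation, if any.
  private var cache: Option[(Double, Seq[ScoredRule])] = None
  // Exposed only so the demo can count recomputations.
  var computations: Int = 0

  def setMinConfidence(value: Double): this.type = {
    minConfidence = value
    this
  }

  // Reuses the cached rules while the threshold is unchanged, else recomputes.
  def associationRules: Seq[ScoredRule] = cache match {
    case Some((conf, rules)) if conf == minConfidence => rules
    case _ =>
      computations += 1
      val rules = candidates.filter(_.confidence >= minConfidence)
      cache = Some((minConfidence, rules))
      rules
  }
}

object CachedRulesDemo {
  // Returns (first count, repeated count, count after change, computations).
  def run(): (Int, Int, Int, Int) = {
    val model = new CachedRulesModel(Seq(
      ScoredRule(Set("1"), "2", 0.9),
      ScoredRule(Set("2"), "3", 0.5)
    ))
    val first = model.associationRules.size  // computed with 0.8
    val again = model.associationRules.size  // served from the cache
    model.setMinConfidence(0.1)
    val second = model.associationRules.size // threshold changed, recomputed
    (first, again, second, model.computations)
  }
}
```

The mutable cache field is the statefulness mentioned above: the model now carries the last computed result alongside its parameters.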

How was this patch tested?

Added a new unit test, and strengthened the model save/load unit test to cover the cache mechanism.

hhbyyh · 2017-03-17T18:47:18Z

Ping @jkbradley and @srowen so they are aware of the issue. Also cc @MLnick, who is working on the Python API.

hhbyyh · 2017-03-17T18:49:08Z

mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala

-    ).first().getAs[Seq[String]]("prediction")
-
-    assert(prediction === Seq("3"))
-  }


I didn't change this test; I just moved it so that the parameter and save/load checks stay at the bottom.

SparkQA · 2017-03-17T19:41:08Z

Test build #74752 has finished for PR 17336 at commit 3398d62.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-21T20:04:43Z

Test build #74995 has finished for PR 17336 at commit 9c046c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2017-03-30T18:32:32Z

ping @jkbradley as this is something we should fix before release.

jkbradley · 2017-03-31T23:25:23Z

Thanks for this PR! Do you think it's worth adding the caching logic? I'm now wondering if we should change associationRules into a method which recomputes the DataFrame every time it is called. That will make it easier for the user to understand the semantics, and it would be easy for a user to define a val to hold onto the computed DataFrame if needed.

Would you mind updating this accordingly? Thanks!

hhbyyh · 2017-03-31T23:31:13Z

My main concern is that transform will have to recompute the association rules each time it is invoked. If that's not a problem, changing associationRules to a method would be much simpler.
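
A rough sketch of that cost, in plain Scala without Spark. `ConfRule`, `RecomputingModel`, and the simplified `transform` below are hypothetical and only approximate the idea of deriving predictions from rules:

```scala
// Hypothetical stand-in for illustration; this is not the real FPGrowthModel.
final case class ConfRule(antecedent: Set[String], consequent: String, confidence: Double)

class RecomputingModel(candidates: Seq[ConfRule], minConfidence: Double) {
  // Exposed only so the demo can count how often the rules are rebuilt.
  var ruleComputations: Int = 0

  // A method rather than a val: every call filters the candidates again.
  def associationRules: Seq[ConfRule] = {
    ruleComputations += 1
    candidates.filter(_.confidence >= minConfidence)
  }

  // Simplified prediction: consequents of rules whose antecedent is contained
  // in the basket, minus items the basket already holds.
  def transform(basket: Set[String]): Set[String] = {
    val matched = associationRules.filter(_.antecedent.subsetOf(basket))
    val consequents: Set[String] = matched.map(_.consequent).toSet
    consequents -- basket
  }
}

object RecomputeDemo {
  // Returns both predictions and the number of rule computations triggered.
  def run(): (Set[String], Set[String], Int) = {
    val model = new RecomputingModel(Seq(ConfRule(Set("1", "2"), "3", 0.9)), 0.8)
    val first = model.transform(Set("1", "2"))
    val second = model.transform(Set("1", "2", "3"))
    (first, second, model.ruleComputations)
  }
}
```

Two calls to `transform` trigger two rule computations here, which is the overhead that a cache would avoid.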

SparkQA · 2017-04-01T21:53:38Z

Test build #75448 has finished for PR 17336 at commit a95a07a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

Thanks for mentioning transform(). I'd been thinking about the user calling associationRules directly, but you're right that it is better to cache for transform(). Sorry to ask this, but would you mind reverting to the cached version? Thanks a lot.

jkbradley · 2017-04-03T23:29:18Z

mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala

-        model2.freqItemsets.sort("items").collect())
+      assert(model.freqItemsets.collect().toSet.equals(
+        model2.freqItemsets.collect().toSet))
+      assert(model.associationRules.collect().toSet.equals(


No need to add these 2 since they are values computed from the model data. Checking freqItemsets is sufficient.

Thanks for the comment. I added the check because we added some internal cache fields, and I'd like to ensure they do not interfere with model loading. Let me know if it is still redundant.

Fair enough. Let's keep it
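
For context, the diff above compares collected rows as sets instead of sorting them first. A small plain-Scala sketch of that comparison style, where `ItemsetRow` and `RowComparison` are hypothetical stand-ins for collected rows rather than Spark types:

```scala
// Hypothetical row type standing in for collected freqItemsets rows.
final case class ItemsetRow(items: Seq[String], freq: Long)

object RowComparison {
  // Order-insensitive comparison. Converting to Set also collapses duplicate
  // rows, so this check does not detect differing multiplicities.
  def sameRows(a: Seq[ItemsetRow], b: Seq[ItemsetRow]): Boolean = {
    val left: Set[ItemsetRow] = a.toSet
    val right: Set[ItemsetRow] = b.toSet
    left == right
  }

  val original: Seq[ItemsetRow] =
    Seq(ItemsetRow(Seq("1"), 3L), ItemsetRow(Seq("1", "2"), 2L))

  // Same rows in a different order, as might come back from a reloaded model.
  val reloaded: Seq[ItemsetRow] = original.reverse
}
```

The set comparison passes regardless of row order, whereas a direct sequence comparison of `original` and `reloaded` would fail.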

jkbradley · 2017-04-03T23:30:11Z

mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala

      FPGrowthSuite.allParamSettings, checkModelData)
  }
-
-  test("FPGrowth prediction should not contain duplicates") {


For the future, I'd prefer not to move stuff around unless it's necessary since it makes the diff larger. No need to revert this, though, since I already checked it.

SparkQA · 2017-04-04T19:57:46Z

Test build #75517 has finished for PR 17336 at commit 81bce96.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-04-05T00:50:05Z

Thanks a lot for the second update! This LGTM
Merging with master

set fpgrwothmodel minconf

3398d62

YY-OnCall added 2 commits March 21, 2017 11:55

resolve conflict

f761ffd

adapt to itemsCol

9c046c3

YY-OnCall added 2 commits April 1, 2017 09:52

Merge remote-tracking branch 'upstream/master' into fpmodelminconf

d81fb2f

remove cache

a95a07a

jkbradley reviewed Apr 3, 2017

YY-OnCall added 2 commits April 4, 2017 11:08

Merge remote-tracking branch 'upstream/master' into fpmodelminconf

5ef84f1

add cache and comment

81bce96

asfgit closed this in b28bbff Apr 5, 2017


4 participants
