@hhbyyh hhbyyh commented Mar 17, 2017

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20003
I was doing some testing and found this issue: setMinConfidence in ml.fpm.FPGrowthModel should always affect rule generation and transform.
Currently, associationRules in FPGrowthModel is a lazy val, so setMinConfidence has no effect once associationRules has been computed.
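
To illustrate the problem, here is a minimal sketch in plain Scala without Spark. `LazyModel` and `LazyUsage` are hypothetical stand-ins (not the real FPGrowthModel), rules are reduced to their confidence values, and the 0.8 default is an assumption.

```scala
// Hypothetical stand-in for the model; each rule is represented only by its confidence.
class LazyModel(confidences: Seq[Double]) {
  private var minConfidence: Double = 0.8 // assumed default for this sketch

  def setMinConfidence(value: Double): this.type = { minConfidence = value; this }

  // A lazy val is evaluated once, on first access.
  // Later changes to minConfidence are never picked up.
  lazy val associationRules: Seq[Double] = confidences.filter(_ >= minConfidence)
}

object LazyUsage {
  def run(): (Seq[Double], Seq[Double]) = {
    val model = new LazyModel(Seq(0.9, 0.5))
    val before = model.associationRules // computed with threshold 0.8
    model.setMinConfidence(0.4)         // too late: the lazy val is already fixed
    val after = model.associationRules  // still the result for threshold 0.8
    (before, after)
  }
}
```

Both values returned by `LazyUsage.run()` contain only the 0.9 rule, even though the threshold was lowered to 0.4 in between.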

I cache the associationRules to avoid recomputation when minConfidence has not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there are any concerns.
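
The caching idea can be sketched roughly as follows, again in plain Scala without Spark. `CachingModel`, `computeCount`, and the field names are hypothetical and only illustrate the approach; the patch itself may differ in detail.

```scala
// Hypothetical stand-in for the model; each rule is represented only by its confidence.
class CachingModel(confidences: Seq[Double]) {
  private var minConfidence: Double = 0.8 // assumed default for this sketch
  var computeCount: Int = 0               // only here to make recomputation observable

  def setMinConfidence(value: Double): this.type = { minConfidence = value; this }

  // The cache stores the threshold it was computed with, next to the result.
  // NaN never equals any value, so the first access always computes.
  private var cachedMinConf: Double = Double.NaN
  private var cachedRules: Seq[Double] = Seq.empty

  def associationRules: Seq[Double] = {
    if (cachedMinConf != minConfidence) {
      computeCount += 1
      cachedRules = confidences.filter(_ >= minConfidence)
      cachedMinConf = minConfidence
    }
    cachedRules
  }
}
```

With this shape, repeated calls at the same minConfidence reuse the cached result, and a call after setMinConfidence with a new value triggers exactly one recomputation. The mutable cache fields are what makes the model stateful.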

How was this patch tested?

Added a new unit test, and strengthened the unit test for model save/load to verify the cache mechanism.

hhbyyh commented Mar 17, 2017

Pinging @jkbradley and @srowen to make them aware of the issue. Also cc @MLnick, who's working on the Python API.

).first().getAs[Seq[String]]("prediction")

assert(prediction === Seq("3"))
}
Contributor Author

I didn't change this test; I just moved it so that the parameter and save/load checks stay at the bottom.

SparkQA commented Mar 17, 2017

Test build #74752 has finished for PR 17336 at commit 3398d62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 21, 2017

Test build #74995 has finished for PR 17336 at commit 9c046c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hhbyyh commented Mar 30, 2017

Ping @jkbradley, as this is something we should fix before the release.

@jkbradley
Member

Thanks for this PR! Do you think it's worth adding the caching logic? I'm now wondering if we should change associationRules into a method which recomputes the DataFrame every time it is called. That will make it easier for the user to understand the semantics, and it would be easy for a user to define a val to hold onto the computed DataFrame if needed.

Would you mind updating this accordingly? Thanks!
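
As a rough sketch of that alternative, in plain Scala without Spark: `RecomputingModel` and `RecomputingUsage` are hypothetical stand-ins, rules are reduced to their confidence values, and the 0.8 default is an assumption.

```scala
// Hypothetical stand-in for the model; each rule is represented only by its confidence.
class RecomputingModel(confidences: Seq[Double]) {
  private var minConfidence: Double = 0.8 // assumed default for this sketch
  var computeCount: Int = 0               // only here to make recomputation observable

  def setMinConfidence(value: Double): this.type = { minConfidence = value; this }

  // A def is evaluated on every call, so it always reflects the current threshold.
  def associationRules: Seq[Double] = {
    computeCount += 1
    confidences.filter(_ >= minConfidence)
  }
}

object RecomputingUsage {
  def run(): (Seq[Double], Seq[Double], Int) = {
    val model = new RecomputingModel(Seq(0.9, 0.5))
    val held = model.associationRules  // the caller keeps this result in its own val
    model.setMinConfidence(0.4)
    val fresh = model.associationRules // recomputed with the new threshold
    (held, fresh, model.computeCount)
  }
}
```

The semantics are simple: every access recomputes, and a caller who wants to reuse a result holds it in a val. The cost is that any internal caller, such as a transform step, pays for recomputation on each invocation.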

hhbyyh commented Mar 31, 2017

My main concern is that transform would have to recompute the association rules each time it is invoked. If that's not a problem, changing associationRules to a method would be much simpler.

SparkQA commented Apr 1, 2017

Test build #75448 has finished for PR 17336 at commit a95a07a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@jkbradley jkbradley left a comment

Thanks for mentioning transform(). I'd been thinking about the user calling associationRules directly, but you're right that it is better to cache for transform(). Sorry to ask this, but would you mind reverting to the cached version? Thanks a lot.

model2.freqItemsets.sort("items").collect())
assert(model.freqItemsets.collect().toSet.equals(
model2.freqItemsets.collect().toSet))
assert(model.associationRules.collect().toSet.equals(
Member

No need to add these 2 since they are values computed from the model data. Checking freqItemsets is sufficient.

Contributor Author

@hhbyyh hhbyyh Apr 4, 2017

Thanks for the comment. I added the check because we introduced some internal cache fields, and I'd like to ensure they do not interfere with model loading. Let me know if it is still redundant.

Member

Fair enough. Let's keep it

FPGrowthSuite.allParamSettings, checkModelData)
}

test("FPGrowth prediction should not contain duplicates") {
Member

For the future, I'd prefer not to move stuff around unless it's necessary since it makes the diff larger. No need to revert this, though, since I already checked it.

Contributor Author

I see.

SparkQA commented Apr 4, 2017

Test build #75517 has finished for PR 17336 at commit 81bce96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks a lot for the second update! This LGTM
Merging with master

@asfgit asfgit closed this in b28bbff Apr 5, 2017