[SPARK-10697][ML] Add lift to Association rules #22236

mgaido91 · 2018-08-26T13:41:49Z

What changes were proposed in this pull request?

The PR adds the lift measure to Association rules.

How was this patch tested?

existing and modified UTs

SparkQA · 2018-08-26T13:56:50Z

Test build #95263 has finished for PR 22236 at commit 92c32dd.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class FPGrowthModel[Item: ClassTag] @Since(\"2.4.0\") (

mgaido91 · 2018-08-26T14:49:38Z

project/MimaExcludes.scala

  // Exclude rules for 2.4.x
  lazy val v24excludes = v23excludes ++ Seq(
+    // [SPARK-10697][ML] Add lift to Association rules
+    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.fpm.AssociationRules#Rule.this"),


note for reviewers and myself: this method is private (private[fpm])

mgaido91 · 2018-08-26T16:12:41Z

cc @holdenk @srowen @yanboliang

srowen · 2018-08-26T17:30:29Z

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

+    private val itemSupport: Map[Any, Long])
  extends Model[FPGrowthModel] with FPGrowthParams with MLWritable {

+  private[ml] def this(uid: String, freqItemsets: DataFrame) =


Because it's private, you could also just make the new parameter an Option or something rather than recreate the existing constructor separately.

srowen · 2018-08-26T17:30:53Z

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

      instance.freqItemsets.write.parquet(dataPath)
+      val itemDataType = instance.freqItemsets.schema(instance.getItemsCol).dataType match {
+        case ArrayType(et, _) => et
+        case other => other // we should never get here


Throw an exception then?

srowen · 2018-08-26T17:31:26Z

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

+      val itemSupportPath = new Path(path, "itemSupport")
+      val fs = FileSystem.get(sc.hadoopConfiguration)
+      val itemSupport = if (fs.exists(itemSupportPath)) {
+        sparkSession.read.parquet(itemSupportPath.toString).rdd.collect().map {


How about collectAsMap here?

srowen · 2018-08-26T17:32:04Z

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

        freqCol: String,
-        minConfidence: Double): DataFrame = {
+        minConfidence: Double,
+        itemSupport: Map[Any, Long]): DataFrame = {


Shouldn't this be a Map[T, Long]? it's required to be anyway below. It's worth adding to the scaladoc too.

srowen · 2018-08-26T17:33:09Z

mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala

+  /**
+   * Computes the association rules with confidence above `minConfidence`.
+   * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
+   * @return a `Set[Rule[Item]]` containing the association rules. The rules will be able to


This doesn't agree with the signature

nice catch, thanks, I'll correct also the original one I took this from.

srowen · 2018-08-26T17:48:07Z

mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala

      freqUnion: Double,
-      freqAntecedent: Double) extends Serializable {
+      freqAntecedent: Double,
+      freqConsequent: Option[Long]) extends Serializable {


Wouldn't we now always know the frequency of the consequent?

now we do, but for existing application using old methods it may not be available.

srowen · 2018-08-26T17:50:33Z

mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala

+   */
+  @Since("2.4.0")
+  def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]],
+      itemSupport: Map[Item, Long]): RDD[Rule[Item]] = {


So if I understand this correctly, and I may not, FPGrowthModel just holds frequent item sets. It's only association rules where the lift computation is needed. In the course of computing association rules, you can compute item support here. Why does it need to be saved with the model? I can see it might be an optimization but also introduces complexity (and compatibility issues?) here. It may be pretty fast to compute right here though. You already end up with (..., (consequent, count)) in candidates, from which you can get the total consequent counts directly.

Actually we can compute it by filtering the freqItemsets and get the items with length one. The reason why I haven't done that is to avoid performance regression. Since we already computed this before, it seems a unneeded waste to recompute it here.

I can use this approach of computing them when loading the model, though, I agree. Also in that case I haven't done for optimization reasons (since we would need to read 2 times the freqItemsets dataset which is surely much larger).

If you think this is needed, I can change it, but for performance reasons I prefer the current approach, in order not to affect existing users not interested in the lift metric. I don't think any compatibility issue can arise as if the value is not preset, null is returned for the lift metric.

SparkQA · 2018-08-26T18:58:39Z

Test build #95265 has finished for PR 22236 at commit 7f052b8.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

I get the point about saving the result of the computation, but if it has to rely on these counts being saved in the model, then old models can't produce association rules that include lift, when it could still be computed from the same data. It also means storing the additional data, and the complexity of reading it. It just seems like it would be more straightforward to (re)compute it, unless it's terribly expensive, but my hunch is it isn't significant compared to the other work the algorithm does.

srowen · 2018-08-27T14:06:11Z

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

      instance.freqItemsets.write.parquet(dataPath)
+      val itemDataType = instance.freqItemsets.schema(instance.getItemsCol).dataType match {
+        case ArrayType(et, _) => et
+        case other => throw new RuntimeException(s"Expected ${ArrayType.simpleString}, but got " +


I slightly prefer subclasses like IllegalArgumentException or IllegalStateException, but it's just a matter of taste. You can interpolate the second argument and probably get it on one line if you break before the message starts.

sure, I'll do, thanks.

mgaido91 · 2018-08-27T14:32:41Z

@srowen then what about recomputing when reading saved models? This seems a good compromise to me as it saves the writing of the data, it allows having lift for old models, but it doesn't introduce any perf regression when creating a model (anyway, we would need to save the count from the original model, as we cannot otherwise recompute the support correctly).

Of course the "terribly expensive" quite depends on how much data there is etc.etc. Anyway, it is a pass on the freqItemset RDD. As it is not cached, it means we have to generate all the possible itemsets and perform an aggregation on them. Maybe that is not the most expensive part of the algorithm, but saving it seems worth to me.

srowen · 2018-08-27T16:27:33Z

Yeah, I like that idea. Just compute it on initializing the model.

SparkQA · 2018-08-27T19:08:03Z

Test build #95287 has finished for PR 22236 at commit 4c8b7be.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-27T21:12:57Z

Test build #95294 has finished for PR 22236 at commit 957a6a2.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-28T02:32:14Z

Test build #95314 has finished for PR 22236 at commit 88eb571.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2018-08-28T05:26:38Z

just FYI about another related PR: #17280
and maybe I should close it? @srowen

mgaido91 · 2018-08-28T08:48:26Z

thanks for pointing that out @hhbyyh . I wasn't aware of it. I think lift and confidence are more useful metrics than the support (the votes on the JIRAs seem also to agree with me on this). Anyway, after this PR, adding support too would be trivial.

I checked your PR @hhbyyh and if you don't mind I'll take the naming from there as I prefer it over the one I used here. :) Thanks.

mgaido91 · 2018-08-28T08:53:16Z

@holdenk @srowen anyway I realized that computing the lift for previously saved models is not feasible as we don't know the number of training records, so we cannot compute the items support.

I'll keep the re-computation at read time in order to avoid saving the itemSupport map, but for previous models the lift is going to be null as we cannot compute it. Thanks.

SparkQA · 2018-08-28T13:27:56Z

Test build #95341 has finished for PR 22236 at commit 44a0021.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2018-08-28T17:27:51Z

Take it and good luck.

srowen

Yeah OK looks like it's on the right track. Adding @jkbradley who was looking at #17280 which I also missed. If this goes in, I personally think the other PR could be updated to build on this and add support too. It's trivial as you say. Thanks @hhbyyh

srowen · 2018-08-29T01:32:20Z

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

    }

-    copyValues(new FPGrowthModel(uid, frequentItems)).setParent(this)
+    copyValues(new FPGrowthModel(uid, frequentItems, parentModel.itemSupport, inputRowCount))


Ah, so the total count n won't just equal the sum of the count of consequents, for example, because the frequent item set was pruned of infrequent sets? Darn, yeah you need to deal with the case where you know n and where you don't.

Well, there is no way to know the total size of the training dataset from the frequent itemsets unfortunately... So yes, we need to deal with it unfortunately.

srowen · 2018-08-29T01:52:54Z

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

    @Since("2.2.0") override val uid: String,
-    @Since("2.2.0") @transient val freqItemsets: DataFrame)
+    @Since("2.2.0") @transient val freqItemsets: DataFrame,
+    private val itemSupport: scala.collection.Map[Any, Double],


I suppose there's no way of adding the item generic type here; it's really in the schema of the DataFrame. Does Map[_, Double] also work? I don't think it needs a change, just a side question.

If you have the support for every item, do you need the overall count here as well? the item counts have already been divided through by numTrainingRecords here. Below only itemSupport is really passed somewhere else.

Yes, Map[_, Double] seems to work too. I can change it if you prefer.

The numTrainingRecords is used because it needs to be stored (and it then used when loading the model in order to recompute the itemSupport).

srowen · 2018-08-29T01:53:55Z

mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala

      @Since("1.5.0") val consequent: Array[Item],
      freqUnion: Double,
-      freqAntecedent: Double) extends Serializable {
+      freqAntecedent: Double,


Ideally these frequencies would have been Longs I think, but too late. Yes, stay consistent.

srowen · 2018-08-29T01:54:10Z

mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala

     *
     */
    @Since("1.5.0")
    def confidence: Double = freqUnion.toDouble / freqAntecedent


I suppose this .toDouble is redundant actually!

yes, it is not related to this PR but I can remove it.

srowen · 2018-08-29T01:55:19Z

project/MimaExcludes.scala

  // Exclude rules for 2.4.x
  lazy val v24excludes = v23excludes ++ Seq(
+    // [SPARK-10697][ML] Add lift to Association rules
+    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.fpm.FPGrowthModel.this"),


These are for the private[ml] constructors right? OK to suppress, yes

yes, they are the private ones.

SparkQA · 2018-08-29T12:28:35Z

Test build #95407 has finished for PR 22236 at commit 706303f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Looking good to me. Ideally an extra look from @mengxr or @MLnick or @jkbradley would be helpful here as it's a non-trivial addition, but think it's executed well.

srowen

Oh, one last thought. The FPM implementation in Python and R needs an update too. I think they only expose confidence, etc in the DataFrame API, so this is really a doc change, to mention that the returned DataFrame also contains lift now.

mgaido91 · 2018-08-31T14:25:00Z

yes, thanks, forgot to update in the Scala API too, so I updated all them accordingly. Thanks.

SparkQA · 2018-08-31T19:00:00Z

Test build #95548 has finished for PR 22236 at commit 2407e05.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2018-09-01T23:08:21Z

Merged to master

[SPARK-10697][ML] Add lift to Association rules

92c32dd

fix mima

7f052b8

mgaido91 commented Aug 26, 2018

View reviewed changes

srowen requested changes Aug 26, 2018

View reviewed changes

fix + address comments

4c8b7be

srowen reviewed Aug 27, 2018

View reviewed changes

address comments

5970876

compute itemSupport instead of saving them

957a6a2

fix R failure

88eb571

improve naming + old version read support

44a0021

srowen requested changes Aug 29, 2018

View reviewed changes

mgaido91 added 2 commits August 29, 2018 09:41

Merge branch 'master' of github.com:apache/spark into SPARK-10697

1b4e3b3

address comments

706303f

srowen requested changes Aug 29, 2018

View reviewed changes

srowen reviewed Aug 31, 2018

View reviewed changes

update docs in scala, python, R

2407e05

srowen approved these changes Sep 1, 2018

View reviewed changes

asfgit closed this in a3dccd2 Sep 1, 2018

nchammas mentioned this pull request Nov 15, 2021

[SPARK-37335][ML] Flesh out FPGrowth docs #34605

Closed

[SPARK-10697][ML] Add lift to Association rules #22236

[SPARK-10697][ML] Add lift to Association rules #22236

Uh oh!

Conversation

mgaido91 commented Aug 26, 2018

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Aug 26, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgaido91 commented Aug 26, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Aug 26, 2018

Uh oh!

srowen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgaido91 commented Aug 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srowen commented Aug 27, 2018

Uh oh!

SparkQA commented Aug 27, 2018

Uh oh!

SparkQA commented Aug 27, 2018

Uh oh!

SparkQA commented Aug 28, 2018

Uh oh!

hhbyyh commented Aug 28, 2018

Uh oh!

mgaido91 commented Aug 28, 2018

Uh oh!

mgaido91 commented Aug 28, 2018

Uh oh!

SparkQA commented Aug 28, 2018

Uh oh!

hhbyyh commented Aug 28, 2018

Uh oh!

srowen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

mgaido91 commented Aug 27, 2018 •

edited

Loading

mgaido91 commented Aug 31, 2018 •

edited

Loading