-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-10697][ML] Add lift to Association rules #22236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #95263 has finished for PR 22236 at commit
|
| // Exclude rules for 2.4.x | ||
| lazy val v24excludes = v23excludes ++ Seq( | ||
| // [SPARK-10697][ML] Add lift to Association rules | ||
| ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.fpm.AssociationRules#Rule.this"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note for reviewers and myself: this method is private (private[fpm])
| private val itemSupport: Map[Any, Long]) | ||
| extends Model[FPGrowthModel] with FPGrowthParams with MLWritable { | ||
|
|
||
| private[ml] def this(uid: String, freqItemsets: DataFrame) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's private, you could also just make the new parameter an Option or something rather than recreate the existing constructor separately.
| instance.freqItemsets.write.parquet(dataPath) | ||
| val itemDataType = instance.freqItemsets.schema(instance.getItemsCol).dataType match { | ||
| case ArrayType(et, _) => et | ||
| case other => other // we should never get here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw an exception then?
| val itemSupportPath = new Path(path, "itemSupport") | ||
| val fs = FileSystem.get(sc.hadoopConfiguration) | ||
| val itemSupport = if (fs.exists(itemSupportPath)) { | ||
| sparkSession.read.parquet(itemSupportPath.toString).rdd.collect().map { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about collectAsMap here?
| freqCol: String, | ||
| minConfidence: Double): DataFrame = { | ||
| minConfidence: Double, | ||
| itemSupport: Map[Any, Long]): DataFrame = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a Map[T, Long]? it's required to be anyway below. It's worth adding to the scaladoc too.
| /** | ||
| * Computes the association rules with confidence above `minConfidence`. | ||
| * @param freqItemsets frequent itemset model obtained from [[FPGrowth]] | ||
| * @return a `Set[Rule[Item]]` containing the association rules. The rules will be able to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't agree with the signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, thanks, I'll correct also the original one I took this from.
| freqUnion: Double, | ||
| freqAntecedent: Double) extends Serializable { | ||
| freqAntecedent: Double, | ||
| freqConsequent: Option[Long]) extends Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we now always know the frequency of the consequent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now we do, but for existing application using old methods it may not be available.
| */ | ||
| @Since("2.4.0") | ||
| def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]], | ||
| itemSupport: Map[Item, Long]): RDD[Rule[Item]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand this correctly, and I may not, FPGrowthModel just holds frequent item sets. It's only association rules where the lift computation is needed. In the course of computing association rules, you can compute item support here. Why does it need to be saved with the model? I can see it might be an optimization but also introduces complexity (and compatibility issues?) here. It may be pretty fast to compute right here though. You already end up with (..., (consequent, count)) in candidates, from which you can get the total consequent counts directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we can compute it by filtering the freqItemsets and get the items with length one. The reason why I haven't done that is to avoid performance regression. Since we already computed this before, it seems a unneeded waste to recompute it here.
I can use this approach of computing them when loading the model, though, I agree. Also in that case I haven't done for optimization reasons (since we would need to read 2 times the freqItemsets dataset which is surely much larger).
If you think this is needed, I can change it, but for performance reasons I prefer the current approach, in order not to affect existing users not interested in the lift metric. I don't think any compatibility issue can arise as if the value is not preset, null is returned for the lift metric.
|
Test build #95265 has finished for PR 22236 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the point about saving the result of the computation, but if it has to rely on these counts being saved in the model, then old models can't produce association rules that include lift, when it could still be computed from the same data. It also means storing the additional data, and the complexity of reading it. It just seems like it would be more straightforward to (re)compute it, unless it's terribly expensive, but my hunch is it isn't significant compared to the other work the algorithm does.
| instance.freqItemsets.write.parquet(dataPath) | ||
| val itemDataType = instance.freqItemsets.schema(instance.getItemsCol).dataType match { | ||
| case ArrayType(et, _) => et | ||
| case other => throw new RuntimeException(s"Expected ${ArrayType.simpleString}, but got " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I slightly prefer subclasses like IllegalArgumentException or IllegalStateException, but it's just a matter of taste. You can interpolate the second argument and probably get it on one line if you break before the message starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I'll do, thanks.
|
@srowen then what about recomputing when reading saved models? This seems a good compromise to me as it saves the writing of the data, it allows having lift for old models, but it doesn't introduce any perf regression when creating a model (anyway, we would need to save the count from the original model, as we cannot otherwise recompute the support correctly). Of course the "terribly expensive" quite depends on how much data there is etc.etc. Anyway, it is a pass on the |
|
Yeah, I like that idea. Just compute it on initializing the model. |
|
Test build #95287 has finished for PR 22236 at commit
|
|
Test build #95294 has finished for PR 22236 at commit
|
|
Test build #95314 has finished for PR 22236 at commit
|
|
thanks for pointing that out @hhbyyh . I wasn't aware of it. I think lift and confidence are more useful metrics than the support (the votes on the JIRAs seem also to agree with me on this). Anyway, after this PR, adding support too would be trivial. I checked your PR @hhbyyh and if you don't mind I'll take the naming from there as I prefer it over the one I used here. :) Thanks. |
|
@holdenk @srowen anyway I realized that computing the lift for previously saved models is not feasible as we don't know the number of training records, so we cannot compute the items support. I'll keep the re-computation at read time in order to avoid saving the |
|
Test build #95341 has finished for PR 22236 at commit
|
|
Take it and good luck. |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah OK looks like it's on the right track. Adding @jkbradley who was looking at #17280 which I also missed. If this goes in, I personally think the other PR could be updated to build on this and add support too. It's trivial as you say. Thanks @hhbyyh
| } | ||
|
|
||
| copyValues(new FPGrowthModel(uid, frequentItems)).setParent(this) | ||
| copyValues(new FPGrowthModel(uid, frequentItems, parentModel.itemSupport, inputRowCount)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so the total count n won't just equal the sum of the count of consequents, for example, because the frequent item set was pruned of infrequent sets? Darn, yeah you need to deal with the case where you know n and where you don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, there is no way to know the total size of the training dataset from the frequent itemsets unfortunately... So yes, we need to deal with it unfortunately.
| @Since("2.2.0") override val uid: String, | ||
| @Since("2.2.0") @transient val freqItemsets: DataFrame) | ||
| @Since("2.2.0") @transient val freqItemsets: DataFrame, | ||
| private val itemSupport: scala.collection.Map[Any, Double], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose there's no way of adding the item generic type here; it's really in the schema of the DataFrame. Does Map[_, Double] also work? I don't think it needs a change, just a side question.
If you have the support for every item, do you need the overall count here as well? the item counts have already been divided through by numTrainingRecords here. Below only itemSupport is really passed somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Map[_, Double] seems to work too. I can change it if you prefer.
The numTrainingRecords is used because it needs to be stored (and it then used when loading the model in order to recompute the itemSupport).
| @Since("1.5.0") val consequent: Array[Item], | ||
| freqUnion: Double, | ||
| freqAntecedent: Double) extends Serializable { | ||
| freqAntecedent: Double, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally these frequencies would have been Longs I think, but too late. Yes, stay consistent.
| * | ||
| */ | ||
| @Since("1.5.0") | ||
| def confidence: Double = freqUnion.toDouble / freqAntecedent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this .toDouble is redundant actually!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is not related to this PR but I can remove it.
| // Exclude rules for 2.4.x | ||
| lazy val v24excludes = v23excludes ++ Seq( | ||
| // [SPARK-10697][ML] Add lift to Association rules | ||
| ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.fpm.FPGrowthModel.this"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are for the private[ml] constructors right? OK to suppress, yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, they are the private ones.
|
Test build #95407 has finished for PR 22236 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good to me. Ideally an extra look from @mengxr or @MLnick or @jkbradley would be helpful here as it's a non-trivial addition, but think it's executed well.
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, one last thought. The FPM implementation in Python and R needs an update too. I think they only expose confidence, etc in the DataFrame API, so this is really a doc change, to mention that the returned DataFrame also contains lift now.
|
yes, thanks, forgot to update in the Scala API too, so I updated all them accordingly. Thanks. |
|
Test build #95548 has finished for PR 22236 at commit
|
|
Merged to master |
What changes were proposed in this pull request?
The PR adds the lift measure to Association rules.
How was this patch tested?
existing and modified UTs