Skip to content

Conversation

@mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

The PR adds the lift measure to Association rules.

How was this patch tested?

existing and modified UTs

@SparkQA
Copy link

SparkQA commented Aug 26, 2018

Test build #95263 has finished for PR 22236 at commit 92c32dd.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthModel[Item: ClassTag] @Since(\"2.4.0\") (

// Exclude rules for 2.4.x
lazy val v24excludes = v23excludes ++ Seq(
// [SPARK-10697][ML] Add lift to Association rules
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.fpm.AssociationRules#Rule.this"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for reviewers and myself: this method is private (private[fpm])

@mgaido91
Copy link
Contributor Author

cc @holdenk @srowen @yanboliang

private val itemSupport: Map[Any, Long])
extends Model[FPGrowthModel] with FPGrowthParams with MLWritable {

private[ml] def this(uid: String, freqItemsets: DataFrame) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's private, you could also just make the new parameter an Option or something rather than recreate the existing constructor separately.

instance.freqItemsets.write.parquet(dataPath)
val itemDataType = instance.freqItemsets.schema(instance.getItemsCol).dataType match {
case ArrayType(et, _) => et
case other => other // we should never get here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw an exception then?

val itemSupportPath = new Path(path, "itemSupport")
val fs = FileSystem.get(sc.hadoopConfiguration)
val itemSupport = if (fs.exists(itemSupportPath)) {
sparkSession.read.parquet(itemSupportPath.toString).rdd.collect().map {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about collectAsMap here?

freqCol: String,
minConfidence: Double): DataFrame = {
minConfidence: Double,
itemSupport: Map[Any, Long]): DataFrame = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a Map[T, Long]? it's required to be anyway below. It's worth adding to the scaladoc too.

/**
* Computes the association rules with confidence above `minConfidence`.
* @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
* @return a `Set[Rule[Item]]` containing the association rules. The rules will be able to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't agree with the signature

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, thanks, I'll correct also the original one I took this from.

freqUnion: Double,
freqAntecedent: Double) extends Serializable {
freqAntecedent: Double,
freqConsequent: Option[Long]) extends Serializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we now always know the frequency of the consequent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we do, but for existing application using old methods it may not be available.

*/
@Since("2.4.0")
def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]],
itemSupport: Map[Item, Long]): RDD[Rule[Item]] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand this correctly, and I may not, FPGrowthModel just holds frequent item sets. It's only association rules where the lift computation is needed. In the course of computing association rules, you can compute item support here. Why does it need to be saved with the model? I can see it might be an optimization but also introduces complexity (and compatibility issues?) here. It may be pretty fast to compute right here though. You already end up with (..., (consequent, count)) in candidates, from which you can get the total consequent counts directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we can compute it by filtering the freqItemsets and get the items with length one. The reason why I haven't done that is to avoid performance regression. Since we already computed this before, it seems a unneeded waste to recompute it here.

I can use this approach of computing them when loading the model, though, I agree. Also in that case I haven't done for optimization reasons (since we would need to read 2 times the freqItemsets dataset which is surely much larger).

If you think this is needed, I can change it, but for performance reasons I prefer the current approach, in order not to affect existing users not interested in the lift metric. I don't think any compatibility issue can arise as if the value is not preset, null is returned for the lift metric.

@SparkQA
Copy link

SparkQA commented Aug 26, 2018

Test build #95265 has finished for PR 22236 at commit 7f052b8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the point about saving the result of the computation, but if it has to rely on these counts being saved in the model, then old models can't produce association rules that include lift, when it could still be computed from the same data. It also means storing the additional data, and the complexity of reading it. It just seems like it would be more straightforward to (re)compute it, unless it's terribly expensive, but my hunch is it isn't significant compared to the other work the algorithm does.

instance.freqItemsets.write.parquet(dataPath)
val itemDataType = instance.freqItemsets.schema(instance.getItemsCol).dataType match {
case ArrayType(et, _) => et
case other => throw new RuntimeException(s"Expected ${ArrayType.simpleString}, but got " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly prefer subclasses like IllegalArgumentException or IllegalStateException, but it's just a matter of taste. You can interpolate the second argument and probably get it on one line if you break before the message starts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I'll do, thanks.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 27, 2018

@srowen then what about recomputing when reading saved models? This seems a good compromise to me as it saves the writing of the data, it allows having lift for old models, but it doesn't introduce any perf regression when creating a model (anyway, we would need to save the count from the original model, as we cannot otherwise recompute the support correctly).

Of course the "terribly expensive" quite depends on how much data there is etc.etc. Anyway, it is a pass on the freqItemset RDD. As it is not cached, it means we have to generate all the possible itemsets and perform an aggregation on them. Maybe that is not the most expensive part of the algorithm, but saving it seems worth to me.

@srowen
Copy link
Member

srowen commented Aug 27, 2018

Yeah, I like that idea. Just compute it on initializing the model.

@SparkQA
Copy link

SparkQA commented Aug 27, 2018

Test build #95287 has finished for PR 22236 at commit 4c8b7be.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 27, 2018

Test build #95294 has finished for PR 22236 at commit 957a6a2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 28, 2018

Test build #95314 has finished for PR 22236 at commit 88eb571.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor

hhbyyh commented Aug 28, 2018

just FYI about another related PR: #17280
and maybe I should close it? @srowen

@mgaido91
Copy link
Contributor Author

thanks for pointing that out @hhbyyh . I wasn't aware of it. I think lift and confidence are more useful metrics than the support (the votes on the JIRAs seem also to agree with me on this). Anyway, after this PR, adding support too would be trivial.

I checked your PR @hhbyyh and if you don't mind I'll take the naming from there as I prefer it over the one I used here. :) Thanks.

@mgaido91
Copy link
Contributor Author

@holdenk @srowen anyway I realized that computing the lift for previously saved models is not feasible as we don't know the number of training records, so we cannot compute the items support.

I'll keep the re-computation at read time in order to avoid saving the itemSupport map, but for previous models the lift is going to be null as we cannot compute it. Thanks.

@SparkQA
Copy link

SparkQA commented Aug 28, 2018

Test build #95341 has finished for PR 22236 at commit 44a0021.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor

hhbyyh commented Aug 28, 2018

Take it and good luck.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah OK looks like it's on the right track. Adding @jkbradley who was looking at #17280 which I also missed. If this goes in, I personally think the other PR could be updated to build on this and add support too. It's trivial as you say. Thanks @hhbyyh

}

copyValues(new FPGrowthModel(uid, frequentItems)).setParent(this)
copyValues(new FPGrowthModel(uid, frequentItems, parentModel.itemSupport, inputRowCount))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so the total count n won't just equal the sum of the count of consequents, for example, because the frequent item set was pruned of infrequent sets? Darn, yeah you need to deal with the case where you know n and where you don't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there is no way to know the total size of the training dataset from the frequent itemsets unfortunately... So yes, we need to deal with it unfortunately.

@Since("2.2.0") override val uid: String,
@Since("2.2.0") @transient val freqItemsets: DataFrame)
@Since("2.2.0") @transient val freqItemsets: DataFrame,
private val itemSupport: scala.collection.Map[Any, Double],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there's no way of adding the item generic type here; it's really in the schema of the DataFrame. Does Map[_, Double] also work? I don't think it needs a change, just a side question.

If you have the support for every item, do you need the overall count here as well? the item counts have already been divided through by numTrainingRecords here. Below only itemSupport is really passed somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Map[_, Double] seems to work too. I can change it if you prefer.

The numTrainingRecords is used because it needs to be stored (and it then used when loading the model in order to recompute the itemSupport).

@Since("1.5.0") val consequent: Array[Item],
freqUnion: Double,
freqAntecedent: Double) extends Serializable {
freqAntecedent: Double,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally these frequencies would have been Longs I think, but too late. Yes, stay consistent.

*
*/
@Since("1.5.0")
def confidence: Double = freqUnion.toDouble / freqAntecedent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this .toDouble is redundant actually!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is not related to this PR but I can remove it.

// Exclude rules for 2.4.x
lazy val v24excludes = v23excludes ++ Seq(
// [SPARK-10697][ML] Add lift to Association rules
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.fpm.FPGrowthModel.this"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are for the private[ml] constructors right? OK to suppress, yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, they are the private ones.

@SparkQA
Copy link

SparkQA commented Aug 29, 2018

Test build #95407 has finished for PR 22236 at commit 706303f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me. Ideally an extra look from @mengxr or @MLnick or @jkbradley would be helpful here as it's a non-trivial addition, but think it's executed well.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, one last thought. The FPM implementation in Python and R needs an update too. I think they only expose confidence, etc in the DataFrame API, so this is really a doc change, to mention that the returned DataFrame also contains lift now.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 31, 2018

yes, thanks, forgot to update in the Scala API too, so I updated all them accordingly. Thanks.

@SparkQA
Copy link

SparkQA commented Aug 31, 2018

Test build #95548 has finished for PR 22236 at commit 2407e05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 1, 2018

Merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants