
Conversation


@Syrux Syrux commented Apr 8, 2017

What changes were proposed in this pull request?

Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
The efficiency gain is reflected in the following graph: https://postimg.org/image/9x6ireuvn/

How was this patch tested?

Using MLlib's existing PrefixSpan tests, plus tests of my own on the 8 datasets shown in the graph. All results obtained were strictly the same as with the original implementation (without this change). dev/run-tests was also run; no errors were found.

Author: Cyril de Vogelaere [email protected]

@srowen srowen left a comment

I think that makes sense, though I'm not familiar with this implementation. The idea is that 0 is a delimiter and only needs to be added if something else was added to delimit? Do existing tests definitely cover this case?


Syrux commented Apr 8, 2017

Yes, exactly. The current implementation adds too many unnecessary delimiters. With this one-line change, delimiters are only placed where needed.

Currently there are no tests to verify that the algorithm cleans the sequences correctly. I only found that inefficiency by printing things out while I implemented other features on my local fork.

If you want, I can add some tests, but that will require a small refactor to separate the cleaning part into its own method. Calling the current method would directly call the main algorithm ... ^^'

Two of the existing tests did cover cases where sequences of zeros were left, but not in pertinent places (the Integer/String-type, variable-size-itemset tests clean a five at the end of the third sequence, leaving two zeros instead of one).

I can, however, vouch that the previous code worked just fine. The results of the old implementation and this one are the same, and they also match the results I obtained with another standalone CP-based implementation. It's just that this change makes the pre-processing more efficient.
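To make the delimiter placement concrete, here is a hedged Python sketch of the idea (the actual change is one line of Scala in PrefixSpan; the sequence values, the `frequent` set, and the function names are illustrative only):

```python
def clean_old(sequence, frequent):
    """Old behavior: a 0 delimiter is appended after every itemset,
    even when cleaning removed all of its items."""
    out = [0]
    for itemset in sequence:
        kept = [item for item in itemset if item in frequent]
        out.extend(kept)
        out.append(0)            # unconditional -> runs of zeros
    return out

def clean_new(sequence, frequent):
    """New behavior: a 0 is appended only when the itemset actually
    contributed items, so zeros never repeat."""
    out = [0]
    for itemset in sequence:
        kept = [item for item in itemset if item in frequent]
        if kept:                 # the one-line guard
            out.extend(kept)
            out.append(0)
    return out

seq = [[1, 9], [9], [2]]         # suppose item 9 is infrequent
freq = {1, 2}
print(clean_old(seq, freq))      # [0, 1, 0, 0, 2, 0] -- stray zero
print(clean_new(seq, freq))      # [0, 1, 0, 2, 0]
```

Both variants keep the same frequent items; only the delimiter placement differs, which is why the mining results stay identical.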


srowen commented Apr 8, 2017

Even a simplistic test of this case would give a lot more confidence that it's correct. If it means opening up a private[spark] method or two to make testing possible, that seems reasonable. I don't think it needs significant change. Something needs to exercise this code path.


Syrux commented Apr 8, 2017

OK, should I create a new JIRA and push the additional tests there?
Or is here completely fine, since it's related to the current change?

Tell me, and I will get the change done ASAP :)


Syrux commented Apr 8, 2017

Hello Sean, I already pushed the requested changes, in case this is the correct place to do so.
(I can just revert them if not.)

I added two new methods to allow testing: first, a method which finds all frequent items in a database; second, a method that actually cleans the database using those frequent items. Although I didn't end up using the first method, the pre-processing is now much clearer to understand, so I left the new method in. Just tell me if I need to put that piece of code back.

I also added tests for multiple types of sequence databases: specifically, when there is at most one item per itemset, when there can be multiple items per itemset, and when cleaning the database empties it. Together they should cover all cases.

Of course, the new implementation passes the tests perfectly, and the old one doesn't.
Everything else remained as is.

Tell me if the way I did it was OK. I hope it's up to standards :)


SparkQA commented Apr 9, 2017

Test build #3651 has started for PR 17575 at commit 8e5db6a.

@srowen srowen left a comment

Looking good, mostly tiny suggestions

* The map should only contain frequent items.
*
* @return The internal repr of the inputted dataset, with properly placed zero delimiters.
*/
Member

We generally start wrapping the args at a 4-space indent in this case. Then you can pull up the return type. Also, no space before the colon.

Contributor Author

Ok, the new version will fix that and the colon space.

private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
minCount : Long): Array[Item] = {

data.flatMap { itemsets =>
Member

While you're here you're welcome to write itemsets.foreach(_.foreach(item => uniqItems += item))

Member

or does itemsets.foreach(set => uniqItems ++= set) work?

Contributor Author

OK, changed the braces to parentheses. (I suppose that's what you meant; correct me if I'm wrong.)
Also, just out of curiosity, do you know if that makes any difference in performance?

Contributor Author

itemsets.foreach(set => uniqItems ++= set) does work. I will change it in my next commit, and push it once I know what to do for the flag.
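For context, the unique-items-per-sequence counting that findFrequentItems performs with flatMap and reduceByKey can be sketched in plain Python (hypothetical names; the real code operates on an RDD):

```python
from collections import Counter

def find_frequent_items(database, min_count):
    """Count each item at most once per sequence, then keep items
    whose sequence count reaches min_count."""
    counts = Counter()
    for sequence in database:
        uniq = set()
        for itemset in sequence:
            uniq.update(itemset)    # like `uniqItems ++= set`
        counts.update(uniq)         # one vote per sequence
    return sorted(item for item, c in counts.items() if c >= min_count)

db = [[[1, 2], [2]], [[1], [3]], [[2, 3]]]
print(find_frequent_items(db, 2))   # [1, 2, 3]
```

The per-sequence set is what makes the count a support count (sequences containing the item) rather than a raw occurrence count.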


val expected1 = Array(Array(0, 4, 0, 5, 0, 4, 0, 5, 0))
.map(x => x.map(y => {
if (y == 0) 0
Member

I'd put this on one line

@Syrux Syrux Apr 10, 2017

Ok, changing it to
.map(_.map(x => if (x == 0) 0 else itemToInt1(x) + 1))
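As a side note, the remap in that one-liner can be sketched in Python: the 0 delimiter is preserved, and every item is replaced by its index in the item-to-int table plus one, so no remapped item collides with the delimiter (the table below is a hypothetical example, not the one from the test):

```python
# Hypothetical item-to-int table for items 4 and 5.
item_to_int = {4: 0, 5: 1}
expected = [[0, 4, 0, 5, 0, 4, 0, 5, 0]]
# Mirror of `.map(_.map(x => if (x == 0) 0 else itemToInt1(x) + 1))`.
mapped = [[0 if y == 0 else item_to_int[y] + 1 for y in xs] for xs in expected]
print(mapped)   # [[0, 1, 0, 2, 0, 1, 0, 2, 0]]
```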

val rdd3 = sc.parallelize(sequences3, 2).cache()

val cleanedSequence3 = PrefixSpan.toDatabaseInternalRepr(rdd3, itemToInt3).collect()
val expected3: Array[Array[Int]] = Array()
Member

Nit: can this be val expected3 = Array[Array[Int]]()?

Contributor Author

Yep, it can. It even avoids a useless cast.
I will push the new version ASAP.

allItems ++= result.sorted
allItems += 0
}
}
Member

Is this the same as checking allItems.size > 1 now, rather than maintaining a flag?

@Syrux Syrux Apr 10, 2017

Yes, but allItems is an ArrayBuilder, so there is no size method.
I could change it to if (allItems.result().size > 1), but I think the performance might be worse than a flag. If you still want me to change it, I will make a new commit.

Member

I see. What about waiting to prepend the initial 0 until the end, only if the result is not empty?

@Syrux Syrux Apr 10, 2017

I am not sure about the performance of prepending to an ArrayBuilder. I will check first; back in a few minutes.

Contributor Author

Apparently, prepending is impossible on an ArrayBuilder; the method doesn't exist (http://www.scala-lang.org/api/2.12.0/scala/collection/mutable/ArrayBuilder.html).
I think the flag is our best bet for performance. Changing it to an ArrayBuffer would be far worse, since boxing would be forced on the ints it contains.
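The flag approach that was kept can be sketched in Python (a plain list stands in for the Scala ArrayBuilder; the names here are illustrative, not the patch's exact identifiers):

```python
def to_internal_repr(sequence, frequent):
    """Build the flat internal representation: a leading 0, each cleaned
    itemset followed by a 0, and an empty result when nothing survives.
    A boolean flag records whether any item was kept, avoiding a
    result()/size call on the builder."""
    all_items = [0]
    contains_freq_items = False        # the flag discussed above
    for itemset in sequence:
        result = sorted(item for item in itemset if item in frequent)
        if result:
            all_items.extend(result)
            all_items.append(0)
            contains_freq_items = True
    return all_items if contains_freq_items else []

print(to_internal_repr([[2, 1], [7]], {1, 2}))  # [0, 1, 2, 0]
print(to_internal_repr([[7], [8]], {1, 2}))     # []
```

The empty-result branch is what the "cleaning the database empties it" test exercises.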

Member

OK no problem, leave it. Just riffing while we're editing the code.

@srowen srowen left a comment
A few more small comments about the code style, but it seems OK

*/
private[fpm] def findFrequentItems[Item: ClassTag](
data: RDD[Array[Array[Item]]],
minCount: Long):
Member

Wrap this onto the previous line

Contributor Author

Ok, I will push the changes

val uniqItems = mutable.Set.empty[Item]
itemsets.foreach(set => uniqItems ++= set)
uniqItems.toIterator.map((_, 1L))
}.reduceByKey(_ + _).filter { case (_, count) =>
Member

Nit, but this should unindent one unit

Contributor Author

Ok


SparkQA commented Apr 11, 2017

Test build #3658 has finished for PR 17575 at commit 627bfe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 12, 2017

Test build #3661 has finished for PR 17575 at commit d799d46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Apr 13, 2017

Merged to master

@asfgit asfgit closed this in 095d1cb Apr 13, 2017
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
The efficiency gain is reflected in the following graph: https://postimg.org/image/9x6ireuvn/

## How was this patch tested?

Using MLlib's existing PrefixSpan tests, plus tests of my own on the 8 datasets shown in the graph. All results obtained were strictly the same as with the original implementation (without this change). dev/run-tests was also run; no errors were found.

Author: Cyril de Vogelaere <cyril.devogelaeregmail.com>

Author: Syrux <[email protected]>

Closes apache#17575 from Syrux/SPARK-20265.
