
Conversation


@Syrux Syrux commented Apr 8, 2017

What changes were proposed in this pull request?

Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
The efficiency gain is reflected in the following graph: https://postimg.org/image/9x6ireuvn/

How was this patch tested?

Using MLlib's existing PrefixSpan tests, plus tests of my own on the 8 datasets shown in the graph. All results obtained were strictly the same as with the original implementation (without this change). dev/run-tests was also run; no errors were found.

Author: Cyril de Vogelaere [email protected]

@srowen srowen left a comment

I think that makes sense, though I'm not familiar with this implementation. The idea is that 0 is a delimiter and only needs to be added if something else was added to delimit? Do existing tests definitely cover this case?


Syrux commented Apr 8, 2017

Yes, exactly. The current implementation adds too many unnecessary delimiters. With this one-line change, delimiters are only placed where needed.

Currently there are no tests to verify that the algorithm cleans the sequences correctly. I only found that inefficiency by printing things out while I implemented other features on my local fork.

If you want, I can add some tests, but that will require a small refactor to separate the cleaning part into its own method. Calling the current method would directly call the main algorithm ... ^^'

Two of the existing tests did cover cases where sequences of zeros were left, but not in pertinent places (the Integer/String-type, variable-size-itemset tests clean a five at the end of the third sequence, leaving two zeros instead of one).

I can, however, vouch that the previous code worked just fine. The results of the old implementation and this one are the same, and they also match the results I obtained with another standalone CP-based implementation. It's just that this change makes the pre-processing more efficient.
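To make the delimiter placement concrete, here is a hedged Python sketch of the idea (the actual change is one line of Scala in PrefixSpan; the sequence values, the `frequent` set, and the function names are illustrative only):

```python
def clean_old(sequence, frequent):
    """Old behavior: a 0 delimiter is appended after every itemset,
    even when cleaning removed all of its items."""
    out = [0]
    for itemset in sequence:
        kept = [item for item in itemset if item in frequent]
        out.extend(kept)
        out.append(0)            # unconditional -> runs of zeros
    return out

def clean_new(sequence, frequent):
    """New behavior: a 0 is appended only when the itemset actually
    contributed items, so zeros never repeat."""
    out = [0]
    for itemset in sequence:
        kept = [item for item in itemset if item in frequent]
        if kept:                 # the one-line guard
            out.extend(kept)
            out.append(0)
    return out

seq = [[1, 9], [9], [2]]         # suppose item 9 is infrequent
freq = {1, 2}
print(clean_old(seq, freq))      # [0, 1, 0, 0, 2, 0] -- stray zero
print(clean_new(seq, freq))      # [0, 1, 0, 2, 0]
```

Both variants keep the same frequent items; only the delimiter placement differs, which is why the mining results stay identical.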


srowen commented Apr 8, 2017

Even a simplistic test of this case would give a lot more confidence that it's correct. If it means opening up a private[spark] method or two to make testing possible, that seems reasonable. I don't think it needs significant change. Something needs to exercise this code path.


Syrux commented Apr 8, 2017

OK, should I create a new JIRA and push the additional tests there?
Or is here completely fine, since it's related to the current change?

Tell me, and I will get the change done ASAP :)


Syrux commented Apr 8, 2017

Hello Sean, I already pushed the requested changes, in case this is the correct place to do so.
(I can just revert them if not.)

I added two new methods to allow testing: first, a method which finds all frequent items in a database; second, a method that actually cleans the database using those frequent items. Although I didn't end up using the first method, the pre-processing is now much clearer to understand, so I left the new method in. Just tell me if I need to put that piece of code back.

I also added tests for multiple types of sequence databases: specifically, when there is at most one item per itemset, when there can be multiple items per itemset, and when cleaning the database empties it. Together they should cover all cases.

Of course, the new implementation passes the tests perfectly, and the old one doesn't.
Everything else remained as is.

Tell me if the way I did it was OK. I hope it's up to standards :)


SparkQA commented Apr 9, 2017

Test build #3651 has started for PR 17575 at commit 8e5db6a.

@srowen srowen left a comment

Looking good, mostly tiny suggestions

* The map should only contain frequent items.
*
* @return The internal repr of the inputted dataset, with properly placed zero delimiters.
*/
Member

We generally start wrapping the args at a 4-space indent in this case. Then you can pull up the return type. Also, no space before the colon.

Contributor Author

Ok, the new version will fix that and the colon space.

private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
minCount : Long): Array[Item] = {

data.flatMap { itemsets =>
Member

While you're here you're welcome to write itemsets.foreach(_.foreach(item => uniqItems += item))

Member

or does itemsets.foreach(set => uniqItems ++= set) work?

Contributor Author

OK, changed the braces to parentheses. (I suppose that's what you meant; correct me if I'm wrong.)
Also, just out of curiosity, do you know if that makes any difference in performance?

Contributor Author

itemsets.foreach(set => uniqItems ++= set) does work. I will change it in my next commit, and push it once I know what to do for the flag.
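For context, the unique-items-per-sequence counting that findFrequentItems performs with flatMap and reduceByKey can be sketched in plain Python (hypothetical names; the real code operates on an RDD):

```python
from collections import Counter

def find_frequent_items(database, min_count):
    """Count each item at most once per sequence, then keep items
    whose sequence count reaches min_count."""
    counts = Counter()
    for sequence in database:
        uniq = set()
        for itemset in sequence:
            uniq.update(itemset)    # like `uniqItems ++= set`
        counts.update(uniq)         # one vote per sequence
    return sorted(item for item, c in counts.items() if c >= min_count)

db = [[[1, 2], [2]], [[1], [3]], [[2, 3]]]
print(find_frequent_items(db, 2))   # [1, 2, 3]
```

The per-sequence set is what makes the count a support count (sequences containing the item) rather than a raw occurrence count.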


val expected1 = Array(Array(0, 4, 0, 5, 0, 4, 0, 5, 0))
.map(x => x.map(y => {
if (y == 0) 0
Member

I'd put this on one line

@Syrux Syrux Apr 10, 2017

Ok, changing it to
.map(_.map(x => if (x == 0) 0 else itemToInt1(x) + 1))
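As a side note, the remap in that one-liner can be sketched in Python: the 0 delimiter is preserved, and every item is replaced by its index in the item-to-int table plus one, so no remapped item collides with the delimiter (the table below is a hypothetical example, not the one from the test):

```python
# Hypothetical item-to-int table for items 4 and 5.
item_to_int = {4: 0, 5: 1}
expected = [[0, 4, 0, 5, 0, 4, 0, 5, 0]]
# Mirror of `.map(_.map(x => if (x == 0) 0 else itemToInt1(x) + 1))`.
mapped = [[0 if y == 0 else item_to_int[y] + 1 for y in xs] for xs in expected]
print(mapped)   # [[0, 1, 0, 2, 0, 1, 0, 2, 0]]
```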

val rdd3 = sc.parallelize(sequences3, 2).cache()

val cleanedSequence3 = PrefixSpan.toDatabaseInternalRepr(rdd3, itemToInt3).collect()
val expected3: Array[Array[Int]] = Array()
Member

Nit: can this be val expected3 = Array[Array[Int]]()?

Contributor Author

Yep, it can. It even avoids a useless cast.
I will push the new version ASAP.

allItems ++= result.sorted
allItems += 0
}
}
Member

Is this the same as checking allItems.size > 1 now, rather than maintaining a flag?

@Syrux Syrux Apr 10, 2017

Yes, but allItems is an ArrayBuilder, so there is no size method.
I could change it to if (allItems.result().size > 1), but I think the performance might be worse than a flag. If you still want me to change it, I will make a new commit.

Member

I see. What about waiting to prepend the initial 0 until the end, only if the result is not empty?

@Syrux Syrux Apr 10, 2017

I am not sure about the performance of prepending to an ArrayBuilder. I will check first; back in a few minutes.

Contributor Author

Apparently, prepending is impossible on an ArrayBuilder; the method doesn't exist (http://www.scala-lang.org/api/2.12.0/scala/collection/mutable/ArrayBuilder.html).
I think the flag is our best bet for performance. Changing it to an ArrayBuffer would be far worse, since boxing would be forced on the ints it contains.
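The flag approach that was kept can be sketched in Python (a plain list stands in for the Scala ArrayBuilder; the names here are illustrative, not the patch's exact identifiers):

```python
def to_internal_repr(sequence, frequent):
    """Build the flat internal representation: a leading 0, each cleaned
    itemset followed by a 0, and an empty result when nothing survives.
    A boolean flag records whether any item was kept, avoiding a
    result()/size call on the builder."""
    all_items = [0]
    contains_freq_items = False        # the flag discussed above
    for itemset in sequence:
        result = sorted(item for item in itemset if item in frequent)
        if result:
            all_items.extend(result)
            all_items.append(0)
            contains_freq_items = True
    return all_items if contains_freq_items else []

print(to_internal_repr([[2, 1], [7]], {1, 2}))  # [0, 1, 2, 0]
print(to_internal_repr([[7], [8]], {1, 2}))     # []
```

The empty-result branch is what the "cleaning the database empties it" test exercises.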

Member

OK no problem, leave it. Just riffing while we're editing the code.

@srowen srowen left a comment
A few more small comments about the code style, but it seems OK

*/
private[fpm] def findFrequentItems[Item: ClassTag](
data: RDD[Array[Array[Item]]],
minCount: Long):
Member

Wrap this onto the previous line

Contributor Author

Ok, I will push the changes

val uniqItems = mutable.Set.empty[Item]
itemsets.foreach(set => uniqItems ++= set)
uniqItems.toIterator.map((_, 1L))
}.reduceByKey(_ + _).filter { case (_, count) =>
Member

Nit, but this should unindent one unit

Contributor Author

Ok


SparkQA commented Apr 11, 2017

Test build #3658 has finished for PR 17575 at commit 627bfe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 12, 2017

Test build #3661 has finished for PR 17575 at commit d799d46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Apr 13, 2017

Merged to master

@asfgit asfgit closed this in 095d1cb Apr 13, 2017
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
The efficiency gain is reflected in the following graph: https://postimg.org/image/9x6ireuvn/

## How was this patch tested?

Using MLlib's existing PrefixSpan tests, plus tests of my own on the 8 datasets shown in the graph. All results obtained were strictly the same as with the original implementation (without this change). dev/run-tests was also run; no errors were found.

Author: Cyril de Vogelaere <cyril.devogelaeregmail.com>

Author: Syrux <[email protected]>

Closes apache#17575 from Syrux/SPARK-20265.
