
Conversation

@hhbyyh
Contributor

@hhbyyh hhbyyh commented Apr 1, 2016

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14322

OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix.
This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate.
See this line:

val statsSum: BDM[Double] = stats.map(_._1).reduce(_ += _)

and a few lines below it.
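
For illustration, here is a minimal sketch of the kind of replacement being proposed, assuming `stats` is an RDD whose first tuple element is a `BDM[Double]` and that `k` and `vocabSize` give the matrix dimensions, as in OnlineLDAOptimizer; the exact shape of the merged change may differ.

```scala
import breeze.linalg.{DenseMatrix => BDM}

// Before: the reduce function mutates its left argument in place, which
// violates the contract that a reduce function must not modify its inputs.
// val statsSum: BDM[Double] = stats.map(_._1).reduce(_ += _)

// After (sketch): treeAggregate gives each task its own copy of the zero
// value, so accumulating into it in place is safe, and the tree-shaped
// combine avoids funneling every partition's matrix through a single reduce.
val elementWiseSum = (u: BDM[Double], v: BDM[Double]) => u += v
val statsSum: BDM[Double] = stats
  .map(_._1)
  .treeAggregate(BDM.zeros[Double](k, vocabSize))(elementWiseSum, elementWiseSum)
```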

How was this patch tested?

unit tests

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54684 has finished for PR 12106 at commit 38cf0f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
Iterator((stat, gammaPart))
}
val statsSum: BDM[Double] = stats.map(_._1).reduce(_ += _)
Contributor

This doesn't seem right because the first arg is modified in-place, which violates the reduce contract. It should be an aggregate (or treeAggregate) instead.

Member

Whoops, that's a long-standing bug... Perhaps we can just backport this PR.

Contributor Author

As far as I understand, this does not generate any computation error since it still gives the correct sum, right?

Member

I would expect so in general, but it could return corrupted results in case of a failure.

Member

This is the line which caused the original failure, so using treeAggregate here should help.

Contributor Author

I'm thinking .treeReduce(_ + _) is fine here. Internally it will transform it into treeAggregate. Let me know if I'm wrong.

Member

It would be better to modify the first argument, which treeAggregate should support. (Actually I noticed treeAggregate does not say it supports it in the docs, but it should be OK to assume. I just created [https://issues.apache.org/jira/browse/SPARK-14408] for that.)
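
To illustrate the trade-off (a sketch under the same assumptions as above, not the exact merged code): treeReduce needs a non-mutating combine, so every merge allocates a fresh matrix, while treeAggregate can keep accumulating into its zero value in place.

```scala
// treeReduce must use the pure `+`, allocating a new k-by-vocabSize matrix
// at every merge step.
val sumViaTreeReduce: BDM[Double] = stats.map(_._1).treeReduce(_ + _)

// treeAggregate hands each task its own deserialized copy of the zero value,
// so `+=` can update the accumulator in place without extra allocations
// (making that guarantee explicit in the docs is what SPARK-14408 asks for).
val sumViaTreeAggregate: BDM[Double] = stats.map(_._1)
  .treeAggregate(BDM.zeros[Double](k, vocabSize))(_ += _, _ += _)
```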

Contributor Author

Got it. Thanks.

@jkbradley
Member

I just updated the JIRA to indicate the bug here. Could you please update the PR title and description?

@hhbyyh hhbyyh changed the title [SPARK-14322] [MLlib] Use treeReduce instead of reduce in OnlineLDAOptimizer [SPARK-14322] [MLlib] Use treeAggregate instead of reduce in OnlineLDAOptimizer Apr 5, 2016
@hhbyyh
Contributor Author

hhbyyh commented Apr 6, 2016

@jkbradley Updated. I used flatMap to replace the second reduce.
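
Roughly, the idea is the following sketch (illustrative only; it assumes the second tuple element is a per-partition collection, which may not match the merged code exactly): instead of concatenating the per-partition lists pairwise with reduce, flatMap flattens them on the executors and a single collect gathers the elements.

```scala
// Before (illustrative): pairwise concatenation of per-partition lists,
// with all of the merging funneled through reduce.
// val gammaParts = stats.map(_._2).reduce(_ ++ _)

// After (illustrative): flatten on the executors, then collect once.
val gammaParts = stats.flatMap(_._2).collect()
```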

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55081 has finished for PR 12106 at commit 42ae469.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
Merging with master, and backporting to 1.6 and 1.5.
Thanks!

The backport to 1.4 was not clean, but it is probably not necessary since 1.4 is a pretty old version.

asfgit pushed a commit that referenced this pull request Apr 6, 2016
…Optimizer

## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14322

OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix.
This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate.
See this line: https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452
and a few lines below it.

## How was this patch tested?
unit tests

Author: Yuhao Yang <[email protected]>

Closes #12106 from hhbyyh/ldaTreeReduce.

(cherry picked from commit 8cffcb6)
Signed-off-by: Joseph K. Bradley <[email protected]>
asfgit pushed a commit that referenced this pull request Apr 6, 2016
@asfgit asfgit closed this in 8cffcb6 Apr 6, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 7, 2016