
Conversation

@hhbyyh
Contributor

@hhbyyh hhbyyh commented Apr 1, 2016

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14322

OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix.
This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate.
See this line:

val statsSum: BDM[Double] = stats.map(_._1).reduce(_ += _)

and a few lines below it.
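
For illustration, here is a minimal sketch of the kind of replacement being proposed, assuming `stats` is an RDD whose first tuple element is a `BDM[Double]` and that `k` and `vocabSize` give the matrix dimensions, as in OnlineLDAOptimizer; the exact shape of the merged change may differ.

```scala
import breeze.linalg.{DenseMatrix => BDM}

// Before: the reduce function mutates its left argument in place, which
// violates the contract that a reduce function must not modify its inputs.
// val statsSum: BDM[Double] = stats.map(_._1).reduce(_ += _)

// After (sketch): treeAggregate gives each task its own copy of the zero
// value, so accumulating into it in place is safe, and the tree-shaped
// combine avoids funneling every partition's matrix through a single reduce.
val elementWiseSum = (u: BDM[Double], v: BDM[Double]) => u += v
val statsSum: BDM[Double] = stats
  .map(_._1)
  .treeAggregate(BDM.zeros[Double](k, vocabSize))(elementWiseSum, elementWiseSum)
```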

How was this patch tested?

unit tests

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54684 has finished for PR 12106 at commit 38cf0f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
Iterator((stat, gammaPart))
}
val statsSum: BDM[Double] = stats.map(_._1).reduce(_ += _)
Contributor

This doesn't seem right because the first arg is modified in-place, which violates the reduce contract. It should be an aggregate (or treeAggregate) instead.

Member

Whoops, that's a long-standing bug... Perhaps we can just backport this PR.

Contributor Author

As far as I understand, this does not generate any computation error since it still gives the correct sum, right?

Member

I would expect so in general, but it could return corrupted results in case of a failure.

Member

This is the line which caused the original failure, so using treeAggregate here should help.

Contributor Author

I'm thinking .treeReduce(_ + _) is fine here. Internally it will transform it into treeAggregate. Let me know if I'm wrong.

Member

It would be better to modify the first argument, which treeAggregate should support. (Actually I noticed treeAggregate does not say it supports it in the docs, but it should be OK to assume. I just created [https://issues.apache.org/jira/browse/SPARK-14408] for that.)
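
To illustrate the trade-off (a sketch under the same assumptions as above, not the exact merged code): treeReduce needs a non-mutating combine, so every merge allocates a fresh matrix, while treeAggregate can keep accumulating into its zero value in place.

```scala
// treeReduce must use the pure `+`, allocating a new k-by-vocabSize matrix
// at every merge step.
val sumViaTreeReduce: BDM[Double] = stats.map(_._1).treeReduce(_ + _)

// treeAggregate hands each task its own deserialized copy of the zero value,
// so `+=` can update the accumulator in place without extra allocations
// (making that guarantee explicit in the docs is what SPARK-14408 asks for).
val sumViaTreeAggregate: BDM[Double] = stats.map(_._1)
  .treeAggregate(BDM.zeros[Double](k, vocabSize))(_ += _, _ += _)
```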

Contributor Author

Got it. Thanks.

@jkbradley
Member

I just updated the JIRA to indicate the bug here. Could you please update the PR title and description?

@hhbyyh hhbyyh changed the title [SPARK-14322] [MLlib] Use treeReduce instead of reduce in OnlineLDAOptimizer [SPARK-14322] [MLlib] Use treeAggregate instead of reduce in OnlineLDAOptimizer Apr 5, 2016
@hhbyyh
Contributor Author

hhbyyh commented Apr 6, 2016

@jkbradley Updated. I used flatMap to replace the second reduce.
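
Roughly, the idea is the following sketch (illustrative only; it assumes the second tuple element is a per-partition collection, which may not match the merged code exactly): instead of concatenating the per-partition lists pairwise with reduce, flatMap flattens them on the executors and a single collect gathers the elements.

```scala
// Before (illustrative): pairwise concatenation of per-partition lists,
// with all of the merging funneled through reduce.
// val gammaParts = stats.map(_._2).reduce(_ ++ _)

// After (illustrative): flatten on the executors, then collect once.
val gammaParts = stats.flatMap(_._2).collect()
```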

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55081 has finished for PR 12106 at commit 42ae469.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
Merging with master, and backporting to 1.6 and 1.5.
Thanks!

The backport to 1.4 was not clean, but it is probably not necessary since 1.4 is a pretty old version.

asfgit pushed a commit that referenced this pull request Apr 6, 2016
…Optimizer

## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14322

OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix.
This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate.
See this line: https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452
and a few lines below it.

## How was this patch tested?
unit tests

Author: Yuhao Yang <[email protected]>

Closes #12106 from hhbyyh/ldaTreeReduce.

(cherry picked from commit 8cffcb6)
Signed-off-by: Joseph K. Bradley <[email protected]>
asfgit pushed a commit that referenced this pull request Apr 6, 2016
@asfgit asfgit closed this in 8cffcb6 Apr 6, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 7, 2016