[SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure #1427

mengxr · 2014-07-16T03:02:53Z

We saw task serialization problems with large feature dimensions, which could be avoided if we don't serialize data directly into the task closure but use broadcast variables instead. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.
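For readers unfamiliar with the distinction, here is a minimal sketch of the two approaches, not the code from this PR. The object name `BroadcastSketch` and the helpers `dot`, `predictWithClosure`, and `predictWithBroadcast` are hypothetical; only `sc.broadcast` and `Broadcast.value` are Spark API. It assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical illustration; names are not taken from the PR.
object BroadcastSketch {
  // Dot product of two equal-length dense vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Closure capture: `weights` is serialized into every task closure.
  def predictWithClosure(data: RDD[Array[Double]],
                         weights: Array[Double]): RDD[Double] =
    data.map(x => dot(x, weights))

  // Broadcast: the closure captures only the broadcast handle, and
  // executors read the weights through `bcWeights.value`.
  def predictWithBroadcast(sc: SparkContext,
                           data: RDD[Array[Double]],
                           weights: Array[Double]): RDD[Double] = {
    val bcWeights: Broadcast[Array[Double]] = sc.broadcast(weights)
    data.map(x => dot(x, bcWeights.value))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)
    try {
      val weights = Array.fill(1000)(0.5)
      val data = sc.parallelize(Seq.fill(4)(Array.fill(1000)(1.0)), 2)
      val viaClosure = predictWithClosure(data, weights).collect()
      val viaBroadcast = predictWithBroadcast(sc, data, weights).collect()
      // Both paths compute the same predictions; only the task payload differs.
      assert(viaClosure.sameElements(viaBroadcast))
    } finally {
      sc.stop()
    }
  }
}
```

With a small weight vector the difference is negligible. It matters once the captured array grows with the feature dimension, which is the situation the description refers to.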

SparkQA · 2014-07-16T03:08:02Z

QA tests have started for PR 1427. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16703/consoleFull

rxin · 2014-07-16T04:14:36Z

This is out of scope for the PR, but it would be great to have some auto-broadcast mechanism ...

mengxr · 2014-07-16T04:42:10Z

Actually, the JIRA is about auto-switching between direct serialization and broadcast. It would be nice to implement that in sc.broadcast rather than making it specific to MLlib.

SparkQA · 2014-07-16T04:47:52Z

QA results for PR 1427:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16703/consoleFull

mengxr · 2014-07-16T05:32:43Z

The failed test is from streaming and is unrelated to the change in this PR.

Test Result (1 failure / +1)
org.apache.spark.streaming.NetworkReceiverSuite.block generator throttling

mengxr · 2014-07-16T05:33:30Z

Jenkins, retest this please.

SparkQA · 2014-07-16T05:38:12Z

QA tests have started for PR 1427. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16713/consoleFull

SparkQA · 2014-07-16T07:21:30Z

QA results for PR 1427:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16713/consoleFull

mateiz · 2014-07-26T22:11:36Z

Do you guys want to merge this for now, until we can see whether the RDD change goes into 1.1? Or wait for that? It does seem like a useful fix.

rxin · 2014-07-27T05:55:43Z

Merged in master. Thanks.

rxin · 2014-07-27T05:55:54Z

We can revert this if the broadcast change gets in.

…y into task closure

We saw task serialization problems with large feature dimensions, which could be avoided if we don't serialize data directly into the task closure but use broadcast variables instead. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.

Author: Xiangrui Meng

Closes apache#1427 from mengxr/broadcast-new and squashes the following commits:

b9a1228 [Xiangrui Meng] style update
b97c184 [Xiangrui Meng] minimal change to LBFGS
9ebadcc [Xiangrui Meng] add task size test to RowMatrix
9427bf0 [Xiangrui Meng] add task size tests to linear methods
e0a5cf2 [Xiangrui Meng] add task size test to GD
28a8411 [Xiangrui Meng] add test for NaiveBayes
380778c [Xiangrui Meng] update KMeans test
bccab92 [Xiangrui Meng] add task size test to LBFGS
02103ba [Xiangrui Meng] remove print
e73d68e [Xiangrui Meng] update tests for k-means
174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
1928a5a [Xiangrui Meng] add test for KMeans task size
e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast

mengxr added 14 commits July 15, 2014 10:34

- 010d076 modify NaiveBayesModel and GLM to use broadcast
- e00c2da use broadcast in GD, KMeans
- 1928a5a add test for KMeans task size
- 174cb15 use local-cluster for test with a small akka.frameSize
- e73d68e update tests for k-means
- 02103ba remove print
- bccab92 add task size test to LBFGS
- 380778c update KMeans test
- 28a8411 add test for NaiveBayes
- e0a5cf2 add task size test to GD
- 9427bf0 add task size tests to linear methods
- 9ebadcc add task size test to RowMatrix
- b97c184 minimal change to LBFGS
- b9a1228 style update
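The commit "use local-cluster for test with a small akka.frameSize" suggests a test setup along the following lines. This is a sketch under stated assumptions rather than the configuration from the PR: the object name `SmallFrameConf`, the app name, and the specific values are hypothetical, and `spark.akka.frameSize` only applies to Spark 1.x releases. The idea is that with a small frame size, a task that carries large data in its closure should fail to ship, while a task that uses a broadcast variable should still run.

```scala
import org.apache.spark.SparkConf

// Hypothetical test configuration; names and values are illustrative.
object SmallFrameConf {
  // Assumption: Spark 1.x, where spark.akka.frameSize (in MB) caps the size
  // of messages such as serialized tasks. Later releases removed this setting.
  def conf(): SparkConf = new SparkConf()
    .setMaster("local-cluster[2, 1, 512]") // 2 workers, 1 core and 512 MB each
    .setAppName("task-size-test")
    .set("spark.akka.frameSize", "1")
}
```

Local-cluster mode is used here instead of plain local mode because it launches separate worker processes, so tasks have to be serialized and sent over the wire as they would be on a real cluster.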

asfgit closed this in aaf2b73 Jul 27, 2014

mengxr mentioned this pull request Aug 3, 2014

SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data #1207

Closed
