[SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure #1427
Conversation
|
QA tests have started for PR 1427. This patch merges cleanly. |
|
This is out of scope for the PR, but it would be great to have some auto-broadcast mechanism ... |
|
Actually the JIRA is for auto-switching between direct serialization and broadcast. It would be nice to implement it in |
|
QA results for PR 1427: |
|
The failed test is from streaming, which is irrelevant to the change in this PR. |
|
Jenkins, retest this please. |
|
QA tests have started for PR 1427. This patch merges cleanly. |
|
QA results for PR 1427: |
|
Do you guys want to merge this until we can see whether the RDD change goes into 1.1? Or wait for that? It does seem like a useful fix. |
|
Merged in master. Thanks. |
|
We can revert this if the broadcast change gets in. |
…y into task closure

We saw task serialization problems with large feature dimension, which could be avoided if we don't serialize data directly into task but use broadcast variables. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.

Author: Xiangrui Meng <[email protected]>

Closes apache#1427 from mengxr/broadcast-new and squashes the following commits:

b9a1228 [Xiangrui Meng] style update
b97c184 [Xiangrui Meng] minimal change to LBFGS
9ebadcc [Xiangrui Meng] add task size test to RowMatrix
9427bf0 [Xiangrui Meng] add task size tests to linear methods
e0a5cf2 [Xiangrui Meng] add task size test to GD
28a8411 [Xiangrui Meng] add test for NaiveBayes
380778c [Xiangrui Meng] update KMeans test
bccab92 [Xiangrui Meng] add task size test to LBFGS
02103ba [Xiangrui Meng] remove print
e73d68e [Xiangrui Meng] update tests for k-means
174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
1928a5a [Xiangrui Meng] add test for KMeans task size
e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast
This bundles Boson with Spark 3.2.0. Tested via the following commands:

```
BOSON_CONF_DIR=$HOME/git/boson/conf $SPARK_HOME/bin/spark-shell \
  --conf spark.sql.extensions=com.apple.boson.BosonSparkSessionExtensions \
  --conf spark.sql.adaptive.forceApply=true
```
We saw task serialization problems with large feature dimensions, which could be avoided by using broadcast variables instead of serializing data directly into the task closure. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.
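
To illustrate why the task size shrinks, here is a minimal toy model in plain Python. It is not Spark code and not the implementation in this PR: `BROADCAST_STORE`, `broadcast`, `closure_task`, `broadcast_task`, and `run_task` are hypothetical names, `pickle` stands in for task serialization, and the dimension and the 1 KiB threshold are arbitrary. It only shows that a task embedding the weight vector grows with the feature dimension, while a task embedding a broadcast handle stays small.

```python
import pickle

# Toy stand-in for broadcast storage (hypothetical; real Spark ships
# broadcast values to executors separately from the task itself).
BROADCAST_STORE = {}


def broadcast(value):
    """Register a value once and return a small integer handle to it."""
    handle = len(BROADCAST_STORE)
    BROADCAST_STORE[handle] = value
    return handle


def closure_task(weights):
    """Serialize a task that embeds the full weight vector."""
    return pickle.dumps({"op": "dot", "weights": weights})


def broadcast_task(handle):
    """Serialize a task that embeds only the broadcast handle."""
    return pickle.dumps({"op": "dot", "broadcast": handle})


def run_task(payload, features):
    """Deserialize a task and compute the dot product with `features`."""
    task = pickle.loads(payload)
    if "weights" in task:
        weights = task["weights"]
    else:
        weights = BROADCAST_STORE[task["broadcast"]]
    return sum(w * x for w, x in zip(weights, features))


dim = 100_000  # a "large feature dimension", chosen arbitrarily
weights = [float(i % 7) for i in range(dim)]
features = [1.0] * dim

direct = closure_task(weights)
via_broadcast = broadcast_task(broadcast(weights))

# Print both serialized task sizes in bytes.
print(len(direct), len(via_broadcast))

# A task-size check in the spirit of the tests this PR adds:
# the broadcast-based task must stay under an (assumed) 1 KiB limit.
assert len(via_broadcast) < 1024 < len(direct)
assert run_task(direct, features) == run_task(via_broadcast, features)
```

Both tasks compute the same result; only the bytes shipped per task differ. In Spark itself the analogous change is to wrap the large value with `SparkContext.broadcast` and read it through the broadcast's `value` inside the task.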