@mengxr
Contributor

@mengxr mengxr commented Aug 30, 2015

  • do not cache the first cost RDD
  • change the cache level of the following cost RDDs to MEMORY_AND_DISK
  • remove the Vector wrapper to save an object per instance

Further improvements will be addressed in SPARK-10329
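
For readers unfamiliar with the cost bookkeeping in k-means|| initialization, here is a minimal pure-Python sketch (not the MLlib code) of why the first cost collection does not need to be stored. It assumes a single run and assumes the initial cost of every point is a constant placeholder (+infinity), so it can be regenerated lazily instead of cached. `squared_distance` and `update_costs` are hypothetical helper names, and costs are kept as plain floats rather than wrapped in a vector object.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_costs(points, prev_costs, new_centers):
    """For each point, keep the smaller of its previous cost and its
    squared distance to the nearest newly chosen center."""
    return [
        min(prev, min(squared_distance(p, c) for c in new_centers))
        for p, prev in zip(points, prev_costs)
    ]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]

# Assumption: the first costs are all +infinity, so they are generated
# on the fly and never materialized or cached.
initial_costs = (math.inf for _ in points)

# Round 1: only the result of the update is worth keeping around.
costs = update_costs(points, initial_costs, [(0.0, 0.0)])

# Round 2: the previous round's costs are the only state that is reused.
costs_round2 = update_costs(points, costs, [(5.0, 5.0)])
```

Only the costs produced by an update carry information, which is why persisting them (and not the placeholder) is the part that matters.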

cc: @yu-iskw @hujiayin

Contributor Author

not calling BLAS here because runs == 1 in most cases

@SparkQA

SparkQA commented Aug 30, 2015

Test build #41801 has finished for PR 8526 at commit 71db540.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaTrainValidationSplitExample
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode

@hujy
Contributor

hujy commented Aug 31, 2015

The fix reduces RDD size by around 50 GB for the data size below, and performance improves. With this data size, the user needs more than 8 GB of memory to run k-means in Spark 1.5. The data size:
Number of clusters: 5
Sample dimensions: 20
Number of samples: 1200000000
Sample per input file: 40000000
K: 10
Converge distance: 0.5
Max iteration: 10
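
As a rough back-of-envelope aid for the figures above, the sketch below computes raw sizes only. It assumes every feature and every per-point cost is stored as an 8-byte double and ignores JVM object headers, serialization, and partitioning overhead, so it does not by itself reproduce the measured 50 GB. All constant names are illustrative.

```python
NUM_SAMPLES = 1_200_000_000
DIMENSIONS = 20
SAMPLES_PER_FILE = 40_000_000
BYTES_PER_DOUBLE = 8  # assumption: values are stored as 64-bit doubles


def gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


# Number of input files implied by the listed sizes.
num_files = NUM_SAMPLES // SAMPLES_PER_FILE

# Raw feature payload: one double per dimension per sample.
raw_feature_bytes = NUM_SAMPLES * DIMENSIONS * BYTES_PER_DOUBLE

# One cost value per sample for a single run (runs == 1).
cost_bytes_per_run = NUM_SAMPLES * BYTES_PER_DOUBLE

print(f"input files: {num_files}")
print(f"raw features: {gigabytes(raw_feature_bytes):.1f} GB")
print(f"costs per run: {gigabytes(cost_bytes_per_run):.1f} GB")
```

Any per-instance wrapper object adds its own overhead on top of the 8-byte payload for each of the 1.2 billion cost entries, which is the kind of saving the third bullet of the PR description targets.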

@hujy
Contributor

hujy commented Aug 31, 2015

LGTM

asfgit pushed a commit that referenced this pull request Aug 31, 2015
…nitializaiton

* do not cache the first cost RDD
* change the cache level of the following cost RDDs to MEMORY_AND_DISK
* remove the Vector wrapper to save an object per instance

Further improvements will be addressed in SPARK-10329

cc: yu-iskw HuJiayin

Author: Xiangrui Meng <[email protected]>

Closes #8526 from mengxr/SPARK-10354.

(cherry picked from commit f0f563a)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in f0f563a Aug 31, 2015
@mengxr mengxr changed the title from "[SPARK-100354] [MLLIB] fix some apparent memory issues in k-means|| initializaiton" to "[SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initializaiton" Aug 31, 2015
@mengxr
Contributor Author

mengxr commented Aug 31, 2015

@hujiayin Thanks for testing! Merged into master, branch-1.5, 1.4, and 1.3.
