@rezazadeh
Contributor

Use iterators in columnSimilarities so that the output of mapPartitionsWithIndex can spill to disk. This matters when a column is large and dense: Spark can then spill the pairs to disk as they are produced, instead of every pair being built in memory before it is handed to Spark.

Another PR coming to update documentation.
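
The difference described above can be illustrated with a minimal sketch in plain Scala, without Spark. It is not the code from this PR: the `PairSketch` object, the `Row` alias, and both function names are hypothetical, and the sampling step of the real columnSimilarities is left out. Each function has the shape of a function passed to a per-partition operation: it takes an iterator of rows and returns an iterator of column pairs.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical illustration only; names and types are assumptions,
// not the MLlib implementation.
object PairSketch {
  // One sparse row as (columnIndex, value) entries.
  type Row = Seq[(Int, Double)]
  type Pair = ((Int, Int), Double)

  // Eager: every pair of the partition is buffered in memory
  // before the iterator is returned to the caller.
  def pairsEager(rows: Iterator[Row]): Iterator[Pair] = {
    val buf = ListBuffer.empty[Pair]
    rows.foreach { row =>
      for ((i, vi) <- row; (j, vj) <- row if i < j) {
        val pair: Pair = ((i, j), vi * vj)
        buf += pair
      }
    }
    buf.iterator
  }

  // Lazy: pairs are produced one at a time as the caller pulls them,
  // so the consumer can process or spill them incrementally.
  def pairsLazy(rows: Iterator[Row]): Iterator[Pair] =
    rows.flatMap { row =>
      row.iterator.flatMap { case (i, vi) =>
        row.iterator.collect { case (j, vj) if i < j => ((i, j), vi * vj) }
      }
    }
}
```

Both functions yield the same pairs in the same order. The lazy version never holds more than the current row and the pair being emitted, which is the property that lets the consumer of the iterator decide when to write data out.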

@rezazadeh changed the title from "[MLlib] [SPARK-6713] Iterators in columnSimilarities for flatMap" to "[MLlib] [SPARK-6713] Iterators in columnSimilarities for mapPartitionsWithIndex" on Apr 5, 2015
@srowen
Member

srowen commented Apr 5, 2015

Yeah that looks like a great change to avoid allocating so much memory at once.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29721/

@mengxr
Contributor

mengxr commented Apr 6, 2015

Merged into master. Thanks!

@asfgit closed this in 30363ed on Apr 6, 2015
karlhigley added a commit to karlhigley/lexrank-summarizer that referenced this pull request Jun 1, 2015
This could happen during mapPartitionsWithIndex in a dense and large column -
this way Spark can spill the pairs onto disk instead of building all the pairs
before handing them to Spark.

(See SPARK-6713, apache/spark#5364, from which this code
is lifted.)
4 participants