Commit 307f27e

hhbyyh authored and mengxr committed
[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits:
1. Better performance, because less data is serialized.
2. A much larger capacity for Word2Vec.

Currently, in the fit of Word2Vec, the closure mainly includes the serialized Word2Vec instance and 2 global tables.
The main part of Word2Vec is the vocab, of size: vocab * 40 * 2 * 4 = 320 vocab
The 2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab.

Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab reduces the size of the serialized closure, especially when vectorSize is small, and thus allows a larger vocabulary.

There is another possible fix: make local copies of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred.

Author: Yuhao Yang <[email protected]>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <[email protected]>
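The size estimates in the commit message can be turned into a rough capacity bound. The sketch below is only an illustration: the object and method names are hypothetical, the per-word byte counts are the commit's own estimates, and everything else in the closure is ignored.

```scala
object ClosureBudget {
  // Per-word estimate for the vocab, from the commit message: 40 * 2 * 4 = 320 bytes.
  val vocabBytesPerWord: Long = 40L * 2 * 4

  // Per-word estimate for the 2 global tables, from the commit message: vectorSize * 8 bytes.
  def tableBytesPerWord(vectorSize: Int): Long = vectorSize.toLong * 8

  // Largest vocabulary whose estimated closure size still fits in Int.MaxValue bytes.
  def maxVocab(vectorSize: Int, vocabSerialized: Boolean): Long = {
    val vocabPart = if (vocabSerialized) vocabBytesPerWord else 0L
    Int.MaxValue.toLong / (tableBytesPerWord(vectorSize) + vocabPart)
  }

  def main(args: Array[String]): Unit = {
    println(maxVocab(20, vocabSerialized = true))  // vocab included in the closure
    println(maxVocab(20, vocabSerialized = false)) // vocab excluded from the closure
  }
}
```

Under these estimates and vectorSize = 20, the bound is about 4.47 million words with the vocab in the closure and about 13.4 million without it, a threefold increase.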
1 parent 4b6e24e commit 307f27e

1 file changed: +2 -2 lines changed

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

Lines changed: 2 additions & 2 deletions
@@ -127,8 +127,8 @@ class Word2Vec extends Serializable with Logging {
 
   private var trainWordsCount = 0
   private var vocabSize = 0
-  private var vocab: Array[VocabWord] = null
-  private var vocabHash = mutable.HashMap.empty[String, Int]
+  @transient private var vocab: Array[VocabWord] = null
+  @transient private var vocabHash = mutable.HashMap.empty[String, Int]
 
   private def learnVocab(words: RDD[String]): Unit = {
     vocab = words.map(w => (w, 1))
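To illustrate the effect of the annotation, here is a minimal sketch that uses plain Java serialization outside Spark. The two classes are hypothetical stand-ins for Word2Vec and differ only in the `@transient` marker; the sizes they produce say nothing about the real class.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in: the vocab field is written out with the object.
class WithoutTransient extends Serializable {
  private var vocab: Array[String] = Array.fill(10000)("word")
  def size: Int = vocab.length
}

// Hypothetical stand-in: the vocab field is skipped by Java serialization.
class WithTransient extends Serializable {
  @transient private var vocab: Array[String] = Array.fill(10000)("word")
  def size: Int = vocab.length
}

object TransientDemo {
  // Serialize an object into memory and return the number of bytes written.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    println(s"without @transient: ${serializedSize(new WithoutTransient)} bytes")
    println(s"with @transient:    ${serializedSize(new WithTransient)} bytes")
  }
}
```

One consequence of this design is that a transient field is `null` after deserialization, so code that runs on the deserialized copy must not read it through `this`.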
