Conversation

@hhbyyh (Contributor) commented Nov 18, 2015

jira: https://issues.apache.org/jira/browse/SPARK-11813

I found this problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has two benefits:

  1. A performance improvement, since less data is serialized.
  2. A large increase in the vocabulary capacity of Word2Vec.

Currently, the closure in Word2Vec's fit mainly consists of the serialized Word2Vec object and two global tables. The dominant part of Word2Vec is vocab, at roughly 40 * 2 * 4 = 320 bytes per word (two length-40 Int arrays, 4 bytes per element). The two global tables together take vectorSize * 8 bytes per word; with vectorSize = 20, that is 160 bytes per word.

Their sum cannot exceed Int.MaxValue, due to the restriction in ByteArrayOutputStream. In any case, avoiding serialization of vocab shrinks the serialized closure, especially when vectorSize is small, and therefore allows a larger vocabulary.
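To make the capacity limit concrete, here is a back-of-the-envelope calculation under the per-word estimates above (the 480-bytes-per-word figure assumes vectorSize = 20):

```scala
// Rough headroom under the estimates above: ~320 bytes/word for vocab
// plus ~160 bytes/word for the two global tables at vectorSize = 20.
val bytesPerWord = 320 + 160
val maxVocab = Int.MaxValue.toLong / bytesPerWord
println(f"~$maxVocab%,d words before the closure hits Int.MaxValue")
// ~4.47 million words; dropping the 320-byte vocab term leaves
// 160 bytes/word, roughly tripling the headroom at vectorSize = 20.
```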

There is another possible fix: make local copies of the fields so that Word2Vec is not pulled into the closure at all. Let me know if that approach is preferred.
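For reference, a minimal sketch of the fix this PR takes. The field names follow mllib's Word2Vec, but `VocabWord` here is an illustrative stand-in rather than the exact source:

```scala
import scala.collection.mutable

// Illustrative stand-in for the per-word record; the two length-40
// Int arrays account for the 320-byte-per-word estimate above.
case class VocabWord(word: String, cn: Int,
                     point: Array[Int], code: Array[Int], codeLen: Int)

class Word2Vec extends Serializable {
  // Both structures are built on the driver and broadcast before
  // mapPartitions, so executors never need these copies; @transient
  // keeps them out of the closure if the instance itself is captured.
  @transient private var vocab: Array[VocabWord] = Array.empty
  @transient private var vocabHash = mutable.HashMap.empty[String, Int]
}
```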

@SparkQA commented Nov 18, 2015

Test build #46204 has finished for PR 9803 at commit 028138a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • public class JavaGradientBoostedTreeClassifierExample
      • public class JavaGradientBoostedTreeRegressorExample
      • public class JavaRandomForestClassifierExample
      • public class JavaRandomForestRegressorExample

@srowen (Member) commented Nov 18, 2015

To make this change, you'd need to argue that some of these fields are only needed on the driver. Is that true? Or else, why is this object being serialized at all if it mostly operates on the driver? I recognize it's Serializable. That is to say, simply marking things @transient is suspicious: it at least indicates a suboptimal design somewhere.

@hhbyyh (Contributor, Author) commented Nov 18, 2015

Yes. Word2Vec as an object does not need to be serialized. The mapPartitions call in the fit method uses some members of the outer class (Word2Vec), which is what causes the serialization. And before mapPartitions runs, vocab is already wrapped in a broadcast variable.
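A sketch of the capture mechanics inside fit; `sc` (the SparkContext) and `corpus: RDD[Seq[String]]` are hypothetical names, not the exact source:

```scala
val bcVocab = sc.broadcast(vocab)  // vocab already travels this way

val counts = corpus.mapPartitions { sentences =>
  // Referencing `vocab` directly here compiles to Word2Vec.this.vocab
  // and drags the whole Word2Vec object into the closure. Reading it
  // through the broadcast handle captures only bcVocab:
  val knownWords = bcVocab.value.map(_.word).toSet
  sentences.map(sentence => sentence.count(knownWords))
}
```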

@srowen (Member) commented Nov 18, 2015

Can we then avoid getting Word2Vec into a closure entirely by modifying its usage?

@hhbyyh (Contributor, Author) commented Nov 18, 2015

Yes, by making local copies of quite a few member variables inside the method, as in the sketch below.
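The pattern would look like this; `window` and `vocabHash` stand for member fields being copied, and the names are hypothetical:

```scala
// Copy fields into plain local vals before the closure: the lambda
// then captures the locals by value instead of Word2Vec.this.
val localWindow = window
val bcVocabHash = sc.broadcast(vocabHash)

val indexed = corpus.mapPartitions { sentences =>
  // Only localWindow and bcVocabHash are serialized with the closure;
  // the Word2Vec instance stays on the driver.
  sentences.map(s => s.flatMap(bcVocabHash.value.get).take(localWindow))
}
```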

asfgit pushed a commit that referenced this pull request Nov 18, 2015

jira: https://issues.apache.org/jira/browse/SPARK-11813

I found this problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has two benefits:
1. A performance improvement, since less data is serialized.
2. A large increase in the vocabulary capacity of Word2Vec.

Currently, the closure in Word2Vec's fit mainly consists of the serialized Word2Vec object and two global tables. The dominant part of Word2Vec is vocab, at roughly 40 * 2 * 4 = 320 bytes per word; the two global tables together take vectorSize * 8 bytes per word, i.e. 160 bytes per word when vectorSize = 20.

Their sum cannot exceed Int.MaxValue, due to the restriction in ByteArrayOutputStream. In any case, avoiding serialization of vocab shrinks the serialized closure, especially when vectorSize is small, and therefore allows a larger vocabulary.

There is another possible fix: make local copies of the fields to avoid pulling Word2Vec into the closure. Let me know if that is preferred.

Author: Yuhao Yang <[email protected]>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in e391abd Nov 18, 2015
@mengxr (Contributor) commented Nov 18, 2015

It is good practice in Spark to mark variables @transient when we know they are not used remotely. It is tricky to avoid serializing the entire class into a closure, because it is not easy to tell which method or variable pulls in the parent object. We could create local variables instead, but it might not be worth the complexity.

LGTM. Merged into all branches since 1.1.
