Conversation

@hhbyyh (Contributor) commented Nov 18, 2015

jira: https://issues.apache.org/jira/browse/SPARK-11813

I found this problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has two benefits:

  1. A performance improvement, since less data is serialized.
  2. A large increase in the vocabulary capacity of Word2Vec.

Currently, the closure in Word2Vec's fit mainly consists of the serialized Word2Vec object and two global tables. The dominant part of Word2Vec is vocab, at roughly 40 * 2 * 4 = 320 bytes per word (two length-40 Int arrays, 4 bytes per element). The two global tables together take vectorSize * 8 bytes per word; with vectorSize = 20, that is 160 bytes per word.

Their sum cannot exceed Int.MaxValue, due to the restriction in ByteArrayOutputStream. In any case, avoiding serialization of vocab shrinks the serialized closure, especially when vectorSize is small, and therefore allows a larger vocabulary.
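To make the capacity limit concrete, here is a back-of-the-envelope calculation under the per-word estimates above (the 480-bytes-per-word figure assumes vectorSize = 20):

```scala
// Rough headroom under the estimates above: ~320 bytes/word for vocab
// plus ~160 bytes/word for the two global tables at vectorSize = 20.
val bytesPerWord = 320 + 160
val maxVocab = Int.MaxValue.toLong / bytesPerWord
println(f"~$maxVocab%,d words before the closure hits Int.MaxValue")
// ~4.47 million words; dropping the 320-byte vocab term leaves
// 160 bytes/word, roughly tripling the headroom at vectorSize = 20.
```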

There is another possible fix: make local copies of the fields so that Word2Vec is not pulled into the closure at all. Let me know if that approach is preferred.
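For reference, a minimal sketch of the fix this PR takes. The field names follow mllib's Word2Vec, but `VocabWord` here is an illustrative stand-in rather than the exact source:

```scala
import scala.collection.mutable

// Illustrative stand-in for the per-word record; the two length-40
// Int arrays account for the 320-byte-per-word estimate above.
case class VocabWord(word: String, cn: Int,
                     point: Array[Int], code: Array[Int], codeLen: Int)

class Word2Vec extends Serializable {
  // Both structures are built on the driver and broadcast before
  // mapPartitions, so executors never need these copies; @transient
  // keeps them out of the closure if the instance itself is captured.
  @transient private var vocab: Array[VocabWord] = Array.empty
  @transient private var vocabHash = mutable.HashMap.empty[String, Int]
}
```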

@SparkQA commented Nov 18, 2015

Test build #46204 has finished for PR 9803 at commit 028138a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • public class JavaGradientBoostedTreeClassifierExample
      • public class JavaGradientBoostedTreeRegressorExample
      • public class JavaRandomForestClassifierExample
      • public class JavaRandomForestRegressorExample

@srowen (Member) commented Nov 18, 2015

To make this change, you'd need to argue that some of these fields are only needed on the driver. Is that true? Or else, why is this object being serialized at all if it mostly operates on the driver? I recognize it's Serializable. That is to say, simply marking things @transient is suspicious: it at least indicates a suboptimal design somewhere.

@hhbyyh (Contributor, Author) commented Nov 18, 2015

Yes. Word2Vec as an object does not need to be serialized. The mapPartitions call in the fit method uses some members of the outer class (Word2Vec), which is what causes the serialization. And before mapPartitions runs, vocab is already wrapped in a broadcast variable.
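A sketch of the capture mechanics inside fit; `sc` (the SparkContext) and `corpus: RDD[Seq[String]]` are hypothetical names, not the exact source:

```scala
val bcVocab = sc.broadcast(vocab)  // vocab already travels this way

val counts = corpus.mapPartitions { sentences =>
  // Referencing `vocab` directly here compiles to Word2Vec.this.vocab
  // and drags the whole Word2Vec object into the closure. Reading it
  // through the broadcast handle captures only bcVocab:
  val knownWords = bcVocab.value.map(_.word).toSet
  sentences.map(sentence => sentence.count(knownWords))
}
```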

@srowen (Member) commented Nov 18, 2015

Can we then avoid getting Word2Vec into a closure entirely by modifying its usage?

@hhbyyh (Contributor, Author) commented Nov 18, 2015

Yes, by making local copies of quite a few member variables inside the method, as in the sketch below.
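The pattern would look like this; `window` and `vocabHash` stand for member fields being copied, and the names are hypothetical:

```scala
// Copy fields into plain local vals before the closure: the lambda
// then captures the locals by value instead of Word2Vec.this.
val localWindow = window
val bcVocabHash = sc.broadcast(vocabHash)

val indexed = corpus.mapPartitions { sentences =>
  // Only localWindow and bcVocabHash are serialized with the closure;
  // the Word2Vec instance stays on the driver.
  sentences.map(s => s.flatMap(bcVocabHash.value.get).take(localWindow))
}
```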

asfgit pushed a commit that referenced this pull request Nov 18, 2015

jira: https://issues.apache.org/jira/browse/SPARK-11813

I found this problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has two benefits:
1. A performance improvement, since less data is serialized.
2. A large increase in the vocabulary capacity of Word2Vec.

Currently, the closure in Word2Vec's fit mainly consists of the serialized Word2Vec object and two global tables. The dominant part of Word2Vec is vocab, at roughly 40 * 2 * 4 = 320 bytes per word; the two global tables together take vectorSize * 8 bytes per word, i.e. 160 bytes per word when vectorSize = 20.

Their sum cannot exceed Int.MaxValue, due to the restriction in ByteArrayOutputStream. In any case, avoiding serialization of vocab shrinks the serialized closure, especially when vectorSize is small, and therefore allows a larger vocabulary.

There is another possible fix: make local copies of the fields to avoid pulling Word2Vec into the closure. Let me know if that is preferred.

Author: Yuhao Yang <[email protected]>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in e391abd Nov 18, 2015
@mengxr (Contributor) commented Nov 18, 2015

It is good practice in Spark to mark variables @transient when we know they are not used remotely. It is tricky to avoid serializing the entire class into a closure, because it is not easy to tell which method or variable pulls in the parent object. We could create local variables instead, but it might not be worth the complexity.

LGTM. Merged into all branches since 1.1.
