-
Notifications
You must be signed in to change notification settings - Fork 814
c10::string_views to avoid copies during query #1248
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1248 +/- ##
=======================================
Coverage 78.80% 78.80%
=======================================
Files 67 67
Lines 3624 3624
=======================================
Hits 2856 2856
Misses 768 768 Continue to review full report at Codecov.
|
torchtext/csrc/vocab.cpp
Outdated
| const int64_t num_lines, const bool sort_tokens) { | ||
| StringList _concat_tokens( | ||
| std::vector<std::shared_ptr< | ||
| ska_ordered::order_preserving_flat_hash_map<std::string, uint32_t>>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you also create a typedef for ska_ordered::order_preserving_flat_hash_map<std::string, uint32_t> to shorten the signature a bit?
torchtext/csrc/vocab.cpp
Outdated
| if (tokens_freq[item.first] - cur_token_freq < min_freq && | ||
| tokens_freq[item.first] >= min_freq) { | ||
| unique_tokens.push_back(item.first); | ||
| unique_tokens.push_back(std::string{item.first}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't item.first already be of type std::string?
torchtext/csrc/vocab.cpp
Outdated
| Vocab _load_vocab_from_file(const std::string &file_path, | ||
| const std::string &unk_token, | ||
| const int64_t min_freq, const int64_t num_cpus) { | ||
| const uint32_t min_freq, const uint32_t num_cpus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of changing these arguments from int64 to uint32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will restore the original state, it's a result of trying out something else and unfortunately didn't get clean up properly.
cpuhrsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR includes a lot of changes of int64_t to uint32_t. Is this intended?
Ya, I wanted to change everything to uint32_t, but then ran into compilation issue due to torchbind, then as i reverted it back, some places didn't clean up properly. Let me restore things back to original state. Thanks for catching this. |
torchtext/csrc/vocab.h
Outdated
| uint32_t _find(const c10::string_view &w) const { | ||
| uint32_t stoi_size = stoi_.size(); | ||
| uint32_t id = _hash(w) % stoi_size; | ||
| while (stoi_[id] != -1 && c10::string_view{itos_[stoi_[id]].data(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this view construction actually necessary to do the comparison? I'd have expected c10::string_view to be comparable with std::string (i.e. the entries of itos_).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it's not necessary, nice catch :)
| void _add(const std::string &w) { | ||
| uint32_t h = _find(c10::string_view{w.data(), w.size()}); | ||
| if (stoi_[h] == -1) { | ||
| itos_.push_back(w); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we reached max size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question, which I think can be addressed in the context of dynamic v/s static dictionary (follow-up item). As of now, the static size is 30 million, which is probably reasonable given the current vocab size typically is orders of magnitude lower.
cpuhrsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine, see two comments left to resolve before merging.
Summary:
Update Vocab class to make use of c10::string_view to avoid copy operations during query.
Changes:
Gains:
Reduction in batch ( made up of all the tokens in a line) look-up time compared to Python Dict:
AG_NEWS: ~70-75%
SoGouNews: ~80-85%
Look into if static vocab size need to be changed or if we should implement dynamic vocab