This repository was archived by the owner on Sep 10, 2025. It is now read-only.

@parmeet parmeet commented Mar 8, 2021

Summary:

Update Vocab class to make use of c10::string_view to avoid copy operations during query.

Changes:

Gains:
Reduction in batch (all the tokens in a line) look-up time compared to a Python dict:

  • AG_NEWS: ~70-75%

  • SoGouNews: ~80-85%

  • Look into whether the static vocab size needs to change or whether we should implement a dynamic vocab

@parmeet parmeet changed the title [WIP][ERROR][Do Not Review] string_views [WIP][Do Not Review] string_views Mar 10, 2021
@parmeet parmeet changed the title [WIP][Do Not Review] string_views [WIP]string_views Mar 12, 2021
@parmeet parmeet changed the title [WIP]string_views [WIP] c10::string_views to avoid copies during query Mar 12, 2021
@parmeet parmeet changed the title [WIP] c10::string_views to avoid copies during query c10::string_views to avoid copies during query Mar 12, 2021
codecov bot commented Mar 14, 2021

Codecov Report

Merging #1248 (9d7a614) into master (f433716) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1248   +/-   ##
=======================================
  Coverage   78.80%   78.80%           
=======================================
  Files          67       67           
  Lines        3624     3624           
=======================================
  Hits         2856     2856           
  Misses        768      768           


const int64_t num_lines, const bool sort_tokens) {
StringList _concat_tokens(
std::vector<std::shared_ptr<
ska_ordered::order_preserving_flat_hash_map<std::string, uint32_t>>>

nit: could you also create a typedef for ska_ordered::order_preserving_flat_hash_map<std::string, uint32_t> to shorten the signature a bit?
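
A minimal sketch of what such an alias could look like. The names `TokenCountMap`, `TokenCountMapPtr`, and `count_entries` are hypothetical, and `std::unordered_map` stands in for `ska_ordered::order_preserving_flat_hash_map` only so the snippet compiles without the ska header; it does not preserve insertion order the way the real container does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for ska_ordered::order_preserving_flat_hash_map<std::string, uint32_t>.
// Assumption: any map type with the same key/value types illustrates the alias.
using TokenCountMap = std::unordered_map<std::string, uint32_t>;  // hypothetical name
using TokenCountMapPtr = std::shared_ptr<TokenCountMap>;          // hypothetical name

// Hypothetical helper: with the aliases, the parameter type fits on one line
// instead of spelling out the nested template at every use.
std::size_t count_entries(const std::vector<TokenCountMapPtr>& chunk_counters) {
  std::size_t total = 0;
  for (const auto& counter : chunk_counters) {
    assert(counter != nullptr);
    total += counter->size();
  }
  return total;
}
```

A `using` alias and a `typedef` are interchangeable here; the alias form is shown because it reads left to right.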

if (tokens_freq[item.first] - cur_token_freq < min_freq &&
tokens_freq[item.first] >= min_freq) {
- unique_tokens.push_back(item.first);
+ unique_tokens.push_back(std::string{item.first});

Shouldn't item.first already be of type std::string?

Vocab _load_vocab_from_file(const std::string &file_path,
const std::string &unk_token,
- const int64_t min_freq, const int64_t num_cpus) {
+ const uint32_t min_freq, const uint32_t num_cpus) {

What's the purpose of changing these arguments from int64 to uint32?

Will restore the original state. It's the result of trying out something else that unfortunately didn't get cleaned up properly.

@cpuhrsch cpuhrsch left a comment

The PR includes a lot of changes of int64_t to uint32_t. Is this intended?

parmeet commented Mar 18, 2021

> The PR includes a lot of changes of int64_t to uint32_t. Is this intended?

Yes, I wanted to change everything to uint32_t, but then ran into a compilation issue due to torchbind. When I reverted it, some places didn't get cleaned up properly. Let me restore things to the original state. Thanks for catching this.

@parmeet parmeet requested a review from cpuhrsch March 18, 2021 04:18
uint32_t _find(const c10::string_view &w) const {
uint32_t stoi_size = stoi_.size();
uint32_t id = _hash(w) % stoi_size;
while (stoi_[id] != -1 && c10::string_view{itos_[stoi_[id]].data(),

Is this view construction actually necessary to do the comparison? I'd have expected c10::string_view to be comparable with std::string (i.e. the entries of itos_).

Right, it's not necessary, nice catch :)
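
For illustration, a rough sketch of a linear-probing lookup that compares the stored `std::string` against the query view directly, with no temporary view. It uses `std::string_view` (C++17) as a stand-in for `c10::string_view` and `std::hash` in place of the PR's `_hash`; the `TinyVocab` type and its table size are hypothetical and not the Vocab class from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical miniature of the stoi_/itos_ layout discussed above.
struct TinyVocab {
  std::vector<int32_t> stoi_;      // slot -> index into itos_, -1 marks an empty slot
  std::vector<std::string> itos_;  // index -> token

  explicit TinyVocab(std::size_t slots) : stoi_(slots, -1) {}

  // Returns the slot holding w, or the first empty slot on its probe path.
  // Assumes the table always keeps at least one empty slot.
  uint32_t find(std::string_view w) const {
    const uint32_t size = static_cast<uint32_t>(stoi_.size());
    assert(size > 0);
    uint32_t id = static_cast<uint32_t>(std::hash<std::string_view>{}(w) % size);
    // std::string and std::string_view compare directly.
    while (stoi_[id] != -1 && itos_[stoi_[id]] != w) {
      id = (id + 1) % size;
    }
    return id;
  }

  void add(const std::string& w) {
    const uint32_t h = find(w);  // std::string converts to a view without copying
    if (stoi_[h] == -1) {
      itos_.push_back(w);
      stoi_[h] = static_cast<int32_t>(itos_.size() - 1);
    }
  }
};
```

The query string is never copied: `find` only hashes and compares through the view, and a copy is made once, in `add`, when a new token is stored.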

void _add(const std::string &w) {
uint32_t h = _find(c10::string_view{w.data(), w.size()});
if (stoi_[h] == -1) {
itos_.push_back(w);

What happens if we reach the max size?

It's a good question, which I think can be addressed in the context of a dynamic vs. static dictionary (follow-up item). As of now, the static size is 30 million, which is probably reasonable given that current vocab sizes are typically orders of magnitude lower.
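
The thread does not say what the code does when the static table fills up. As one possible guard, and not the behavior of this PR, a fixed-size open-addressing table could refuse inserts that would consume its last empty slot, since a completely full table would make a probe for an absent token loop forever. `BoundedVocab` and its members are hypothetical names, and `std::string_view` again stands in for `c10::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical fixed-size vocab that fails loudly instead of overfilling.
class BoundedVocab {
 public:
  explicit BoundedVocab(std::size_t slots) : stoi_(slots, -1) {
    assert(slots >= 2);  // one slot must always stay empty
  }

  // Returns true if w was inserted, false if it was already present.
  // Throws std::length_error if inserting would use the last empty slot.
  bool add(const std::string& w) {
    const std::size_t h = find(w);
    if (stoi_[h] != -1) return false;
    if (itos_.size() + 1 >= stoi_.size()) {
      throw std::length_error("vocab table is full");
    }
    itos_.push_back(w);
    stoi_[h] = static_cast<int64_t>(itos_.size() - 1);
    return true;
  }

  std::size_t size() const { return itos_.size(); }

 private:
  // Linear probing; terminates because at least one slot is always empty.
  std::size_t find(std::string_view w) const {
    std::size_t id = std::hash<std::string_view>{}(w) % stoi_.size();
    while (stoi_[id] != -1 && itos_[stoi_[id]] != w) {
      id = (id + 1) % stoi_.size();
    }
    return id;
  }

  std::vector<int64_t> stoi_;      // slot -> index into itos_, -1 marks empty
  std::vector<std::string> itos_;  // index -> token
};
```

A dynamic dictionary would instead grow and rehash at this point; the sketch only shows the simpler "detect and report" option for a static size.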

@cpuhrsch cpuhrsch left a comment

Seems fine; there are two comments left to resolve before merging.
