Remove <unk> token and index from experimental Vocab #1027
base: main
|
cc @bentrevett Please review the PR and let us know if you have any other suggestions. |
|
@zhangguanheng66 All looks good to me. |
|
Maybe "default" is a better name than "fallback" since it's akin to the default kwarg passed to dict.get. |
self.assertEqual(v['not_in_it'], 0)
v.insert_token('not_in_it', 0)
v.set_default_index(0)
self.assertEqual(v.get_default_index(), 0)
You probably also want to check what numbers these tokens correspond to
self.assertEqual(v['not_in_it'], 0)
self.assertEqual(v['<unk>'], 0)
|
Related to this would be a test that verifies the behavior of insert_token( |
test/experimental/test_vocab.py
def test_has_no_unk(self):
    c = OrderedDict()
    v = vocab(c)
    self.assertEqual(v.get_default_index(), -1)
I think it's better if this were to return "None". You should be able to do this easily by using c10::optional<int64_t> instead of int64_t in the C++ code.
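To illustrate the suggestion, here is a minimal Python sketch of an "unset" default index represented as None instead of the -1 sentinel. The class `DefaultIndexSketch` is a hypothetical stand-in written for this illustration and is not the torchtext implementation; it assumes that an optional integer on the C++ side would surface as None in Python.

```python
from typing import Dict, Optional


class DefaultIndexSketch:
    """Hypothetical stand-in for illustration only; not torchtext's Vocab."""

    def __init__(self) -> None:
        self._stoi: Dict[str, int] = {}
        # None means "no default index has been set"; no -1 sentinel is needed.
        self._default_index: Optional[int] = None

    def set_default_index(self, index: Optional[int]) -> None:
        self._default_index = index

    def get_default_index(self) -> Optional[int]:
        return self._default_index

    def __getitem__(self, token: str) -> int:
        if token in self._stoi:
            return self._stoi[token]
        if self._default_index is None:
            # Assumed behavior: fail loudly when no default is available.
            raise RuntimeError(
                f"Token {token!r} not found and default index is not set"
            )
        return self._default_index


sketch = DefaultIndexSketch()
unset_value = sketch.get_default_index()  # None, since nothing was set
sketch.set_default_index(0)
set_value = sketch.get_default_index()    # 0
```

With this shape, the test above would assert `assertIsNone(v.get_default_index())` instead of comparing against -1.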
|
As an aside, while we're introducing this for the Vocab we should probably as a follow-up also introduce the same concepts to the Vectors class |
82a14a9 to b40d8dd
cpuhrsch left a comment
Before we merge this we also need to support reassignment. A special token might show up in the dataset and end up inadvertently being mapped to the wrong index. For example, a token such as '<unk>' might show up in the dataset used to build this Vocab, but really the user wants it to be mapped to index "0" (which is what Vocab currently does).
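The reassignment scenario can be sketched on a plain index-to-token list. The helper `reassign_token` below is a hypothetical function written for this illustration; its name, signature, and error behavior are assumptions and do not describe the API proposed in this PR.

```python
from typing import List


def reassign_token(itos: List[str], token: str, index: int) -> List[str]:
    """Return a copy of `itos` with `token` moved to position `index`.

    Hypothetical helper for illustration; not part of torchtext.
    """
    if token not in itos:
        raise KeyError(token)
    if not 0 <= index < len(itos):
        raise IndexError(index)
    # Drop the token from its current position, then put it where the user wants it.
    remaining = [t for t in itos if t != token]
    remaining.insert(index, token)
    return remaining


# '<unk>' was picked up from the data and landed at index 2.
itos = ["hello", "world", "<unk>"]
itos = reassign_token(itos, "<unk>", 0)
stoi = {tok: i for i, tok in enumerate(itos)}
print(stoi["<unk>"])  # 0
```

Note that moving a token shifts the indices of the tokens between the old and new positions, so any reassignment would need to happen before indices are consumed downstream.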
This PR removes the default '<unk>' token, along with its index, from experimental.vocab. Fix #1016
In the experimental vocabulary, there will be no special symbols or user-reserved symbols. Instead, we add a built-in default index, and users are required to call the set_default_index function explicitly to set it. If it is not set, the vocabulary raises an error when the default index would be needed. With the set_default_index function, users have the flexibility to have or not have a default index. For special symbols (e.g. '<unk>', '<pad>'), users should insert the tokens with the existing method self.insert_token(token: str, index: int). Later on, when users need the index of a special symbol, they can obtain it by calling the vocab instance. For example:
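A minimal sketch of this workflow, using a hypothetical pure-Python stand-in (`SketchVocab`) instead of the actual torchtext class, might look like the following. The method names mirror those mentioned above, but the internals, the error types, and the use of `v[token]` for lookup (as in the tests under review) are assumptions.

```python
from collections import OrderedDict
from typing import List, Mapping, Optional


class SketchVocab:
    """Hypothetical stand-in for illustration only; not torchtext's Vocab."""

    def __init__(self, ordered_dict: Mapping[str, int]) -> None:
        # Keys are tokens, values are frequencies (ignored in this sketch).
        self._itos: List[str] = list(ordered_dict.keys())
        self._default_index: Optional[int] = None

    def insert_token(self, token: str, index: int) -> None:
        if token in self._itos:
            raise RuntimeError(f"Token {token!r} already exists")
        if not 0 <= index <= len(self._itos):
            raise IndexError(index)
        # Later tokens shift up by one position.
        self._itos.insert(index, token)

    def set_default_index(self, index: Optional[int]) -> None:
        self._default_index = index

    def __getitem__(self, token: str) -> int:
        try:
            return self._itos.index(token)
        except ValueError:
            if self._default_index is None:
                raise RuntimeError(
                    f"Token {token!r} not found and default index is not set"
                ) from None
            return self._default_index


v = SketchVocab(OrderedDict([("hello", 4), ("world", 3)]))
v.insert_token("<unk>", 0)        # special symbols are ordinary tokens
v.insert_token("<pad>", 1)
v.set_default_index(v["<unk>"])   # opt in to a default index explicitly
print(v["<pad>"])       # 1
print(v["not_in_it"])   # 0, the default index
```

Because no token is treated specially, leaving out the `set_default_index` call makes an out-of-vocabulary lookup fail instead of silently returning an index.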