KeyError in vocab if the vocab is built on a subset of the initially read data

I'm hitting a strange issue when reading a dataset from a single file and splitting it into train, dev, and test sets. If I read the file with `TabularDataset`, split the resulting dataset, and build the vocab on the first (training) split, I get KeyErrors. If I instead split the file into separate files before reading them in, no such errors occur.

Dataset on which I've been running into this issue: https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data

**To Reproduce**
1. Read in data
2. Split the data (using `TabularDataset.split`) into n sets
3. Build your vocab on the training set
4. Iterate over the dev/test set
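
To make the four steps concrete, here is a minimal, library-free sketch of the same pattern. It deliberately does not use torchtext: the toy `rows`, the plain-dict vocabulary, and the helper names `build_vocab` and `numericalize` are all hypothetical stand-ins, so this only shows how a vocab built on one split can raise a KeyError on another. It may not be the actual cause of the error reported here.

```python
# Illustrative only: a plain dict stands in for the vocabulary.
# This is NOT torchtext's Vocab implementation.

# Hypothetical toy corpus standing in for the rows read from one file.
rows = [
    "the cat sat",
    "the dog ran",
    "a bird flew",
    "the fish swam",
    "a cat ran",
]

# Step 2: split after reading (deterministic slice for reproducibility).
train_rows, dev_rows = rows[:3], rows[3:]


def build_vocab(examples):
    """Map each distinct token to an integer index, in order of first use."""
    vocab = {}
    for text in examples:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def numericalize(text, vocab, unk_index=None):
    """Convert a sentence to indices.

    With unk_index=None, unseen tokens raise KeyError (strict lookup).
    Otherwise unseen tokens map to unk_index.
    """
    if unk_index is None:
        return [vocab[token] for token in text.split()]
    return [vocab.get(token, unk_index) for token in text.split()]


# Step 3: build the vocab on the training split only.
vocab = build_vocab(train_rows)

# Step 4: iterate over the dev split; tokens unseen in training fail
# under strict lookup, but not when an unknown index is supplied.
unk_index = len(vocab)
dev_indices = [numericalize(text, vocab, unk_index) for text in dev_rows]
```

In this sketch, calling `numericalize("the fish swam", vocab)` without `unk_index` raises a KeyError because `fish` never appears in the training split, while passing `unk_index` avoids it. Whether the same kind of strict lookup is involved in my case is something I have not confirmed.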

 - PyTorch Version (e.g., 1.0): 1.20
 - OS (e.g., Linux): OSX
 - How you installed PyTorch (`conda`, `pip`, source): pip
 - Python version: 3.7


KeyError in vocab if the vocab is built on a subset of the initially read data #642
