❓ Questions and Help
Description
I wrote a custom data.Dataset for multilabel classification. When I processed the data, I found that generating the train and test sets with the custom dataset is very slow (it takes about 1.5 s per example). I am wondering whether this is normal or whether something is wrong with my custom dataset.
The custom data.Dataset for multilabel classification is as follows:
```python
import torchtext.data as data
from tqdm import tqdm


class TextMultiLabelDataset(data.Dataset):
    def __init__(self, text, text_field, label_field, lbls=None,
                 n_labels=None, **kwargs):
        # torchtext Field objects
        fields = [('text', text_field), ('label', label_field)]

        is_test = lbls is None
        if not is_test:
            # infer the label-vector length from the first example;
            # in test mode, n_labels must be passed in explicitly
            n_labels = len(lbls[0])

        examples = []
        for i, txt in enumerate(tqdm(text)):
            # real label vector for train/val, all-zero dummy vector for test
            l = [0.0] * n_labels if is_test else lbls[i]
            examples.append(data.Example.fromlist([txt, l], fields))

        super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)
```
where text is a list of single-element lists, each holding one document string, and lbls is a list of binary label vectors, one per document. (Total number of labels ~20,000)
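For reference, here is a minimal sketch of how the class gets used; the Field settings and the toy train_texts / train_lbls data below are placeholders for illustration, not my exact setup:

```python
import torch
import torchtext.data as data

# Toy placeholder data in the same shape as described above.
train_texts = [["a tiny example document"], ["another short document"]]
train_lbls = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# Placeholder Field definitions -- the actual tokenizer/dtype may differ.
TEXT = data.Field(sequential=True, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False,
                   batch_first=True, dtype=torch.float)

# Training set: real label vectors are passed in.
train_ds = TextMultiLabelDataset(train_texts, TEXT, LABEL, lbls=train_lbls)

# Test set: no labels, so the vector length is supplied explicitly.
test_ds = TextMultiLabelDataset([["an unlabeled document"]], TEXT, LABEL,
                                lbls=None, n_labels=len(train_lbls[0]))
```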
examples of text:
[["There are few factors more important to the mechanisms of evolution than stress. The stress response has formed as a result of natural selection..."], ["A 46-year-old female patient presenting with unspecific lower back pain, diffuse abdominal pain, and slightly elevated body temperature"], ...]
examples of lbls:
[[1, 1, 1, 1, 0, 0, 0, 1, 0, ...], [1, 0, 1, 0, 1, 1, 1, 1, ...], ...]
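Since data.Example.fromlist runs each field's preprocess step (i.e. tokenization for the text field) on every example, it may be worth timing a single call to see whether preprocessing dominates. A minimal sketch, reusing the TEXT / LABEL placeholders from above:

```python
import time
import torchtext.data as data

fields = [('text', TEXT), ('label', LABEL)]

# Time the construction of one Example, which includes preprocessing
# both the text column and the long binary label vector.
start = time.perf_counter()
ex = data.Example.fromlist([train_texts[0], train_lbls[0]], fields)
print(f"one example took {time.perf_counter() - start:.4f}s")
```

If most of the 1.5 s shows up here, the cost is in per-example preprocessing (a heavyweight tokenizer, or handling the very long label list) rather than in the Dataset subclass itself.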