Description
I am trying to use word-vector features to train a CRF model for named entity recognition. I am training with sklearn-crfsuite, which follows the same feature conventions as this library and is a wrapper around python-crfsuite.
The word-vector features add 300 dense features per token in the sequence. Because python-crfsuite uses dictionaries to specify features, my training data ends up taking six times more memory than the text-file training data used by the native CRFsuite binary (35 GB vs. 6.5 GB).
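To make the overhead concrete, here is a minimal sketch (not from the issue itself; the function and feature names are illustrative) of how dense word-vector features typically end up as per-token dicts, which is where the extra memory goes:

```python
# Hypothetical sketch of per-token feature dicts that embed a
# 300-dimensional word vector. Each vector dimension becomes its own
# key/value pair, so every token carries ~300 extra dict entries.
def token_features(word, vector):
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
    }
    # One dict entry per embedding dimension: 300 additional
    # dense features per token.
    for i, value in enumerate(vector):
        feats["v%d" % i] = float(value)
    return feats

vec = [0.1] * 300  # stand-in for a real 300-d embedding
feats = token_features("Paris", vec)
print(len(feats))  # 302 features per token (300 dense + 2 sparse)
```

Python dicts store every key string and boxed float per token, which is far less compact than CRFsuite's native text format, hence the 35 GB vs. 6.5 GB gap.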
Is there a more memory-efficient way to specify the features than using dicts? I believe ItemSequence is a way to pass sequences instead of lists of dictionaries. Can I also pass generators for my training features instead of a list, to keep the memory overhead low?
I know sklearn-crfsuite uses the six.moves.zip function, which returns an iterator instead of a list of zipped X, y pairs, so I believe it should be possible.
Could there be an example of using the ItemSequence class to supply the input data?
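For reference, here is a hedged sketch of the generator approach being asked about. Trainer.append copies each sequence into CRFsuite's native storage, so yielding one (xseq, yseq) pair at a time should keep only one sequence's dicts alive in Python. The corpus and feature names are made up for illustration, and the pycrfsuite calls are left commented so the sketch stands alone:

```python
# Hedged sketch: streaming training sequences one at a time instead of
# materialising the whole feature list in memory. ItemSequence could wrap
# each xseq before appending. Names like `corpus` are hypothetical.

def feature_sequences(corpus):
    """Lazily yield (xseq, yseq) pairs, one sentence at a time."""
    for sent in corpus:
        xseq = [{"word.lower": w.lower(), "bias": 1.0} for w, _ in sent]
        yseq = [label for _, label in sent]
        yield xseq, yseq

# Toy stand-in for real annotated data: (token, label) pairs per sentence.
corpus = [[("Paris", "B-LOC"), ("is", "O")]]

# import pycrfsuite
# trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in feature_sequences(corpus):
    # trainer.append(pycrfsuite.ItemSequence(xseq), yseq)
    pass
```

Whether sklearn-crfsuite's fit() accepts a generator end-to-end is exactly the open question here; the sketch only shows the pattern at the python-crfsuite level.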