
Memory overflow using dict type features #37

@napsternxg

Description


I am trying to use word vector features for training my CRF model for named entity recognition. I am using sklearn-crfsuite for training my model, which follows the same convention for creating features as this library and is a wrapper around python-crfsuite.

The word vector features add 300 dense features per token in the sequence. Because python-crfsuite uses dictionaries to specify the features, my training data ends up taking roughly six times more memory than the CRFsuite native binary uses with the training data in text files (35 GB vs. 6.5 GB).

Is there a more memory-efficient way to specify the features than dicts alone? I believe ItemSequence is a way to pass sequences instead of lists of dictionaries. Can I also pass generators for my training features, instead of a list, to keep the memory overhead low?

I know sklearn-crfsuite uses the six.moves.zip function, which returns an iterator over the zipped X, y pairs instead of a list, so I believe it should be possible.
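A minimal sketch of the generator idea in plain Python (the helper names here are hypothetical, not part of either library's API): each sequence of feature dicts is produced lazily, so only one sequence needs to exist in memory at a time while a training loop consumes the stream.

```python
def read_sentences():
    """Stand-in for streaming tokenized sentences from disk."""
    corpus = [["John", "smiled"], ["Paris", "calls"]]
    for sent in corpus:
        yield sent

def features_for(sent):
    """Build the per-token feature dicts for one sentence."""
    return [{"word.lower": w.lower(), "bias": 1.0} for w in sent]

def feature_stream():
    # Generator: yields one feature sequence at a time instead of
    # materializing the whole training set as a list of lists of dicts.
    for sent in read_sentences():
        yield features_for(sent)

# A training loop can then consume one sequence at a time, e.g.
# for xseq, yseq in zip(feature_stream(), label_stream()):
#     trainer.append(xseq, yseq)
stream = feature_stream()
first = next(stream)
```

Whether this actually reduces peak memory depends on whether the trainer copies each appended sequence into its own internal representation and lets the Python dicts be garbage-collected.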

Could an example be added showing how to use the ItemSequence class to supply the input data?
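For reference, here is a sketch of what I have in mind, assuming python-crfsuite is installed; the `token_features` helper and the toy 2-dimensional vectors are illustrative, not part of the library.

```python
try:
    import pycrfsuite  # only needed for the ItemSequence / Trainer part
except ImportError:
    pycrfsuite = None

def token_features(token, vector):
    """Hypothetical helper: combine sparse features with dense word-vector dims."""
    feats = {"word.lower": token.lower(), "bias": 1.0}
    # Dense word-vector features, one weighted feature per dimension.
    feats.update({"v%d" % i: float(x) for i, x in enumerate(vector)})
    return feats

sentence = ["John", "lives", "in", "Paris"]
vectors = [[0.1, 0.2], [0.0, 0.3], [0.5, 0.1], [0.9, 0.4]]  # toy 2-d vectors
labels = ["B-PER", "O", "O", "B-LOC"]

xseq = [token_features(t, v) for t, v in zip(sentence, vectors)]

if pycrfsuite is not None:
    trainer = pycrfsuite.Trainer(verbose=False)
    # Wrapping the dicts in an ItemSequence converts them into crfsuite's
    # native item representation before appending.
    trainer.append(pycrfsuite.ItemSequence(xseq), labels)
```

My question is essentially whether building the ItemSequence directly (or from a generator) avoids holding the full list-of-dicts representation in memory.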
