
Memory overflow using dict type features #37

@napsternxg

Description


I am trying to use word vector features for training my CRF model for named entity recognition. I am using sklearn-crfsuite for training my model, which follows the same convention for creating features as this library and is a wrapper around python-crfsuite.

The word vector features add 300 dense features per token in the sequence. Because python-crfsuite uses dictionaries to specify the features, my training data ends up taking roughly six times more memory than the CRFsuite native binary uses with the training data in text files (35 GB vs. 6.5 GB).

Is there a more memory-efficient way to specify the features than dicts alone? I believe ItemSequence is a way to pass sequences instead of lists of dictionaries. Can I also pass generators for my training features, instead of a list, to keep the memory overhead low?

I know sklearn-crfsuite uses the six.moves.zip function, which returns an iterator over the zipped X, y pairs instead of a list, so I believe it should be possible.
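A minimal sketch of the generator idea in plain Python (the helper names here are hypothetical, not part of either library's API): each sequence of feature dicts is produced lazily, so only one sequence needs to exist in memory at a time while a training loop consumes the stream.

```python
def read_sentences():
    """Stand-in for streaming tokenized sentences from disk."""
    corpus = [["John", "smiled"], ["Paris", "calls"]]
    for sent in corpus:
        yield sent

def features_for(sent):
    """Build the per-token feature dicts for one sentence."""
    return [{"word.lower": w.lower(), "bias": 1.0} for w in sent]

def feature_stream():
    # Generator: yields one feature sequence at a time instead of
    # materializing the whole training set as a list of lists of dicts.
    for sent in read_sentences():
        yield features_for(sent)

# A training loop can then consume one sequence at a time, e.g.
# for xseq, yseq in zip(feature_stream(), label_stream()):
#     trainer.append(xseq, yseq)
stream = feature_stream()
first = next(stream)
```

Whether this actually reduces peak memory depends on whether the trainer copies each appended sequence into its own internal representation and lets the Python dicts be garbage-collected.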

Could an example be added showing how to use the ItemSequence class to supply the input data?
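For reference, here is a sketch of what I have in mind, assuming python-crfsuite is installed; the `token_features` helper and the toy 2-dimensional vectors are illustrative, not part of the library.

```python
try:
    import pycrfsuite  # only needed for the ItemSequence / Trainer part
except ImportError:
    pycrfsuite = None

def token_features(token, vector):
    """Hypothetical helper: combine sparse features with dense word-vector dims."""
    feats = {"word.lower": token.lower(), "bias": 1.0}
    # Dense word-vector features, one weighted feature per dimension.
    feats.update({"v%d" % i: float(x) for i, x in enumerate(vector)})
    return feats

sentence = ["John", "lives", "in", "Paris"]
vectors = [[0.1, 0.2], [0.0, 0.3], [0.5, 0.1], [0.9, 0.4]]  # toy 2-d vectors
labels = ["B-PER", "O", "O", "B-LOC"]

xseq = [token_features(t, v) for t, v in zip(sentence, vectors)]

if pycrfsuite is not None:
    trainer = pycrfsuite.Trainer(verbose=False)
    # Wrapping the dicts in an ItemSequence converts them into crfsuite's
    # native item representation before appending.
    trainer.append(pycrfsuite.ItemSequence(xseq), labels)
```

My question is essentially whether building the ItemSequence directly (or from a generator) avoids holding the full list-of-dicts representation in memory.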
