Feature request: Built-in support for ELMo/BERT embeddings #5159

@sfiruch

Intro

ML.NET is not yet well equipped for some natural-language processing (NLP) workloads. Basic processing steps (tokenization, stop-word removal, ...) and sentiment analysis are already supported, but other, higher-level workloads are not.

Feature request

  • Pre-trained document embeddings, such as BERT or GloVe, are useful for downstream tasks.
  • It would be even better to have an easy-to-use API for fine-tuning or training custom models on custom datasets.

Use case

In our specific use case, we develop document classifiers. We have only a limited set of labeled documents to train with. Our plan is to use pre-trained (or custom-trained) document embeddings and learn a simple classifier on top, using the labeled documents.
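
To make the intended pipeline concrete, here is a minimal, language-agnostic sketch in Python. It is not ML.NET code, and every name in it is hypothetical: `WORD_VECTORS` is a made-up two-dimensional lookup table standing in for real pre-trained embeddings (BERT or GloVe vectors would be far larger), `embed` averages word vectors into a document vector, and a nearest-centroid rule plays the role of the "simple classifier on top".

```python
# Toy stand-in for pre-trained word vectors (assumption: values are invented
# for illustration; a real system would load BERT/GloVe embeddings instead).
WORD_VECTORS = {
    "invoice": (0.9, 0.1),
    "payment": (0.8, 0.2),
    "refund": (0.7, 0.1),
    "contract": (0.1, 0.9),
    "clause": (0.2, 0.8),
    "liability": (0.1, 0.7),
}
DIM = 2


def embed(text):
    """Document embedding: the average of the vectors of all known words."""
    known = [WORD_VECTORS[t] for t in text.lower().split() if t in WORD_VECTORS]
    if not known:
        return [0.0] * DIM  # no known words: fall back to the zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def train_centroids(labeled_docs):
    """Simple classifier on top: one mean embedding (centroid) per label."""
    by_label = {}
    for text, label in labeled_docs:
        by_label.setdefault(label, []).append(embed(text))
    return {
        label: [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]
        for label, vecs in by_label.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    vec = embed(text)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )


# A small labeled set, as in the use case: the embeddings stay frozen and
# only the centroids are learned from the labeled documents.
train = [
    ("invoice payment overdue", "finance"),
    ("refund for the payment", "finance"),
    ("contract clause review", "legal"),
    ("liability under the contract", "legal"),
]
centroids = train_centroids(train)
print(classify("please send the invoice", centroids))
```

Because the embeddings are fixed, only the per-label centroids depend on the labeled data, which is why this setup can work with few labeled documents. In practice the centroid rule would likely be replaced by a trained linear classifier.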

Workarounds

There is already a project that runs BERT as ONNX on top of ML.NET, see https://github.com/GerjanVlot/BERT-ML.NET. I'd like to see this become an official part of ML.NET, with a good API, properly maintained and updated.

Outlook

These models are building blocks for other features, like entity recognition (#630). Ideally, ML.NET would support many more NLP tasks, such as those listed in https://github.com/microsoft/nlp-recipes#content. More generally, we are seeing an uptick in NLP-related project inquiries.

Metadata

    Labels

    P2: Priority of the issue for triage purpose: Needs to be fixed at some point.
    classification: Bugs related classification tasks
    enhancement: New feature or request
