[RFC] Support additional output formats for sparse models #3865

@yuye-aws

Description

Background

The current sparse encoding and sparse tokenizer models output the sparse embedding in the following format:

{
  "world": 3.4208686,
  "hello": 6.9377565
}

While this format is human-readable, it has the following limitations:

  • Increased storage requirements due to storing string tokens
  • Potential performance overhead when processing string keys
  • Limited compatibility with some vector operations that expect numerical indices

We propose supporting two additional output formats. The first is the integer-index format, which replaces each word token with its corresponding integer ID from the tokenizer, reducing storage requirements while keeping the map structure.

{
  "2088": 3.4208686,
  "7592": 6.9377565
}

The second is the array-based format, which represents the sparse vector with two parallel arrays: indices and values.

{
  "indices": [2088, 7592],
  "values": [3.4208686, 6.9377565]
}
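To make the relationship between the three formats concrete, here is a minimal sketch of the conversions. The vocab mapping below is an assumption for illustration (IDs in the style of the BERT WordPiece vocabulary); in practice the IDs would come from the model's actual tokenizer.

```python
# Sketch: converting a word-keyed sparse embedding into the two proposed
# formats. The vocab dict is a hypothetical stand-in for the tokenizer.
word_format = {"world": 3.4208686, "hello": 6.9377565}
vocab = {"hello": 7592, "world": 2088}  # assumed tokenizer vocabulary

# Integer-index format: replace each token with its tokenizer ID
# (stringified, since JSON object keys must be strings).
int_format = {str(vocab[t]): w for t, w in word_format.items()}

# Array-based format: parallel lists of indices and values,
# sorted by index for determinism.
pairs = sorted((vocab[t], w) for t, w in word_format.items())
array_format = {
    "indices": [i for i, _ in pairs],
    "values": [w for _, w in pairs],
}
```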

Options

Option 1: Runtime Format Selection via Request Parameters (Recommended)

Users can specify the desired output format at inference time through the sparse_encoding_format parameter:

POST _plugins/_ml/models/Lkjp8ZYBTIHkDc6TJ4q6/_predict
{
  "text_docs": ["hello world"],
  "parameters": {
    "sparse_encoding_format": "int" // Options: "word" (default), "int" and "array"
  }
} 
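Under this option, the formatting step would be a pure post-processing branch on the request parameter. The sketch below is illustrative only, not the actual TextEmbeddingSparseEncodingModel code; the function name and vocab argument are assumptions.

```python
# Hypothetical post-processing dispatch on sparse_encoding_format.
def format_sparse_output(word_map, vocab, fmt="word"):
    """Convert a word-keyed sparse map to the requested output format."""
    if fmt == "word":
        return word_map
    if fmt == "int":
        return {str(vocab[t]): w for t, w in word_map.items()}
    if fmt == "array":
        pairs = sorted((vocab[t], w) for t, w in word_map.items())
        return {"indices": [i for i, _ in pairs],
                "values": [w for _, w in pairs]}
    raise ValueError(f"unsupported sparse_encoding_format: {fmt!r}")
```

Because the default remains "word", existing callers that omit the parameter see no behavior change.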

Advantages

  • Flexibility to choose output format per request
  • Backward compatible with existing models and the existing implementation

Next Steps

I have implemented a proof of concept in this PR. If this option is chosen, I will continue working on the array format and the sparse tokenizer model.

Option 2: Model Config (with a technical challenge)

Users can specify the output format during model registration:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "sparse_encoding_format": "int"
  }
}

Limitation

This approach has a technical challenge. For pre-trained models, specifying model_config during registration has no effect: the model config received by the TextEmbeddingSparseEncodingModel class will always be null. In addition, the model configuration of pre-trained models cannot be modified after registration.

Open questions

  1. Should we support a numeric key format without quotes for the integer-index format?
{
  2088: 3.4208686,
  7592: 6.9377565
}
  2. For the array-based format, should we guarantee that indices are sorted in ascending order?
  3. How should we handle the case where a user is using a remote model?
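On open question 2, one argument for guaranteeing ascending indices is that it lets consumers merge two sparse vectors in a single linear pass, with no intermediate hash map. A minimal sketch, assuming both inputs use the proposed array-based format:

```python
# Sparse dot product over two array-format vectors whose "indices"
# are guaranteed to be sorted ascending: a single linear merge.
def sparse_dot(a, b):
    i = j = 0
    total = 0.0
    while i < len(a["indices"]) and j < len(b["indices"]):
        ai, bj = a["indices"][i], b["indices"][j]
        if ai == bj:
            total += a["values"][i] * b["values"][j]
            i += 1
            j += 1
        elif ai < bj:
            i += 1  # advance the side with the smaller index
        else:
            j += 1
    return total
```

Without the sorted guarantee, every consumer would either have to sort defensively or fall back to map-based lookup.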
