[RFC] Support additional output formats for sparse models #3865

@yuye-aws

Description

Background

The current sparse encoding and sparse tokenizer models output the sparse embedding in the following format:

{
  "world": 3.4208686,
  "hello": 6.9377565
}

While this format is human-readable, it has the following limitations:

  • Increased storage requirements due to storing string tokens
  • Potential performance overhead when processing string keys
  • Limited compatibility with some vector operations that expect numerical indices

We propose supporting two additional output formats. The first is the integer-index format, which replaces each word token with its corresponding integer ID from the tokenizer, reducing storage requirements while keeping the map structure.

{
  "2088": 3.4208686,
  "7592": 6.9377565
}

The second is the array-based format, which represents the sparse vector with two parallel arrays: indices and values.

{
  "indices": [2088, 7592],
  "values": [3.4208686, 6.9377565]
}
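To make the relationship between the three formats concrete, here is a minimal sketch of the conversions. The vocab mapping below is an assumption for illustration (IDs in the style of the BERT WordPiece vocabulary); in practice the IDs would come from the model's actual tokenizer.

```python
# Sketch: converting a word-keyed sparse embedding into the two proposed
# formats. The vocab dict is a hypothetical stand-in for the tokenizer.
word_format = {"world": 3.4208686, "hello": 6.9377565}
vocab = {"hello": 7592, "world": 2088}  # assumed tokenizer vocabulary

# Integer-index format: replace each token with its tokenizer ID
# (stringified, since JSON object keys must be strings).
int_format = {str(vocab[t]): w for t, w in word_format.items()}

# Array-based format: parallel lists of indices and values,
# sorted by index for determinism.
pairs = sorted((vocab[t], w) for t, w in word_format.items())
array_format = {
    "indices": [i for i, _ in pairs],
    "values": [w for _, w in pairs],
}
```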

Options

Option 1: Runtime Format Selection via Request Parameters (Recommended)

Users can specify the desired output format at inference time through the sparse_encoding_format parameter:

POST _plugins/_ml/models/Lkjp8ZYBTIHkDc6TJ4q6/_predict
{
  "text_docs": ["hello world"],
  "parameters": {
    "sparse_encoding_format": "int" // Options: "word" (default), "int" and "array"
  }
} 
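Under this option, the formatting step would be a pure post-processing branch on the request parameter. The sketch below is illustrative only, not the actual TextEmbeddingSparseEncodingModel code; the function name and vocab argument are assumptions.

```python
# Hypothetical post-processing dispatch on sparse_encoding_format.
def format_sparse_output(word_map, vocab, fmt="word"):
    """Convert a word-keyed sparse map to the requested output format."""
    if fmt == "word":
        return word_map
    if fmt == "int":
        return {str(vocab[t]): w for t, w in word_map.items()}
    if fmt == "array":
        pairs = sorted((vocab[t], w) for t, w in word_map.items())
        return {"indices": [i for i, _ in pairs],
                "values": [w for _, w in pairs]}
    raise ValueError(f"unsupported sparse_encoding_format: {fmt!r}")
```

Because the default remains "word", existing callers that omit the parameter see no behavior change.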

Advantages

  • Flexibility to choose output format per request
  • Backward compatible with existing models and the existing implementation

Next Steps

I have implemented a proof of concept in this PR. If this option is chosen, I will continue working on the array format and the sparse tokenizer model.

Option 2: Model Config (with a technical challenge)

Users can specify the output format during model registration:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "sparse_encoding_format": "int"
  }
}

Limitation

This approach has a technical challenge. For pre-trained models, specifying model_config during registration has no effect: the model config received by the TextEmbeddingSparseEncodingModel class will always be null. In addition, the model configuration of pre-trained models cannot be modified after registration.

Open questions

  1. Should we support a numeric key format without quotes for the integer-index format?
{
  2088: 3.4208686,
  7592: 6.9377565
}
  2. For the array-based format, should we guarantee that indices are sorted in ascending order?
  3. How should we handle the case where a user is using a remote model?
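On open question 2, one argument for guaranteeing ascending indices is that it lets consumers merge two sparse vectors in a single linear pass, with no intermediate hash map. A minimal sketch, assuming both inputs use the proposed array-based format:

```python
# Sparse dot product over two array-format vectors whose "indices"
# are guaranteed to be sorted ascending: a single linear merge.
def sparse_dot(a, b):
    i = j = 0
    total = 0.0
    while i < len(a["indices"]) and j < len(b["indices"]):
        ai, bj = a["indices"][i], b["indices"][j]
        if ai == bj:
            total += a["values"][i] * b["values"][j]
            i += 1
            j += 1
        elif ai < bj:
            i += 1  # advance the side with the smaller index
        else:
            j += 1
    return total
```

Without the sorted guarantee, every consumer would either have to sort defensively or fall back to map-based lookup.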
