Description
Background
The current sparse encoding model and sparse tokenizer model output the sparse embedding in the following format:
{
"world": 3.4208686,
"hello": 6.9377565
}
While this format is human-readable, it has the following limitations:
- Increased storage requirements due to storing string tokens
- Potential performance overhead when processing string keys
- Limited compatibility with some vector operations that expect numerical indices
We propose supporting two additional output formats. The first is an integer-index format, which replaces word tokens with their corresponding integer IDs from the tokenizer, reducing storage requirements while keeping the map structure.
{
"2088": 6.9377565,
"7592": 3.4208686
}
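The conversion can be sketched as a simple lookup from token to ID. Note that the vocabulary below is a hypothetical stand-in; the real IDs come from the model's tokenizer.

```python
# A minimal sketch of the word-token -> integer-index conversion, assuming a
# hypothetical tokenizer vocabulary (real IDs come from the model's tokenizer).
word_format = {"world": 3.4208686, "hello": 6.9377565}
vocab = {"world": 7592, "hello": 2088}  # stand-in vocabulary, not the real one

int_format = {str(vocab[token]): weight for token, weight in word_format.items()}
print(int_format)  # {'7592': 3.4208686, '2088': 6.9377565}
```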
The second is an array-based format, which represents the sparse vector as two parallel arrays: indices and values.
{
"indices": [2088, 7592],
"values": [6.9377565, 3.4208686]
}
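Deriving the array-based format from the integer-index map can be sketched as follows. Sorting by index keeps the parallel arrays deterministic (whether sorted order should be guaranteed is one of the open questions in this proposal).

```python
# Sketch: integer-index map -> parallel indices/values arrays, sorted by index.
int_format = {"7592": 3.4208686, "2088": 6.9377565}

pairs = sorted((int(idx), weight) for idx, weight in int_format.items())
array_format = {
    "indices": [idx for idx, _ in pairs],
    "values": [weight for _, weight in pairs],
}
print(array_format)
# {'indices': [2088, 7592], 'values': [6.9377565, 3.4208686]}
```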
Options
Option 1: Runtime Format Selection via Request Parameters (Recommended)
Users can specify the desired output format at inference time through the sparse_encoding_format parameter:
POST _plugins/_ml/models/Lkjp8ZYBTIHkDc6TJ4q6/_predict
{
"text_docs": ["hello world"],
"parameters": {
"sparse_encoding_format": "int" // Options: "word" (default), "int", and "array"
}
}
Advantages
- Flexibility to choose output format per request
- Backward compatible with existing models and implementation
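From the client side, the request above reduces to building a body with the proposed parameter. A sketch, noting that `sparse_encoding_format` is still a proposal and not an existing parameter:

```python
import json

# Sketch of the _predict request body with the proposed parameter.
payload = {
    "text_docs": ["hello world"],
    "parameters": {"sparse_encoding_format": "int"},
}

# A client would POST this body to the model's predict endpoint, e.g.:
#   POST _plugins/_ml/models/<model_id>/_predict
print(json.dumps(payload))
```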
Next Steps
I have implemented a proof of concept in this PR. If this option is chosen, I will continue working on the array format and the sparse tokenizer model.
Option 2: Model Config (with technical challenge)
Users can specify the output format during model registration:
POST /_plugins/_ml/models/_register?deploy=true
{
"name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
"version": "1.0.0",
"model_format": "TORCH_SCRIPT",
"model_config": {
"sparse_encoding_format": "int"
}
}
Limitation
This approach has a technical challenge. For pre-trained models, users do not specify model_config during model registration, so the model config received by the TextEmbeddingSparseEncodingModel class will always be null. In addition, the model configuration of pre-trained models cannot be modified after registration.
Open questions
- Should we support a numeric key format without quotes for the integer-index format?
{
2088: 6.9377565
7592: 3.4208686
}
- For the array-based format, should we guarantee that indices are sorted in ascending order?
- How should we handle the case where a user is using a remote model?
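On the first open question above: unquoted numeric keys are not valid JSON, so a strict standard parser rejects that variant. A quick check:

```python
import json

# Strict JSON requires object keys to be quoted strings, so the unquoted
# numeric-key variant is rejected by a standard parser.
try:
    json.loads("{2088: 6.9377565, 7592: 3.4208686}")
    parsed = True
except json.JSONDecodeError:
    parsed = False
print(parsed)  # False
```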