104 changes: 104 additions & 0 deletions docs/reference/query-dsl/script-score-query.asciidoc
@@ -173,6 +173,110 @@ between a given query vector and document vectors.
--------------------------------------------------
// NOTCONSOLE

For dense_vector fields, `l1norm` calculates L^1^ distance
(Manhattan distance) between a given query vector and
document vectors.

[source,js]
--------------------------------------------------
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "l1norm(params.queryVector, doc['my_dense_vector'])",
"params": {
"queryVector": [4, 3.4, -0.2]
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE

Note that, unlike `cosineSimilarity`, which represents
similarity, `l1norm` and the `l2norm` shown below represent distances or
Member

nit: not exactly sure since I'm no native speaker, but I would expect "the l2norm shown below".

differences. This means that the more similar two vectors are,
Contributor Author

mose -> more

Member

... the more similar two vectors are, the smaller is the score produced by the ...

the smaller the scores produced by the `l1norm` and `l2norm` functions.
Thus, if you need more similar vectors to score higher, you should
Member

Probably simpler and easier to understand: This means you need to reverse... if you want vectors to increase the search score.

reverse the output from `l1norm` and `l2norm`:

`"source": "1 / l1norm(params.queryVector, doc['my_dense_vector'])"`
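One caveat with a plain reciprocal: `1 / l1norm(...)` divides by zero when the document vector exactly matches the query vector, so the distance is 0. A common alternative is `1 / (1 + distance)`, which stays finite and preserves the ordering. The sketch below is a standalone illustration of this point, not Elasticsearch code:

```java
// Sketch: turning a distance into a similarity-style score.
// Hypothetical helper, not part of the Elasticsearch API.
public class DistanceToScore {
    // Plain 1/d blows up when d == 0 (identical vectors).
    // 1/(1 + d) stays finite and preserves the ordering.
    static double reversed(double distance) {
        return 1.0 / (1.0 + distance);
    }

    static double l1norm(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] q = {4, 3.4, -0.2};
        double d = l1norm(q, q);          // identical vectors -> distance 0
        System.out.println(reversed(d));  // prints 1.0 instead of dividing by zero
    }
}
```

In a script this would read, for example, `"source": "1 / (1 + l1norm(params.queryVector, doc['my_dense_vector']))"`.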

For sparse_vector fields, `l1normSparse` calculates the same L^1^ distance
between a given query vector and document vectors.
Member

Maybe just me, but this sounds a bit like the distance calculation is different from the above? I guess it's not, you just need this for sparse vectors?

Contributor Author

@cbuescher Thanks Christoph. Right, this is exactly the same function as l1norm but for sparse vectors. Every function for dense vectors is duplicated with a `Sparse` suffix for sparse vectors.


[source,js]
--------------------------------------------------
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "l1normSparse(params.queryVector, doc['my_sparse_vector'])",
"params": {
"queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE

For dense_vector fields, `l2norm` calculates L^2^ distance
(Euclidean distance) between a given query vector and
document vectors.

[source,js]
--------------------------------------------------
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "l2norm(params.queryVector, doc['my_dense_vector'])",
"params": {
"queryVector": [4, 3.4, -0.2]
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE

Similarly, for sparse_vector fields, `l2normSparse` calculates L^2^ distance
between a given query vector and document vectors.

[source,js]
--------------------------------------------------
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "l2normSparse(params.queryVector, doc['my_sparse_vector'])",
"params": {
"queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE


NOTE: If a document doesn't have a value for a vector field on which
a vector function is executed, 0 is returned as a result
for this document.
@@ -32,6 +32,53 @@ public class ScoreScriptUtils {

//**************FUNCTIONS FOR DENSE VECTORS

/**
* Calculate l1 norm - Manhattan distance
* between a query's dense vector and documents' dense vectors
*
* @param queryVector the query vector parsed as {@code List<Number>} from json
* @param dvs VectorScriptDocValues representing encoded documents' vectors
*/
public static double l1norm(List<Number> queryVector, VectorScriptDocValues.DenseVectorScriptDocValues dvs){
BytesRef value = dvs.getEncodedValue();
if (value == null) return 0;
Member

nit: enclose block in brackets

float[] docVector = VectorEncoderDecoder.decodeDenseVector(value);

int dims = Math.min(queryVector.size(), docVector.length);
Member

Should the function throw an error if the two vectors are of different length? Since this is the non-sparse case I think the result might otherwise be misleading.

Contributor Author
@mayya-sharipova Mar 21, 2019

You are right, and I will adjust the code accordingly. I think we have decided to be lenient for missing dimensions. Is this fine with you? I will add more explanation about leniency for missing dimensions to the documentation.

Member

I don't understand where being lenient for dense vectors would be useful here. When would you want to calculate the l1-norm for two vectors of different dimensionality? I would think this is always a programming error that needs correction (so an error would be better) in most cases, but maybe I'm missing something.
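If the strict behaviour suggested here were adopted, the guard would be a length check at the top of the function. A standalone sketch under that assumption, not the actual Elasticsearch code:

```java
// Sketch of a strict dimensionality check for dense vectors.
// Standalone illustration; not the actual Elasticsearch implementation.
public class StrictL1 {
    static double l1norm(double[] queryVector, double[] docVector) {
        if (queryVector.length != docVector.length) {
            // fail fast instead of silently computing over the shorter prefix
            throw new IllegalArgumentException("query vector has " + queryVector.length
                + " dimensions, document vector has " + docVector.length);
        }
        double sum = 0;
        for (int i = 0; i < queryVector.length; i++) {
            sum += Math.abs(queryVector[i] - docVector[i]);
        }
        return sum;
    }
}
```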

int dim = 0;
double l1norm = 0;
Iterator<Number> queryVectorIter = queryVector.iterator();
while(dim < dims) {
Member

nit: space

l1norm += Math.abs(queryVectorIter.next().doubleValue() - docVector[dim]);
dim++;
}
return l1norm;
}

/**
* Calculate l2 norm - Euclidean distance
* between a query's dense vector and documents' dense vectors
*
* @param queryVector the query vector parsed as {@code List<Number>} from json
* @param dvs VectorScriptDocValues representing encoded documents' vectors
*/
public static double l2norm(List<Number> queryVector, VectorScriptDocValues.DenseVectorScriptDocValues dvs){
BytesRef value = dvs.getEncodedValue();
if (value == null) return 0;
Member

same as above

float[] docVector = VectorEncoderDecoder.decodeDenseVector(value);

int dims = Math.min(queryVector.size(), docVector.length);
Member

Same here, as a user I think I'd like to get an error if the dimensions differ rather than get an "incomplete" result.

int dim = 0;
double l2norm = 0;
Iterator<Number> queryVectorIter = queryVector.iterator();
while(dim < dims) {
Member

nit: space

double diff = queryVectorIter.next().doubleValue() - docVector[dim];
l2norm += diff * diff;
dim++;
}
return Math.sqrt(l2norm);
}

/**
* Calculate a dot product between a query's dense vector and documents' dense vectors
*
@@ -100,6 +147,122 @@ private static double intDotProduct(List<Number> v1, float[] v2){


//**************FUNCTIONS FOR SPARSE VECTORS
/**
* Calculate l1 norm - Manhattan distance
* between a query's sparse vector and documents' sparse vectors
*
* L1NormSparse is implemented as a class to use
* painless script caching to prepare queryVector
* only once per script execution for all documents.
* A user will call `l1normSparse(params.queryVector, doc['my_vector'])`
*/
public static final class L1NormSparse {
final double[] queryValues;
final int[] queryDims;

// prepare queryVector once per script execution
// queryVector represents a map of dimensions to values
public L1NormSparse(Map<String, Number> queryVector) {
//break vector into two arrays dims and values
int n = queryVector.size();
queryDims = new int[n];
queryValues = new double[n];
int i = 0;
for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) {
try {
queryDims[i] = Integer.parseInt(dimValue.getKey());
} catch (final NumberFormatException e) {
throw new IllegalArgumentException("Failed to parse a query vector dimension, it must be an integer!", e);
}
queryValues[i] = dimValue.getValue().doubleValue();
i++;
}
// Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions
sortSparseDimsDoubleValues(queryDims, queryValues, n);
}

public double l1normSparse(VectorScriptDocValues.SparseVectorScriptDocValues dvs) {
BytesRef value = dvs.getEncodedValue();
if (value == null) return 0;
Member

nit: block with brackets

int[] docDims = VectorEncoderDecoder.decodeSparseVectorDims(value);
float[] docValues = VectorEncoderDecoder.decodeSparseVector(value);
int queryIndex = 0;
int docIndex = 0;
double l1norm = 0;
// find common dimensions between the query vector and the document vector and calculate l1norm based on them
while (queryIndex < queryDims.length && docIndex < docDims.length) {
if (queryDims[queryIndex] == docDims[docIndex]) {
l1norm += Math.abs(queryValues[queryIndex] - docValues[docIndex]);
queryIndex++;
docIndex++;
} else if (queryDims[queryIndex] > docDims[docIndex]) {
docIndex++;
} else {
queryIndex++;
}
Member

I have a general question about computing these norms for sparse vectors. The way this is implemented currently, it looks like it completely ignores dimensions that are not present in both vectors, e.g. if a = [1:1, 2:1] and b = [3:1, 4:1] then the result is 0. I would expect it to be 4 though.
My assumption (at least that's how I remember sparse vector behaviour from other libraries) is that each dimension that is not present is implicitly regarded as a 0, so it should be taken into account in the computation.
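Following this suggestion, a merge loop that treats absent dimensions as zeros has to add the unmatched values from both sides, including the remaining tails of either array. A standalone sketch of that variant (not the Elasticsearch implementation):

```java
// Sketch of an L1 distance over sorted sparse vectors that treats
// missing dimensions as zeros, per the review comment above.
// Standalone illustration, not the Elasticsearch implementation.
public class SparseL1 {
    // dims arrays must be sorted ascending; values aligned with dims.
    static double l1norm(int[] qDims, double[] qVals, int[] dDims, double[] dVals) {
        int qi = 0, di = 0;
        double sum = 0;
        while (qi < qDims.length && di < dDims.length) {
            if (qDims[qi] == dDims[di]) {
                sum += Math.abs(qVals[qi++] - dVals[di++]);
            } else if (qDims[qi] < dDims[di]) {
                sum += Math.abs(qVals[qi++]);  // dimension missing from doc -> doc value is 0
            } else {
                sum += Math.abs(dVals[di++]);  // dimension missing from query -> query value is 0
            }
        }
        while (qi < qDims.length) sum += Math.abs(qVals[qi++]);  // query-only tail
        while (di < dDims.length) sum += Math.abs(dVals[di++]);  // doc-only tail
        return sum;
    }
}
```

With the reviewer's example a = [1:1, 2:1] and b = [3:1, 4:1], this yields 4 rather than 0.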

Contributor Author

@cbuescher thanks, this is a great feedback. I will adjust the code accordingly

}
return l1norm;
}
}
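The `sortSparseDimsDoubleValues` helper called in the constructor is not part of this diff; conceptually it co-sorts the two parallel arrays by dimension. A minimal sketch of such a helper (the actual Elasticsearch implementation may differ):

```java
// Sketch: co-sort parallel arrays (dims ascending, values following their dims).
// The real sortSparseDimsDoubleValues helper lives elsewhere in the codebase
// and may be implemented differently.
public class ParallelSort {
    static void sortByDims(int[] dims, double[] values, int n) {
        // insertion sort, moving both arrays in lockstep
        for (int i = 1; i < n; i++) {
            int d = dims[i];
            double v = values[i];
            int j = i - 1;
            while (j >= 0 && dims[j] > d) {
                dims[j + 1] = dims[j];
                values[j + 1] = values[j];
                j--;
            }
            dims[j + 1] = d;
            values[j + 1] = v;
        }
    }
}
```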

/**
* Calculate l2 norm - Euclidean distance
* between a query's sparse vector and documents' sparse vectors
*
* L2NormSparse is implemented as a class to use
* painless script caching to prepare queryVector
* only once per script execution for all documents.
* A user will call `l2normSparse(params.queryVector, doc['my_vector'])`
*/
public static final class L2NormSparse {
final double[] queryValues;
final int[] queryDims;

// prepare queryVector once per script execution
// queryVector represents a map of dimensions to values
public L2NormSparse(Map<String, Number> queryVector) {
//break vector into two arrays dims and values
int n = queryVector.size();
queryDims = new int[n];
queryValues = new double[n];
int i = 0;
for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) {
try {
queryDims[i] = Integer.parseInt(dimValue.getKey());
} catch (final NumberFormatException e) {
throw new IllegalArgumentException("Failed to parse a query vector dimension, it must be an integer!", e);
}
queryValues[i] = dimValue.getValue().doubleValue();
i++;
}
// Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions
sortSparseDimsDoubleValues(queryDims, queryValues, n);
Member

This looks almost identical to the constructor in the above class. Maybe this can be shared to a large extent.

Contributor Author
@mayya-sharipova Mar 21, 2019

@cbuescher Thanks, Christoph. Indeed I was also thinking about how to share the code. But I am working under 2 constraints here:

  1. painless script. Painless script has requirements on how the code should be structured to use caching and bindings.
  2. performance. I was thinking of having a single function that iterates over the query and document vectors, passing it a lambda - a BiFunction - as an argument that tells it what computation to do. But BiFunction accepts only classes, and I did not want to convert primitive floats to Float instances. This would significantly slow down computations.

Still, I will keep thinking about how this code can be restructured.
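On the boxing concern: `java.util.function` ships primitive specialisations, so a shared merge loop could take a `DoubleBinaryOperator` without boxing (the stored floats widen to double, which the accumulating code does anyway). A sketch of that idea, not the code that was eventually merged:

```java
import java.util.function.DoubleBinaryOperator;

// Sketch: sharing the common-dimension merge loop between l1 and l2
// via the primitive-specialised DoubleBinaryOperator, avoiding boxing.
// Illustrative only; not necessarily how the PR was finally structured.
public class SharedNorm {
    // dims arrays must be sorted ascending; values aligned with dims.
    static double accumulate(int[] qDims, double[] qVals, int[] dDims, double[] dVals,
                             DoubleBinaryOperator term) {
        int qi = 0, di = 0;
        double acc = 0;
        // same common-dimension walk as the methods above
        while (qi < qDims.length && di < dDims.length) {
            if (qDims[qi] == dDims[di]) {
                acc += term.applyAsDouble(qVals[qi++], dVals[di++]);
            } else if (qDims[qi] > dDims[di]) {
                di++;
            } else {
                qi++;
            }
        }
        return acc;
    }

    static double l1(int[] qd, double[] qv, int[] dd, double[] dv) {
        return accumulate(qd, qv, dd, dv, (a, b) -> Math.abs(a - b));
    }

    static double l2(int[] qd, double[] qv, int[] dd, double[] dv) {
        return Math.sqrt(accumulate(qd, qv, dd, dv, (a, b) -> (a - b) * (a - b)));
    }
}
```

Whether Painless's binding requirements allow this structure is a separate question, as discussed above.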

Member

Sure, I didn't think about performance issues with boxing in lambdas. Does Painless prohibit e.g. L1NormSparse and L2NormSparse sharing a common abstract superclass, e.g. for sharing the common code in the constructor?

}

public double l2normSparse(VectorScriptDocValues.SparseVectorScriptDocValues dvs) {
BytesRef value = dvs.getEncodedValue();
if (value == null) return 0;
int[] docDims = VectorEncoderDecoder.decodeSparseVectorDims(value);
float[] docValues = VectorEncoderDecoder.decodeSparseVector(value);
int queryIndex = 0;
int docIndex = 0;
double l2norm = 0;
// find common dimensions between the query vector and the document vector and calculate l2norm based on them
while (queryIndex < queryDims.length && docIndex < docDims.length) {
if (queryDims[queryIndex] == docDims[docIndex]) {
double diff = queryValues[queryIndex] - docValues[docIndex];
l2norm += diff * diff;
queryIndex++;
docIndex++;
} else if (queryDims[queryIndex] > docDims[docIndex]) {
docIndex++;
} else {
queryIndex++;
}
Member

Same general remarks as above. Also, the implementations of the two norms look very similar with the exception of how the vector diffs are treated (squared vs. just summing them). I think it would be worth trying to share huge parts of the function and maybe only use a differing lambda to do the different calculations.

}
return Math.sqrt(l2norm);
}
}

/**
* Calculate a dot product between a query's sparse vector and documents' sparse vectors
@@ -25,8 +25,12 @@ class org.elasticsearch.index.query.VectorScriptDocValues$SparseVectorScriptDocV
}

static_import {
double l1norm(List, VectorScriptDocValues.DenseVectorScriptDocValues) from_class org.elasticsearch.index.query.ScoreScriptUtils
double l2norm(List, VectorScriptDocValues.DenseVectorScriptDocValues) from_class org.elasticsearch.index.query.ScoreScriptUtils
double cosineSimilarity(List, VectorScriptDocValues.DenseVectorScriptDocValues) bound_to org.elasticsearch.index.query.ScoreScriptUtils$CosineSimilarity
double dotProduct(List, VectorScriptDocValues.DenseVectorScriptDocValues) from_class org.elasticsearch.index.query.ScoreScriptUtils
double l1normSparse(Map, VectorScriptDocValues.SparseVectorScriptDocValues) bound_to org.elasticsearch.index.query.ScoreScriptUtils$L1NormSparse
double l2normSparse(Map, VectorScriptDocValues.SparseVectorScriptDocValues) bound_to org.elasticsearch.index.query.ScoreScriptUtils$L2NormSparse
double dotProductSparse(Map, VectorScriptDocValues.SparseVectorScriptDocValues) bound_to org.elasticsearch.index.query.ScoreScriptUtils$DotProductSparse
double cosineSimilaritySparse(Map, VectorScriptDocValues.SparseVectorScriptDocValues) bound_to org.elasticsearch.index.query.ScoreScriptUtils$CosineSimilaritySparse
}
@@ -25,6 +25,8 @@
import org.elasticsearch.index.query.ScoreScriptUtils.CosineSimilarity;
import org.elasticsearch.index.query.ScoreScriptUtils.DotProductSparse;
import org.elasticsearch.index.query.ScoreScriptUtils.CosineSimilaritySparse;
import org.elasticsearch.index.query.ScoreScriptUtils.L1NormSparse;
import org.elasticsearch.index.query.ScoreScriptUtils.L2NormSparse;

import java.util.Arrays;
import java.util.HashMap;
@@ -33,6 +35,9 @@

import static org.elasticsearch.index.mapper.VectorEncoderDecoderTests.mockEncodeDenseVector;
import static org.elasticsearch.index.query.ScoreScriptUtils.dotProduct;
import static org.elasticsearch.index.query.ScoreScriptUtils.l1norm;
import static org.elasticsearch.index.query.ScoreScriptUtils.l2norm;

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

@@ -53,6 +58,14 @@ public void testDenseVectorFunctions() {
CosineSimilarity cosineSimilarity = new CosineSimilarity(queryVector);
double result2 = cosineSimilarity.cosineSimilarity(dvs);
assertEquals("cosineSimilarity result is not equal to the expected value!", 0.78, result2, 0.1);

// test l1norm
double result3 = l1norm(queryVector, dvs);
assertEquals("l1norm result is not equal to the expected value!", 485.18, result3, 0.1);

// test l2norm
double result4 = l2norm(queryVector, dvs);
assertEquals("l2norm result is not equal to the expected value!", 301.36, result4, 0.1);
Member

I'd like to see an additional test for differing vector lengths, probably asserting that this throws an error if we decide to go that route.

}

public void testSparseVectorFunctions() {
@@ -78,5 +91,15 @@ public void testSparseVectorFunctions() {
CosineSimilaritySparse cosineSimilaritySparse = new CosineSimilaritySparse(queryVector);
double result2 = cosineSimilaritySparse.cosineSimilaritySparse(dvs);
assertEquals("cosineSimilaritySparse result is not equal to the expected value!", 0.78, result2, 0.1);

// test l1norm
L1NormSparse l1Norm = new L1NormSparse(queryVector);
double result3 = l1Norm.l1normSparse(dvs);
assertEquals("l1normSparse result is not equal to the expected value!", 485.18, result3, 0.1);

// test l2norm
L2NormSparse l2Norm = new L2NormSparse(queryVector);
double result4 = l2Norm.l2normSparse(dvs);
assertEquals("l2normSparse result is not equal to the expected value!", 301.36, result4, 0.1);
Member

These tests don't cover the cases mentioned above where the queryVector contains dimensions not present in the document vector and vice versa.

}
}