Add l1norm and l2norm distances for vectors #40255
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -173,6 +173,110 @@ between a given query vector and document vectors. | |
| -------------------------------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
| For dense_vector fields, `l1norm` calculates L^1^ distance | ||
| (Manhattan distance) between a given query vector and | ||
| document vectors. | ||
|
|
||
| [source,js] | ||
| -------------------------------------------------- | ||
| { | ||
| "query": { | ||
| "script_score": { | ||
| "query": { | ||
| "match_all": {} | ||
| }, | ||
| "script": { | ||
| "source": "l1norm(params.queryVector, doc['my_dense_vector'])", | ||
| "params": { | ||
| "queryVector": [4, 3.4, -0.2] | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
| Note that, unlike `cosineSimilarity` that represent | ||
| similarity, `l1norm` and the shown below `l2norm` represent distances or | ||
| differences. This means, that the mose similar are vectors, | ||
|
Contributor
Author
mose -> more
Member
... the more similar two vectors are, the smaller the score produced by the ...
||
| the less will be the scores produced by `l1norm` and `l2norm` functions. | ||
| Thus, if you need more similar vectors to score higher, you should | ||
|
Member
Probably simpler and easier to understand: This means you need to reverse... if you want vectors to increase the search score.
||
| reverse the output from `l1norm` and `l2norm`: | ||
|
|
||
| `"source": " 1/ l1norm(params.queryVector, doc['my_dense_vector'])"` | ||
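One caveat with a plain reciprocal: the distance is 0 when the query and document vectors are identical, so `1 / l1norm(...)` is not finite in that case. A common workaround is to add 1 to the distance before inverting. The sketch below shows the idea in Java; `DistanceToScore` and `toScore` are hypothetical names used only for illustration and are not part of this PR.

```java
final class DistanceToScore {
    // Hypothetical helper: maps a non-negative distance to a score in (0, 1].
    // Identical vectors (distance 0) get the maximum score 1.0, and the
    // score shrinks toward 0 as the distance grows.
    static double toScore(double distance) {
        return 1.0 / (1.0 + distance);
    }
}
```

In a script the same transformation would presumably be written as `1 / (1 + l1norm(params.queryVector, doc['my_dense_vector']))`.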
|
|
||
| For sparse_vector fields, `l1normSparse` calculates L^1^ distance | ||
| between a given query vector and document vectors. | ||
|
Member
Maybe just me, but this sounds a bit like the distance calculation is different from the above? I guess it's not, you just need this for sparse vectors?
Contributor
Author
@cbuescher Thanks, Christoph. Right, this is exactly the same function as
||
|
|
||
| [source,js] | ||
| -------------------------------------------------- | ||
| { | ||
| "query": { | ||
| "script_score": { | ||
| "query": { | ||
| "match_all": {} | ||
| }, | ||
| "script": { | ||
| "source": "l1normSparse(params.queryVector, doc['my_sparse_vector'])", | ||
| "params": { | ||
| "queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0} | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
| For dense_vector fields, `l2norm` calculates L^2^ distance | ||
| (Euclidean distance) between a given query vector and | ||
| document vectors. | ||
|
|
||
| [source,js] | ||
| -------------------------------------------------- | ||
| { | ||
| "query": { | ||
| "script_score": { | ||
| "query": { | ||
| "match_all": {} | ||
| }, | ||
| "script": { | ||
| "source": "l2norm(params.queryVector, doc['my_dense_vector'])", | ||
| "params": { | ||
| "queryVector": [4, 3.4, -0.2] | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
| Similarly, for sparse_vector fields, `l2normSparse` calculates L^2^ distance | ||
| between a given query vector and document vectors. | ||
|
|
||
| [source,js] | ||
| -------------------------------------------------- | ||
| { | ||
| "query": { | ||
| "script_score": { | ||
| "query": { | ||
| "match_all": {} | ||
| }, | ||
| "script": { | ||
| "source": "l2normSparse(params.queryVector, doc['my_sparse_vector'])", | ||
| "params": { | ||
| "queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0} | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
|
|
||
| NOTE: If a document doesn't have a value for a vector field on which | ||
| a vector function is executed, 0 is returned as a result | ||
| for this document. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,6 +32,53 @@ public class ScoreScriptUtils { | |
|
|
||
| //**************FUNCTIONS FOR DENSE VECTORS | ||
|
|
||
| /** | ||
| * Calculate l1 norm - Manhattan distance | ||
| * between a query's dense vector and documents' dense vectors | ||
| * | ||
| * @param queryVector the query vector parsed as {@code List<Number>} from json | ||
| * @param dvs VectorScriptDocValues representing encoded documents' vectors | ||
| */ | ||
| public static double l1norm(List<Number> queryVector, VectorScriptDocValues.DenseVectorScriptDocValues dvs){ | ||
| BytesRef value = dvs.getEncodedValue(); | ||
| if (value == null) return 0; | ||
|
Member
nit: enclose block in brackets
||
| float[] docVector = VectorEncoderDecoder.decodeDenseVector(value); | ||
|
|
||
| int dims = Math.min(queryVector.size(), docVector.length); | ||
|
Member
Should the function throw an error if the two vectors are of different length? Since this is the non-sparse case I think the result might otherwise be misleading.
Contributor
Author
You are right, and I will adjust the code accordingly. I think we have decided to be lenient for missing dimensions. Is this fine with you? I will add more explanation about leniency for missing dimensions to the documentation.
Member
I don't understand where being lenient for dense vectors would be useful here. When would you want to calculate the l1-norm for two vectors of different dimensionality? I would think this is always a programming error that needs correction (so an error would be better) in most cases, but maybe I'm missing something.
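For reference, the strict behaviour asked for here could look roughly like the sketch below. It is only an illustration: `StrictDenseNorms` and `l1normStrict` are hypothetical names, the document vector is passed as an already decoded `float[]` instead of `DenseVectorScriptDocValues`, and the error message is made up.

```java
import java.util.List;

final class StrictDenseNorms {
    // Hypothetical variant of l1norm that rejects vectors of different
    // dimensionality instead of silently using only the common prefix.
    static double l1normStrict(List<Number> queryVector, float[] docVector) {
        if (queryVector.size() != docVector.length) {
            throw new IllegalArgumentException(
                "query vector has " + queryVector.size()
                    + " dimensions but document vector has " + docVector.length);
        }
        double l1norm = 0;
        int dim = 0;
        for (Number queryValue : queryVector) {
            // sum of absolute per-dimension differences (Manhattan distance)
            l1norm += Math.abs(queryValue.doubleValue() - docVector[dim]);
            dim++;
        }
        return l1norm;
    }
}
```

The same length check would apply unchanged to a strict `l2norm`.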
||
| int dim = 0; | ||
| double l1norm = 0; | ||
| Iterator<Number> queryVectorIter = queryVector.iterator(); | ||
| while(dim < dims) { | ||
|
Member
nit: space
||
| l1norm += Math.abs(queryVectorIter.next().doubleValue() - docVector[dim]); | ||
| dim++; | ||
| } | ||
| return l1norm; | ||
| } | ||
|
|
||
| /** | ||
| * Calculate l2 norm - Euclidean distance | ||
| * between a query's dense vector and documents' dense vectors | ||
| * | ||
| * @param queryVector the query vector parsed as {@code List<Number>} from json | ||
| * @param dvs VectorScriptDocValues representing encoded documents' vectors | ||
| */ | ||
| public static double l2norm(List<Number> queryVector, VectorScriptDocValues.DenseVectorScriptDocValues dvs){ | ||
| BytesRef value = dvs.getEncodedValue(); | ||
| if (value == null) return 0; | ||
|
Member
same as above
||
| float[] docVector = VectorEncoderDecoder.decodeDenseVector(value); | ||
|
|
||
| int dims = Math.min(queryVector.size(), docVector.length); | ||
|
Member
Same here, as a user I think I'd like to get an error if the dimensions differ rather than get an "incomplete" result.
||
| int dim = 0; | ||
| double l2norm = 0; | ||
| Iterator<Number> queryVectorIter = queryVector.iterator(); | ||
| while(dim < dims) { | ||
|
Member
nit: space
||
| double diff = queryVectorIter.next().doubleValue() - docVector[dim]; | ||
| l2norm += diff * diff; | ||
| dim++; | ||
| } | ||
| return Math.sqrt(l2norm); | ||
| } | ||
|
|
||
| /** | ||
| * Calculate a dot product between a query's dense vector and documents' dense vectors | ||
| * | ||
|
|
@@ -100,6 +147,122 @@ private static double intDotProduct(List<Number> v1, float[] v2){ | |
|
|
||
|
|
||
| //**************FUNCTIONS FOR SPARSE VECTORS | ||
| /** | ||
| * Calculate l1 norm - Manhattan distance | ||
| * between a query's sparse vector and documents' sparse vectors | ||
| * | ||
| * L1NormSparse is implemented as a class to use | ||
| * painless script caching to prepare queryVector | ||
| * only once per script execution for all documents. | ||
| * A user will call `l1normSparse(params.queryVector, doc['my_vector'])` | ||
| */ | ||
| public static final class L1NormSparse { | ||
| final double[] queryValues; | ||
| final int[] queryDims; | ||
|
|
||
| // prepare queryVector once per script execution | ||
| // queryVector represents a map of dimensions to values | ||
| public L1NormSparse(Map<String, Number> queryVector) { | ||
| //break vector into two arrays dims and values | ||
| int n = queryVector.size(); | ||
| queryDims = new int[n]; | ||
| queryValues = new double[n]; | ||
| int i = 0; | ||
| for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) { | ||
| try { | ||
| queryDims[i] = Integer.parseInt(dimValue.getKey()); | ||
| } catch (final NumberFormatException e) { | ||
| throw new IllegalArgumentException("Failed to parse a query vector dimension, it must be an integer!", e); | ||
| } | ||
| queryValues[i] = dimValue.getValue().doubleValue(); | ||
| i++; | ||
| } | ||
| // Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions | ||
| sortSparseDimsDoubleValues(queryDims, queryValues, n); | ||
| } | ||
|
|
||
| public double l1normSparse(VectorScriptDocValues.SparseVectorScriptDocValues dvs) { | ||
| BytesRef value = dvs.getEncodedValue(); | ||
| if (value == null) return 0; | ||
|
Member
nit: block with brackets
||
| int[] docDims = VectorEncoderDecoder.decodeSparseVectorDims(value); | ||
| float[] docValues = VectorEncoderDecoder.decodeSparseVector(value); | ||
| int queryIndex = 0; | ||
| int docIndex = 0; | ||
| double l1norm = 0; | ||
| // find common dimensions among vectors v1 and v2 and calculate l1norm based on common dimensions | ||
| while (queryIndex < queryDims.length && docIndex < docDims.length) { | ||
| if (queryDims[queryIndex] == docDims[docIndex]) { | ||
| l1norm += Math.abs(queryValues[queryIndex] - docValues[docIndex]); | ||
| queryIndex++; | ||
| docIndex++; | ||
| } else if (queryDims[queryIndex] > docDims[docIndex]) { | ||
| docIndex++; | ||
| } else { | ||
| queryIndex++; | ||
| } | ||
|
Member
I have a general question about computing these norms for sparse vectors. The way this is implemented currently, it looks like it completely ignores dimensions that are present in only one of the two vectors. E.g. if a = [1:1, 2:1] and b = [3:1, 4:1] then the result is 0. I would expect it to be 4 though.
Contributor
Author
@cbuescher thanks, this is great feedback. I will adjust the code accordingly
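A rough sketch of what the adjusted merge could look like, treating a dimension that is missing from one vector as 0 in that vector. The names `SparseNorms` and `l1normSparse` below are hypothetical stand-ins that take already decoded arrays; both dimension arrays are assumed to be sorted in ascending order, as in the PR.

```java
final class SparseNorms {
    // Hypothetical sketch: L1 distance over the union of dimensions.
    // queryDims and docDims must be sorted in ascending order.
    static double l1normSparse(int[] queryDims, double[] queryValues,
                               int[] docDims, float[] docValues) {
        int queryIndex = 0;
        int docIndex = 0;
        double l1norm = 0;
        while (queryIndex < queryDims.length && docIndex < docDims.length) {
            if (queryDims[queryIndex] == docDims[docIndex]) {
                l1norm += Math.abs(queryValues[queryIndex] - docValues[docIndex]);
                queryIndex++;
                docIndex++;
            } else if (queryDims[queryIndex] > docDims[docIndex]) {
                // dimension exists only in the document vector
                l1norm += Math.abs(docValues[docIndex]);
                docIndex++;
            } else {
                // dimension exists only in the query vector
                l1norm += Math.abs(queryValues[queryIndex]);
                queryIndex++;
            }
        }
        // whatever remains in either vector has no counterpart in the other
        while (queryIndex < queryDims.length) {
            l1norm += Math.abs(queryValues[queryIndex++]);
        }
        while (docIndex < docDims.length) {
            l1norm += Math.abs(docValues[docIndex++]);
        }
        return l1norm;
    }
}
```

With a = [1:1, 2:1] and b = [3:1, 4:1] from the comment above, this version returns 4 rather than 0.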
||
| } | ||
| return l1norm; | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Calculate l2 norm - Euclidean distance | ||
| * between a query's sparse vector and documents' sparse vectors | ||
| * | ||
| * L2NormSparse is implemented as a class to use | ||
| * painless script caching to prepare queryVector | ||
| * only once per script execution for all documents. | ||
| * A user will call `l2normSparse(params.queryVector, doc['my_vector'])` | ||
| */ | ||
| public static final class L2NormSparse { | ||
| final double[] queryValues; | ||
| final int[] queryDims; | ||
|
|
||
| // prepare queryVector once per script execution | ||
| // queryVector represents a map of dimensions to values | ||
| public L2NormSparse(Map<String, Number> queryVector) { | ||
| //break vector into two arrays dims and values | ||
| int n = queryVector.size(); | ||
| queryDims = new int[n]; | ||
| queryValues = new double[n]; | ||
| int i = 0; | ||
| for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) { | ||
| try { | ||
| queryDims[i] = Integer.parseInt(dimValue.getKey()); | ||
| } catch (final NumberFormatException e) { | ||
| throw new IllegalArgumentException("Failed to parse a query vector dimension, it must be an integer!", e); | ||
| } | ||
| queryValues[i] = dimValue.getValue().doubleValue(); | ||
| i++; | ||
| } | ||
| // Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions | ||
| sortSparseDimsDoubleValues(queryDims, queryValues, n); | ||
|
Member
This looks almost identical to the constructor in the above class. Maybe this can be shared to a large extent.
Contributor
Author
@cbuescher Thanks, Christoph. Indeed I was also thinking how to share the code. But I am working under 2 constraints here:
Still, I will keep thinking how this code can be restructured.
Member
Sure, I didn't think about performance issues with boxing in lambdas. Does Painless prohibit e.g. L1NormSparse and L2NormSparse sharing a common abstract superclass, e.g. for sharing the common code in the constructor?
||
| } | ||
|
|
||
| public double l2normSparse(VectorScriptDocValues.SparseVectorScriptDocValues dvs) { | ||
| BytesRef value = dvs.getEncodedValue(); | ||
| if (value == null) return 0; | ||
| int[] docDims = VectorEncoderDecoder.decodeSparseVectorDims(value); | ||
| float[] docValues = VectorEncoderDecoder.decodeSparseVector(value); | ||
| int queryIndex = 0; | ||
| int docIndex = 0; | ||
| double l2norm = 0; | ||
| // find common dimensions among vectors v1 and v2 and calculate l1norm based on common dimensions | ||
| while (queryIndex < queryDims.length && docIndex < docDims.length) { | ||
| if (queryDims[queryIndex] == docDims[docIndex]) { | ||
| double diff = queryValues[queryIndex] - docValues[docIndex]; | ||
| l2norm += diff * diff; | ||
| queryIndex++; | ||
| docIndex++; | ||
| } else if (queryDims[queryIndex] > docDims[docIndex]) { | ||
| docIndex++; | ||
| } else { | ||
| queryIndex++; | ||
| } | ||
|
Member
Same general remarks as above. Also, the implementations of the two norms look very similar with the exception of how the vector diffs are treated (squared vs. just summing them). I think it would be worth trying to share huge parts of the function and maybe only use a differing lambda to do the different calculations.
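To make the suggestion concrete, here is one possible shape for the shared code. It is a sketch under assumptions, not the PR's implementation: `SharedSparseNorm` and `accumulate` are hypothetical names, the vectors are passed as decoded arrays sorted by dimension, and missing dimensions are treated as 0. The per-dimension term is a `java.util.function.DoubleUnaryOperator`, which works on primitive `double` and so does not box values.

```java
import java.util.function.DoubleUnaryOperator;

final class SharedSparseNorm {
    // Walks both sorted dimension arrays once and sums term(diff) over the
    // union of dimensions; a dimension missing from one vector counts as 0.
    static double accumulate(int[] qDims, double[] qValues,
                             int[] dDims, float[] dValues,
                             DoubleUnaryOperator term) {
        int q = 0;
        int d = 0;
        double sum = 0;
        while (q < qDims.length || d < dDims.length) {
            double diff;
            if (d == dDims.length || (q < qDims.length && qDims[q] < dDims[d])) {
                diff = qValues[q++];               // only in the query vector
            } else if (q == qDims.length || qDims[q] > dDims[d]) {
                diff = -dValues[d++];              // only in the document vector
            } else {
                diff = qValues[q++] - dValues[d++]; // present in both
            }
            sum += term.applyAsDouble(diff);
        }
        return sum;
    }

    // L1: sum of absolute differences
    static double l1(int[] qDims, double[] qValues, int[] dDims, float[] dValues) {
        return accumulate(qDims, qValues, dDims, dValues, Math::abs);
    }

    // L2: square root of the sum of squared differences
    static double l2(int[] qDims, double[] qValues, int[] dDims, float[] dValues) {
        return Math.sqrt(accumulate(qDims, qValues, dDims, dValues, diff -> diff * diff));
    }
}
```

Whether this fits the Painless class-binding constraints mentioned earlier in the thread is not something the sketch addresses.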
||
| } | ||
| return Math.sqrt(l2norm); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Calculate a dot product between a query's sparse vector and documents' sparse vectors | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,6 +25,8 @@ | |
| import org.elasticsearch.index.query.ScoreScriptUtils.CosineSimilarity; | ||
| import org.elasticsearch.index.query.ScoreScriptUtils.DotProductSparse; | ||
| import org.elasticsearch.index.query.ScoreScriptUtils.CosineSimilaritySparse; | ||
| import org.elasticsearch.index.query.ScoreScriptUtils.L1NormSparse; | ||
| import org.elasticsearch.index.query.ScoreScriptUtils.L2NormSparse; | ||
|
|
||
| import java.util.Arrays; | ||
| import java.util.HashMap; | ||
|
|
@@ -33,6 +35,9 @@ | |
|
|
||
| import static org.elasticsearch.index.mapper.VectorEncoderDecoderTests.mockEncodeDenseVector; | ||
| import static org.elasticsearch.index.query.ScoreScriptUtils.dotProduct; | ||
| import static org.elasticsearch.index.query.ScoreScriptUtils.l1norm; | ||
| import static org.elasticsearch.index.query.ScoreScriptUtils.l2norm; | ||
|
|
||
| import static org.mockito.Mockito.mock; | ||
| import static org.mockito.Mockito.when; | ||
|
|
||
|
|
@@ -53,6 +58,14 @@ public void testDenseVectorFunctions() { | |
| CosineSimilarity cosineSimilarity = new CosineSimilarity(queryVector); | ||
| double result2 = cosineSimilarity.cosineSimilarity(dvs); | ||
| assertEquals("cosineSimilarity result is not equal to the expected value!", 0.78, result2, 0.1); | ||
|
|
||
| // test l1Norm | ||
| double result3 = l1norm(queryVector, dvs); | ||
| assertEquals("l1norm result is not equal to the expected value!", 485.18, result3, 0.1); | ||
|
|
||
| // test l2norm | ||
| double result4 = l2norm(queryVector, dvs); | ||
| assertEquals("l2norm result is not equal to the expected value!", 301.36, result4, 0.1); | ||
|
Member
I'd like to see an additional test for differing vector lengths, probably asserting that this throws an error if we decide to go that route.
||
| } | ||
|
|
||
| public void testSparseVectorFunctions() { | ||
|
|
@@ -78,5 +91,15 @@ public void testSparseVectorFunctions() { | |
| CosineSimilaritySparse cosineSimilaritySparse = new CosineSimilaritySparse(queryVector); | ||
| double result2 = cosineSimilaritySparse.cosineSimilaritySparse(dvs); | ||
| assertEquals("cosineSimilaritySparse result is not equal to the expected value!", 0.78, result2, 0.1); | ||
|
|
||
| // test l1norm | ||
| L1NormSparse l1Norm = new L1NormSparse(queryVector); | ||
| double result3 = l1Norm.l1normSparse(dvs); | ||
| assertEquals("l1normSparse result is not equal to the expected value!", 485.18, result3, 0.1); | ||
|
|
||
| // test l2norm | ||
| L2NormSparse l2Norm = new L2NormSparse(queryVector); | ||
| double result4 = l2Norm.l2normSparse(dvs); | ||
| assertEquals("l2normSparse result is not equal to the expected value!", 301.36, result4, 0.1); | ||
|
Member
These tests don't cover the cases mentioned above where the queryVector contains dimensions not present in the document vector and vice versa.
||
| } | ||
| } | ||
nit: not exactly sure since I'm not a native speaker, but I would expect "the `l2norm` shown below".