Commit 8125168

hhbyyh authored and mengxr committed
[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384

Currently `Vectors.sqdist` returns inconsistent results for sparse/dense vectors when the vectors have different lengths; please refer to the JIRA for a sample.

PR scope: Unify the sqdist logic for dense/sparse vectors and fix the inconsistency; also remove the possible sparse-to-dense conversion in the original code.

For reviewers: Maybe we should first discuss what the correct behavior is.
1. Must vectors for sqdist have the same length, as in breeze?
2. If they can have different lengths, what is the correct result for sqdist? (Should the extra part be included in the calculation?)

I'll update the PR with more optimization and additional unit tests afterwards. Thanks.

Author: Yuhao Yang <[email protected]>

Closes apache#4183 from hhbyyh/fixDouble and squashes the following commits:

1f17328 [Yuhao Yang] limit PR scope to size constraints only
54cbf97 [Yuhao Yang] fix Vectors.sqdist inconsistence
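The final squashed commit limits the scope to a size constraint, so mismatched lengths are rejected rather than given a defined distance. A minimal standalone sketch of that contract follows. It assumes plain `Array[Double]` values as stand-ins for MLlib vectors, and the `SqdistSizeCheck` object and its dense-only `sqdist` helper are hypothetical names for illustration, not the MLlib implementation.

```scala
import scala.util.Try

object SqdistSizeCheck {
  // Hypothetical dense-only helper: plain arrays stand in for MLlib vectors.
  def sqdist(a: Array[Double], b: Array[Double]): Double = {
    // Same precondition the commit adds: reject vectors of different sizes.
    require(a.length == b.length, "vector dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    // (1 - 4)^2 + (2 - 6)^2 = 9 + 16
    println(sqdist(Array(1.0, 2.0), Array(4.0, 6.0))) // prints 25.0
    // Mismatched lengths fail fast: require throws IllegalArgumentException.
    println(Try(sqdist(Array(1.0, 2.0), Array(1.0))).isFailure) // prints true
  }
}
```

Failing fast avoids having to decide whether the extra elements of the longer vector should count toward the distance, which is the open question raised for reviewers.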
1 parent 8df9435 commit 8125168

1 file changed: +6 -5 lines changed

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

Lines changed: 6 additions & 5 deletions
@@ -333,29 +333,30 @@ object Vectors {
       math.pow(sum, 1.0 / p)
     }
   }
-
+
   /**
    * Returns the squared distance between two Vectors.
    * @param v1 first Vector.
    * @param v2 second Vector.
    * @return squared distance between two Vectors.
    */
   def sqdist(v1: Vector, v2: Vector): Double = {
+    require(v1.size == v2.size, "vector dimension mismatch")
     var squaredDistance = 0.0
-    (v1, v2) match {
+    (v1, v2) match {
       case (v1: SparseVector, v2: SparseVector) =>
         val v1Values = v1.values
         val v1Indices = v1.indices
         val v2Values = v2.values
         val v2Indices = v2.indices
         val nnzv1 = v1Indices.size
         val nnzv2 = v2Indices.size
-
+
         var kv1 = 0
         var kv2 = 0
         while (kv1 < nnzv1 || kv2 < nnzv2) {
           var score = 0.0
-
+
           if (kv2 >= nnzv2 || (kv1 < nnzv1 && v1Indices(kv1) < v2Indices(kv2))) {
             score = v1Values(kv1)
             kv1 += 1
@@ -397,7 +398,7 @@ object Vectors {
     val nnzv1 = indices.size
     val nnzv2 = v2.size
     var iv1 = if (nnzv1 > 0) indices(kv1) else -1
-
+
     while (kv2 < nnzv2) {
       var score = 0.0
       if (kv2 != iv1) {
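
The sparse/sparse case in the first hunk walks both index arrays with two cursors, and the diff only shows the first branch of that loop. The sketch below is one way the full merge could look under stated assumptions: the `Sparse` case class and `SparseSqdistSketch` object are hypothetical stand-ins, indices are assumed to be sorted and unique, and the two branches after the first are filled in by analogy rather than copied from the source.

```scala
object SparseSqdistSketch {
  // Hypothetical minimal sparse vector: logical size, sorted indices, matching values.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  def sqdist(a: Sparse, b: Sparse): Double = {
    // Size check as added by the commit.
    require(a.size == b.size, "vector dimension mismatch")
    val na = a.indices.length
    val nb = b.indices.length
    var sum = 0.0
    var ka = 0
    var kb = 0
    while (ka < na || kb < nb) {
      var score = 0.0
      if (kb >= nb || (ka < na && a.indices(ka) < b.indices(kb))) {
        // Index stored only in a: b is implicitly 0 there.
        score = a.values(ka)
        ka += 1
      } else if (ka >= na || b.indices(kb) < a.indices(ka)) {
        // Index stored only in b (assumed branch): a is implicitly 0 there.
        score = b.values(kb)
        kb += 1
      } else {
        // Same index stored in both (assumed branch): take the difference.
        score = a.values(ka) - b.values(kb)
        ka += 1
        kb += 1
      }
      sum += score * score
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val a = Sparse(4, Array(0, 2), Array(1.0, 3.0)) // dense form: [1, 0, 3, 0]
    val b = Sparse(4, Array(2, 3), Array(1.0, 2.0)) // dense form: [0, 0, 1, 2]
    println(sqdist(a, b)) // 1 + 0 + 4 + 4, prints 9.0
  }
}
```

Because only stored entries are visited, the loop never touches positions past the last stored index, so without the `require` a trailing size difference between two sparse vectors would go unnoticed by this kind of merge.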
