@@ -77,13 +77,11 @@ object LinearDataGenerator {
      nPoints: Int,
      seed: Int,
      eps: Double = 0.1): Seq[LabeledPoint] = {
    generateLinearInput(intercept, weights,
      Array.fill[Double](weights.length)(0.0),
      Array.fill[Double](weights.length)(1.0 / 3.0),
      nPoints, seed, eps)}
    generateLinearInput(intercept, weights, Array.fill[Double](weights.length)(0.0),
      Array.fill[Double](weights.length)(1.0 / 3.0), nPoints, seed, eps)
  }

/**
*
* @param intercept Data intercept
* @param weights Weights to be applied.
* @param xMean the mean of the generated features. Lots of time, if the features are not properly
@@ -104,24 +102,66 @@ object LinearDataGenerator {
nPoints: Int,
seed: Int,
eps: Double): Seq[LabeledPoint] = {
generateLinearInput(intercept, weights, xMean, xVariance, nPoints, seed, eps, 0.0)
}


  /**
   * @param intercept Data intercept
   * @param weights Weights to be applied.
   * @param xMean the mean of the generated features. Often, if the features are not properly
   *              standardized, a poorly implemented algorithm will have difficulty
   *              converging.
   * @param xVariance the variance of the generated features.
   * @param nPoints Number of points in sample.
   * @param seed Random seed
   * @param eps Epsilon scaling factor.
   * @param sparsity The ratio of zero elements. If it is 0.0, LabeledPoints with
   *                 DenseVector are returned.
   * @return Seq of input.
   */
Member: How about consolidating this with LinearDataGenerator, and adding sparsity = 1.0 as a parameter to control whether the features are sparse?

Contributor Author: Yes, I also thought it was a good idea. But LinearDataGenerator is used as a static object, so we would have to pass sparsity as a parameter to generateLinearInput. That method is used in a lot of test suites, so many call sites would have to change.
Therefore it might be better to do this in a separate JIRA. What do you think?

Member: Let's modify the JIRA and do it here. Basically, you can keep an overload with the old signature that calls the new API for compatibility.

  @Since("1.6.0")
  def generateLinearInput(
      intercept: Double,
      weights: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double,
      sparsity: Double): Seq[LabeledPoint] = {
    require(0.0 <= sparsity && sparsity <= 1.0)
    val rnd = new Random(seed)
    val x = Array.fill[Array[Double]](nPoints)(
      Array.fill[Double](weights.length)(rnd.nextDouble()))

    val sparseRnd = new Random(seed)
    x.foreach { v =>
Member: Once you have sparsity, randomly choose n = numFeatures * (1 - sparsity) features to be non-zero, and zero the rest out.

Member: You can also add variance to the sparsity so that the number of non-zeros is not constant.
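The reviewer's suggestion differs from the patch as written: rather than zeroing each feature independently with probability `sparsity` (so the non-zero count varies per point), it would pick exactly `numFeatures * (1 - sparsity)` indices to keep. A standalone sketch of that alternative, assuming a hypothetical helper `zeroOutExact` that is not part of the PR:

```scala
import scala.util.Random

// Hypothetical helper sketching the reviewer's idea: keep exactly
// numFeatures * (1 - sparsity) features non-zero, zero the rest out.
def zeroOutExact(v: Array[Double], sparsity: Double, rnd: Random): Array[Double] = {
  val numNonZero = math.round(v.length * (1.0 - sparsity)).toInt
  // Randomly choose which indices survive.
  val keep = rnd.shuffle(v.indices.toList).take(numNonZero).toSet
  v.indices.map(i => if (keep.contains(i)) v(i) else 0.0).toArray
}
```

The follow-up comment about adding "variance of sparsity" would relax this again, jittering `numNonZero` so the count is not constant across points.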

      var i = 0
      val len = v.length
      while (i < len) {
        v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
        if (sparseRnd.nextDouble() < sparsity) {
          v(i) = 0.0
        } else {
          v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
        }
        i += 1
      }
    }
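The `(v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)` transform works because a Uniform(0, 1) draw has mean 0.5 and variance 1/12, so centering and rescaling by `sqrt(12 * targetVariance)` yields the requested mean and variance. A quick standalone check (the 2.0 and 1/3 targets are arbitrary example values, not from the PR):

```scala
import scala.util.Random

val rnd = new Random(7)
val targetMean = 2.0
val targetVariance = 1.0 / 3.0
// Apply the same transform the generator uses to many uniform draws.
val xs = Array.fill(100000)(
  (rnd.nextDouble() - 0.5) * math.sqrt(12.0 * targetVariance) + targetMean)
val mean = xs.sum / xs.length
val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.length
// mean and variance should land close to the targets.
```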

    val y = x.map { xi =>
      blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
    }
    y.zip(x).map(p => LabeledPoint(p._1, Vectors.dense(p._2)))

Member: To simplify the following code, do

    y.zip(x).map { p =>
      if (sparsity == 0.0) {
        LabeledPoint(p._1, Vectors.dense(p._2))
      } else {
        LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
      }
    }

    y.zip(x).map { p =>
      if (sparsity == 0.0) {
        // Return LabeledPoints with DenseVector
        LabeledPoint(p._1, Vectors.dense(p._2))
      } else {
        // Return LabeledPoints with SparseVector
        LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
      }
    }
  }
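Putting the pieces together, the new method's behavior can be sketched without Spark. The following is a hypothetical, self-contained replica (plain `(label, features)` tuples stand in for `LabeledPoint`, and a plain dot product stands in for `blas.ddot`), useful for seeing how `sparsity` interacts with the label: zeroed features contribute nothing, so at `sparsity = 1.0` every label collapses to `intercept` plus noise.

```scala
import scala.util.Random

// Standalone sketch of the patched generateLinearInput (not Spark code).
def generate(
    intercept: Double,
    weights: Array[Double],
    xMean: Array[Double],
    xVariance: Array[Double],
    nPoints: Int,
    seed: Int,
    eps: Double,
    sparsity: Double): Seq[(Double, Array[Double])] = {
  require(0.0 <= sparsity && sparsity <= 1.0)
  val rnd = new Random(seed)
  val sparseRnd = new Random(seed)
  Seq.fill(nPoints) {
    val x = Array.fill(weights.length)(rnd.nextDouble())
    var i = 0
    while (i < x.length) {
      if (sparseRnd.nextDouble() < sparsity) {
        x(i) = 0.0  // zeroed with probability `sparsity`
      } else {
        x(i) = (x(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
      }
      i += 1
    }
    // Label = w . x + intercept + Gaussian noise scaled by eps.
    val y = x.zip(weights).map { case (xi, w) => xi * w }.sum +
      intercept + eps * rnd.nextGaussian()
    (y, x)
  }
}
```

At `sparsity = 1.0` and `eps = 0.0`, every generated label equals the intercept exactly, which is a handy sanity check on the zeroing logic.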

/**