Conversation

@attilapiros
Contributor

@attilapiros attilapiros commented Feb 27, 2018

What changes were proposed in this pull request?

Adds structured streaming tests using testTransformer for these suites:

  • NGramSuite
  • NormalizerSuite
  • OneHotEncoderEstimatorSuite
  • OneHotEncoderSuite
  • PCASuite
  • PolynomialExpansionSuite
  • QuantileDiscretizerSuite
  • RFormulaSuite
  • SQLTransformerSuite
  • StandardScalerSuite
  • StopWordsRemoverSuite
  • StringIndexerSuite
  • TokenizerSuite
  • RegexTokenizerSuite
  • VectorAssemblerSuite
  • VectorIndexerSuite
  • VectorSizeHintSuite
  • VectorSlicerSuite
  • Word2VecSuite

How was this patch tested?

They are unit tests.

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87732 has finished for PR 20686 at commit 4099c85.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class NGramSuite extends MLTest with DefaultReadWriteTest
  • class NormalizerSuite extends MLTest with DefaultReadWriteTest
  • class OneHotEncoderEstimatorSuite extends MLTest with DefaultReadWriteTest
  • class NumericTypeWithEncoder[A](val numericType: NumericType)
  • class NumericTypeWithEncoder[A](val numericType: NumericType)
  • class PCASuite extends MLTest with DefaultReadWriteTest
  • class PolynomialExpansionSuite extends MLTest with DefaultReadWriteTest
  • class QuantileDiscretizerSuite extends MLTest with DefaultReadWriteTest
  • class SQLTransformerSuite extends MLTest with DefaultReadWriteTest
  • class StandardScalerSuite extends MLTest with DefaultReadWriteTest
  • class StopWordsRemoverSuite extends MLTest with DefaultReadWriteTest
  • class StringIndexerSuite extends MLTest with DefaultReadWriteTest
  • class TokenizerSuite extends MLTest with DefaultReadWriteTest
  • class RegexTokenizerSuite extends MLTest with DefaultReadWriteTest
  • class VectorIndexerSuite extends MLTest with DefaultReadWriteTest with Logging
  • class VectorSlicerSuite extends MLTest with DefaultReadWriteTest
  • class Word2VecSuite extends MLTest with DefaultReadWriteTest

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87734 has finished for PR 20686 at commit bc7946c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

Ignored tests where issues were found during streaming:

  • OneHotEncoderSuite / "input column without ML attribute"
  • RFormulaSuite / "label column already exists but is not numeric type"
  • VectorAssemblerSuite / "VectorAssembler"
  • VectorAssemblerSuite / "ML attributes"

New JIRA issues can be created for these problems once my PR is accepted.

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87780 has finished for PR 20686 at commit 836a173.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

cc @WeichenXu123

@WeichenXu123
Contributor

Thanks! I will help review it later.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Partly reviewed. Thanks!

Vectors.dense(0.6, -1.1, -3.0),
Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
Vectors.sparse(3, Seq()))
Contributor

I prefer to initialize the data variable in beforeAll.
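
A minimal sketch of the suggested pattern (the suite name and import paths here are illustrative, not the exact PR code):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.util.MLTest

class NormalizerSuiteSketch extends MLTest {
  @transient var data: Array[Vector] = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    // Shared test data is created once the Spark session is up.
    data = Array(
      Vectors.dense(0.6, -1.1, -3.0),
      Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))))
  }
}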

Contributor Author

First of all, thanks for the review.

Regarding this specific comment: this way it can be a 'val', and the variable name and value are close to each other. What is the advantage of separating them?

Contributor

My only doubt is that when the test suite object is serialized and then deserialized, the data will be lost. But I am not sure in which cases serialization occurs.

Contributor Author

@attilapiros attilapiros Mar 2, 2018

But '@transient' is about skipping serialization for this field.

Contributor

OK, it's a minor issue; let's ignore it.

Member

I'd prefer to revert these changes. As far as I know, nothing is broken, and this is a common pattern used in many parts of MLlib tests.

I think the main reason to move data around would be to have actual + expected values side-by-side for easier reading.

assert(group.size === 2)
assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0))
assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
}
Contributor

I think for streaming we don't need to test attribute-related functionality, so this part can just keep the old testing code.

Contributor Author

I am just wondering whether it is a good idea to revert the attribute tests, as they are working and checking streaming. Are there any disadvantages to keeping them? Can you please go into the details of why they are not needed?

Contributor

Discussed with @jkbradley. Agreed with you.

Contributor Author

Thanks.

assert(group.size === 2)
assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("0").withIndex(0))
assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("1").withIndex(1))
}
Contributor

ditto.

new NumericTypeWithEncoder[Float](FloatType),
new NumericTypeWithEncoder[Byte](ByteType),
new NumericTypeWithEncoder[Double](DoubleType),
new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder()))
Contributor

Why not use Seq(ShortType, LongType, ...)?

Contributor Author

@attilapiros attilapiros Mar 1, 2018

The reason is that we cannot pass runtime values (ShortType, LongType, ...) as a generic parameter to the testTransformer function. But luckily the context bound is resolved to an implicit parameter: this is t.encoder, which is passed as the last parameter.
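
To make the mechanism concrete, here is a minimal sketch (class shape and names assumed, not the exact PR code) of how a context bound captures the encoder so it can later be forwarded explicitly:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.types._

object EncoderCaptureSketch {
  // The context bound [A: Encoder] desugars to an implicit Encoder[A]
  // parameter, captured here next to the runtime DataType.
  class NumericTypeWithEncoder[A: Encoder](val numericType: NumericType) {
    val encoder: Encoder[A] = implicitly[Encoder[A]]
  }

  // The encoder can also be supplied explicitly in place of the implicit:
  val numericTypes = Seq(
    new NumericTypeWithEncoder[Short](ShortType)(Encoders.scalaShort),
    new NumericTypeWithEncoder[Long](LongType)(Encoders.scalaLong))
}

Each element then carries both the runtime type and its encoder, which is why the test loop can call testTransformer(...)(t.encoder).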

Contributor

Oh I see. This is a syntax issue: testTransformer needs the generic parameter. When I designed the testTransformer helper function, I could not eliminate the generic parameter, which makes things difficult.
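
For reference, the rough shape of the helper being discussed (paraphrased; the real signature may differ in detail):

import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{DataFrame, Encoder, Row}

trait TestTransformerShapeSketch {
  // The Encoder is what forces the generic parameter: the batch DataFrame
  // has to be replayed as a typed memory stream for the streaming check.
  def testTransformer[A: Encoder](
      dataframe: DataFrame,
      transformer: Transformer,
      firstResultCol: String,
      otherResultCols: String*)(
      checkFunction: Row => Unit): Unit
}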

testTransformer(dfWithTypes, model, "output", "expected") {
  case Row(output: Vector, expected: Vector) =>
    assert(output === expected)
}(t.encoder)
Contributor

Why is (t.encoder) needed here?

Contributor Author

See previous comment.

new NumericTypeWithEncoder[Float](FloatType),
new NumericTypeWithEncoder[Byte](ByteType),
new NumericTypeWithEncoder[Double](DoubleType),
new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder()))
Contributor

ditto.

(1.0, 4.0, 9.0),
(1.0, 4.0, 9.0)
).toDF("result1", "result2", "result3")
.collect().toSeq
Contributor

What about using:

val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()

That would avoid hardcoding the big array.

Contributor Author

I think having a slightly bigger array in the test is better than checking one df.transform result against another df.transform result (as the testTransformer function uses df.transform for the DataFrame tests).

Contributor

@WeichenXu123 WeichenXu123 Mar 6, 2018

But I prefer to avoid hardcoding a big literal array so that the code is easier to maintain, and I think the following code is enough:

val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()
testTransformerByGlobalCheckFunc[(Double, Double, Double)](
  df, plForMultiCols,
  "result1", "result2", "result3") { rows =>
  assert(rows === expected)
}

There is a similar case here #20121 (comment)

Contributor Author

@attilapiros attilapiros Mar 6, 2018

OK, I'll change it soon.

(9.0, 9.0, 9.0),
(9.0, 9.0, 9.0),
(9.0, 9.0, 9.0)
).toDF("result1", "result2", "result3")
Contributor

ditto.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Partly reviewed (reached StandardScalerSuite).

for (expectedAttributeGroup <- expectedAttributes) {
val attributeGroup =
AttributeGroup.fromStructField(rows.head.schema(expectedAttributeGroup.name))
assert(attributeGroup == expectedAttributeGroup)
Contributor

Should we use === instead?

expected: DataFrame,
expectedAttributes: AttributeGroup*): Unit = {
val resultSchema = formulaModel.transformSchema(dataframe.schema)
assert(resultSchema.json == expected.schema.json)
Contributor

You compare schema.json instead of schema.toString. Are you sure they have the same effect?

Contributor Author

I know they are different, but the 'schema.json'-based comparison is more restrictive: it contains the metadata as well.
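
A small spark-shell-style illustration of the point (the field and metadata names here are made up):

import org.apache.spark.sql.types._

val meta = new MetadataBuilder().putString("ml_attr", "example").build()
val schema = StructType(Seq(StructField("label", DoubleType, false, meta)))
println(schema.json) // the "ml_attr" metadata entry appears in the output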

("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
).toDF("id", "a", "b", "features", "label")
// assert(result.schema.toString == resultSchema.toString)
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Contributor

nit: indent

Contributor Author

@attilapiros attilapiros Mar 2, 2018

It was at the level of the val plus 2 extra spaces. Should I align the dots in the same column?

Thanks for the help.

Contributor

I think maybe

...
).toDF("id", "a", "b", "features", "label")
 .select($"id", ...

looks beautiful.

Contributor Author

I am sorry for spending time on this issue, but I would like to be consistent and keep to the rules, so what about the following:

...
)
  .toDF("id", "a", "b", "features", "label")
  .select($"id", ...

So everything is indented by two spaces and the dots are aligned. Could you accept this?

Contributor

I am also confused about the alignment rule. @jkbradley, what do you think?

Member

I'd just keep what you have now. There isn't a great solution here, and what you have fits other code examples in MLlib.

(1, Vectors.dense(0.0, 1.0), Vectors.dense(0.0, 1.0), 1.0),
(2, Vectors.dense(1.0, 2.0), Vectors.dense(1.0, 2.0), 2.0)
).toDF("id", "vec2", "features", "label")
.select($"id", $"vec2".as("vec2", metadata), $"features", $"label")
Contributor

nit: indent

(1, "foo", "zq", Vectors.dense(0.0, 1.0), 0.0),
(2, "bar", "zq", Vectors.dense(1.0, 2.0), 0.0)
).toDF("id", "a", "b", "features", "label")
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Contributor

nit: indent

(2, "bar", "zq", Vectors.dense(1.0, 0.0, 2.0), 0.0),
(3, "bar", "zy", Vectors.dense(1.0, 0.0, 3.0), 2.0)
).toDF("id", "a", "b", "features", "label")
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Contributor

nit: indent

df,
sqlTrans,
"id1") { rows =>
assert(df.storageLevel != StorageLevel.NONE)
Contributor

Moving assert(df.storageLevel != StorageLevel.NONE) here seems meaningless, because you do not use the rows parameter.

Contributor Author

Thanks, I'll change it.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Review done. Thanks for updating so much testing code!

expectedMessagePart: String,
firstResultCol: String) {

def hasExpectedMessage(exception: Throwable): Boolean =
Contributor

I wonder whether the check here is too strict. It requires an exact message match, so when some class modifies its exception message, many test cases will fail.
Or can we just check the exception type?

Contributor Author

It uses contains. I would keep this behaviour, as the test is more expressive this way.
@jkbradley?

Member

Since most other tests check parts of the message, I'm OK with this setup. When we don't think the message will remain stable, we can pass an empty string for expectedMessagePart.

Member

Just curious: Did you have to add the getCause case because of streaming throwing wrapped exceptions?

Contributor Author

Yes, that was the reason.
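
A sketch of the resulting check (the helper name and the expectedMessagePart parameter come from the diff above; the body is assumed): streaming surfaces failures wrapped in another exception, e.g. a StreamingQueryException, so the expected message may sit one or more causes deep.

// Walks the cause chain, closing over the enclosing expectedMessagePart
// parameter shown in the diff; getMessage is null-checked defensively.
def hasExpectedMessage(exception: Throwable): Boolean =
  (exception.getMessage != null &&
    exception.getMessage.contains(expectedMessagePart)) ||
    (exception.getCause != null && hasExpectedMessage(exception.getCause))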

@SparkQA

SparkQA commented Mar 2, 2018

Test build #87904 has finished for PR 20686 at commit 4944c62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2018

Test build #88023 has finished for PR 20686 at commit 7a14154.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks for the PR @attilapiros and @WeichenXu123 for the review! I'll take a look now.

Member

@jkbradley jkbradley left a comment

Running out of time to do a complete review now, but I'll leave initial comments and continue later.

.setInputCol("inputTokens")
.setOutputCol("nGrams")
val dataset = Seq(NGramTestData(
val dataFrame = Seq(NGramTestData(
Member

These kinds of changes are not necessary and make the PR a lot longer. Would you mind reverting them?

Contributor Author

ok

@transient var data: Array[Vector] = _
@transient var dataFrame: DataFrame = _
@transient var normalizer: Normalizer = _
Member

I will say, though, that I'm happy with moving Normalizer into individual tests. It's weird how it is shared here since it's mutated within tests.

Contributor Author

done

@jkbradley
Member

I'll do a complete review now!

@SparkQA

SparkQA commented Mar 9, 2018

Test build #88137 has finished for PR 20686 at commit 80b9c8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2018

Test build #88139 has finished for PR 20686 at commit a5375bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@jkbradley jkbradley left a comment

Thanks for the updates! I finished a detailed pass.


test("input column without ML attribute") {

ignore("input column without ML attribute") {
Member

Let's keep the test but limit it to batch. People should switch to OneHotEncoderEstimator anyways.

assert(rowForSingle.getDouble(0) == rowForMultiCols.getDouble(0) &&
rowForSingle.getDouble(1) == rowForMultiCols.getDouble(1) &&
rowForSingle.getDouble(2) == rowForMultiCols.getDouble(2))
testTransformerByGlobalCheckFunc[(Double, Double, Double)](
Member

I'd remove this. Testing vs. multiCol is already testing batch vs streaming. No need to test singleCol against itself.

assert(rows === expected)
}

testTransformerByGlobalCheckFunc[(Double, Double, Double)](
Member

Is this a repeat of the test just above?

("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
).toDF("id", "a", "b", "features", "label")
// assert(result.schema.toString == resultSchema.toString)
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just keep what you have now. There isn't a great solution here, and what you have fits other code examples in MLlib.

assert(attrSkip.values.get === Array("b", "a"))
// Verify that we skip the c record
// a -> 1, b -> 0
val expectedSkip = Seq((0, 1.0), (1, 0.0)).toDF()
Member

This can be moved outside of the testTransformerByGlobalCheckFunc method.

}

test("ML attributes") {
ignore("ML attributes") {
Member

ditto: do not ignore


model.transform(densePoints1) // should work
model.transform(sparsePoints1) // should work
// should work
Member

We can remove "should work" comments : P

intercept[AssertionError] {
model.transform(densePoints2).collect()
logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
withClue("Did not found expected error message when fit, " +
Member

found -> find

Member

Or just use the original text

intercept[SparkException] {
model.transform(densePoints2.repartition(2)).collect()
logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
withClue("Did not found expected error message when fit, " +
Member

ditto


vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty)
validateResults(vectorSlicer.transform(df))
testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")(
Member

Avoid using a global check function when you don't need to. It'd be better to use testTransformer() since the test is per-row.

Contributor Author

@attilapiros attilapiros Mar 13, 2018

The reason I chose the global check function is the attribute checks:

      val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
      val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
      assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
      resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
        assert(a === b)
      }

This part is not row-based but rather result-set-based.

Contributor

ok.

Member

I see, makes sense.

@SparkQA

SparkQA commented Mar 13, 2018

Test build #88214 has finished for PR 20686 at commit bf713b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks for the updates & all the work this PR took @attilapiros and for the review @WeichenXu123 !
LGTM
Merging with master

@jkbradley
Member

Merging to branch-2.3 too

asfgit pushed a commit that referenced this pull request Mar 15, 2018

Author: “attilapiros” <[email protected]>

Closes #20686 from attilapiros/SPARK-22915.

(cherry picked from commit 279b3db)
Signed-off-by: Joseph K. Bradley <[email protected]>
@asfgit asfgit closed this in 279b3db Mar 15, 2018
@attilapiros
Contributor Author

I am checking.

@attilapiros
Contributor Author

@dongjoon-hyun It seems to me that on 2.3, during streaming, if an exception happens within an ML feature, the feature-generated message is not on the direct 'caused by' exception but one level deeper. If I am right, I can create a separate PR for 2.3 with the correction. To be continued...

@dongjoon-hyun
Member

Thanks, @attilapiros. You can test your PR if you put [BRANCH-2.3] into your title.

@attilapiros
Contributor Author

@dongjoon-hyun The JIRA is SPARK-23728 and the PR is #20852. Could you please enable Jenkins on that?

@dongjoon-hyun
Member

Thanks. For branch testing, I was confused with Maven testing. Since you created the PR against branch-2.3, it looks okay.

mstewart141 pushed a commit to mstewart141/spark that referenced this pull request Mar 24, 2018

Author: “attilapiros” <[email protected]>

Closes apache#20686 from attilapiros/SPARK-22915.
@attilapiros attilapiros deleted the SPARK-22915 branch April 26, 2018 20:07
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018

Author: “attilapiros” <[email protected]>

Closes apache#20686 from attilapiros/SPARK-22915.

(cherry picked from commit 279b3db)
Signed-off-by: Joseph K. Bradley <[email protected]>