Conversation

@attilapiros
Contributor

@attilapiros attilapiros commented Feb 27, 2018

What changes were proposed in this pull request?

Adds structured streaming tests using testTransformer for these suites:

  • NGramSuite
  • NormalizerSuite
  • OneHotEncoderEstimatorSuite
  • OneHotEncoderSuite
  • PCASuite
  • PolynomialExpansionSuite
  • QuantileDiscretizerSuite
  • RFormulaSuite
  • SQLTransformerSuite
  • StandardScalerSuite
  • StopWordsRemoverSuite
  • StringIndexerSuite
  • TokenizerSuite
  • RegexTokenizerSuite
  • VectorAssemblerSuite
  • VectorIndexerSuite
  • VectorSizeHintSuite
  • VectorSlicerSuite
  • Word2VecSuite

How was this patch tested?

They are unit tests.

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87732 has finished for PR 20686 at commit 4099c85.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class NGramSuite extends MLTest with DefaultReadWriteTest
  • class NormalizerSuite extends MLTest with DefaultReadWriteTest
  • class OneHotEncoderEstimatorSuite extends MLTest with DefaultReadWriteTest
  • class NumericTypeWithEncoder[A](val numericType: NumericType)
  • class NumericTypeWithEncoder[A](val numericType: NumericType)
  • class PCASuite extends MLTest with DefaultReadWriteTest
  • class PolynomialExpansionSuite extends MLTest with DefaultReadWriteTest
  • class QuantileDiscretizerSuite extends MLTest with DefaultReadWriteTest
  • class SQLTransformerSuite extends MLTest with DefaultReadWriteTest
  • class StandardScalerSuite extends MLTest with DefaultReadWriteTest
  • class StopWordsRemoverSuite extends MLTest with DefaultReadWriteTest
  • class StringIndexerSuite extends MLTest with DefaultReadWriteTest
  • class TokenizerSuite extends MLTest with DefaultReadWriteTest
  • class RegexTokenizerSuite extends MLTest with DefaultReadWriteTest
  • class VectorIndexerSuite extends MLTest with DefaultReadWriteTest with Logging
  • class VectorSlicerSuite extends MLTest with DefaultReadWriteTest
  • class Word2VecSuite extends MLTest with DefaultReadWriteTest

@SparkQA

SparkQA commented Feb 27, 2018

Test build #87734 has finished for PR 20686 at commit bc7946c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

Ignored tests where issues were found during streaming:

  • OneHotEncoderSuite / "input column without ML attribute"
  • RFormulaSuite / "label column already exists but is not numeric type"
  • VectorAssemblerSuite / "VectorAssembler"
  • VectorAssemblerSuite / "ML attributes"

New JIRA issues can be created for these problems once my PR is accepted.

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87780 has finished for PR 20686 at commit 836a173.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

cc @WeichenXu123

@WeichenXu123
Contributor

Thanks! I will help review it later.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Partly reviewed. Thanks!

Vectors.dense(0.6, -1.1, -3.0),
Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
Vectors.sparse(3, Seq()))
Contributor

I prefer to initialize the data variable in beforeAll.
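
A minimal sketch of the suggested pattern (the suite name and import paths here are illustrative, not the exact PR code):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.util.MLTest

class NormalizerSuiteSketch extends MLTest {
  @transient var data: Array[Vector] = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    // Shared test data is created once the Spark session is up.
    data = Array(
      Vectors.dense(0.6, -1.1, -3.0),
      Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))))
  }
}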

Contributor Author

First of all, thanks for the review.

Regarding this specific comment: this way it can be a 'val', and the variable name and value are close to each other. What is the advantage of separating them?

Contributor

My only doubt is that when the test suite object is serialized and then deserialized, the data will be lost. But I am not sure in which cases serialization occurs.

Contributor Author

@attilapiros attilapiros Mar 2, 2018

But '@transient' is about skipping serialization for this field.

Contributor

OK, it's a minor issue; let's ignore it.

Member

I'd prefer to revert these changes. As far as I know, nothing is broken, and this is a common pattern used in many parts of MLlib tests.

I think the main reason to move data around would be to have actual + expected values side-by-side for easier reading.

assert(group.size === 2)
assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0))
assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
}
Contributor

I think for streaming we don't need to test attribute-related functionality, so this part can just keep the old testing code.

Contributor Author

I am just wondering whether it is a good idea to revert the attribute tests, as they are working and checking streaming. Are there any disadvantages to keeping them? Can you please go into the details of why they are not needed?

Contributor

Discussed with @jkbradley. Agreed with you.

Contributor Author

Thanks.

assert(group.size === 2)
assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("0").withIndex(0))
assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("1").withIndex(1))
}
Contributor

ditto.

new NumericTypeWithEncoder[Float](FloatType),
new NumericTypeWithEncoder[Byte](ByteType),
new NumericTypeWithEncoder[Double](DoubleType),
new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder()))
Contributor

Why not use Seq(ShortType, LongType, ...)?

Contributor Author

@attilapiros attilapiros Mar 1, 2018

The reason is that we cannot pass runtime values (ShortType, LongType, ...) as a generic parameter to the testTransformer function. But luckily the context bound is resolved to an implicit parameter: this is t.encoder, which is passed as the last parameter.
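
To make the mechanism concrete, here is a minimal sketch (class shape and names assumed, not the exact PR code) of how a context bound captures the encoder so it can later be forwarded explicitly:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.types._

object EncoderCaptureSketch {
  // The context bound [A: Encoder] desugars to an implicit Encoder[A]
  // parameter, captured here next to the runtime DataType.
  class NumericTypeWithEncoder[A: Encoder](val numericType: NumericType) {
    val encoder: Encoder[A] = implicitly[Encoder[A]]
  }

  // The encoder can also be supplied explicitly in place of the implicit:
  val numericTypes = Seq(
    new NumericTypeWithEncoder[Short](ShortType)(Encoders.scalaShort),
    new NumericTypeWithEncoder[Long](LongType)(Encoders.scalaLong))
}

Each element then carries both the runtime type and its encoder, which is why the test loop can call testTransformer(...)(t.encoder).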

Contributor

Oh I see. This is a syntax issue: testTransformer needs the generic parameter. When I designed the testTransformer helper function, I could not eliminate the generic parameter, which makes things difficult.
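
For reference, the rough shape of the helper being discussed (paraphrased; the real signature may differ in detail):

import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{DataFrame, Encoder, Row}

trait TestTransformerShapeSketch {
  // The Encoder is what forces the generic parameter: the batch DataFrame
  // has to be replayed as a typed memory stream for the streaming check.
  def testTransformer[A: Encoder](
      dataframe: DataFrame,
      transformer: Transformer,
      firstResultCol: String,
      otherResultCols: String*)(
      checkFunction: Row => Unit): Unit
}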

testTransformer(dfWithTypes, model, "output", "expected") {
  case Row(output: Vector, expected: Vector) =>
    assert(output === expected)
}(t.encoder)
Contributor

Why is (t.encoder) needed here?

Contributor Author

See previous comment.

new NumericTypeWithEncoder[Float](FloatType),
new NumericTypeWithEncoder[Byte](ByteType),
new NumericTypeWithEncoder[Double](DoubleType),
new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder()))
Contributor

ditto.

(1.0, 4.0, 9.0),
(1.0, 4.0, 9.0)
).toDF("result1", "result2", "result3")
.collect().toSeq
Contributor

What about using:

val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()

That would avoid hardcoding the big array.

Contributor Author

I think having a slightly bigger array in the test is better than checking one df.transform result against another df.transform result (as the testTransformer function uses df.transform for the DataFrame tests).

Contributor

@WeichenXu123 WeichenXu123 Mar 6, 2018

But I prefer to avoid hardcoding a big literal array so that the code is easier to maintain, and I think the following code is enough:

val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()
testTransformerByGlobalCheckFunc[(Double, Double, Double)](
  df, plForMultiCols,
  "result1", "result2", "result3") { rows =>
  assert(rows === expected)
}

There is a similar case here #20121 (comment)

Contributor Author

@attilapiros attilapiros Mar 6, 2018

OK, I'll change it soon.

(9.0, 9.0, 9.0),
(9.0, 9.0, 9.0),
(9.0, 9.0, 9.0)
).toDF("result1", "result2", "result3")
Contributor

ditto.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Partly reviewed (reached StandardScalerSuite).

for (expectedAttributeGroup <- expectedAttributes) {
val attributeGroup =
AttributeGroup.fromStructField(rows.head.schema(expectedAttributeGroup.name))
assert(attributeGroup == expectedAttributeGroup)
Contributor

Should we use === instead?

expected: DataFrame,
expectedAttributes: AttributeGroup*): Unit = {
val resultSchema = formulaModel.transformSchema(dataframe.schema)
assert(resultSchema.json == expected.schema.json)
Contributor

You compare schema.json instead of schema.toString. Are you sure they have the same effect?

Contributor Author

I know they are different, but the 'schema.json'-based comparison is more restrictive: it contains the metadata as well.
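
A small spark-shell-style illustration of the point (the field and metadata names here are made up):

import org.apache.spark.sql.types._

val meta = new MetadataBuilder().putString("ml_attr", "example").build()
val schema = StructType(Seq(StructField("label", DoubleType, false, meta)))
println(schema.json) // the "ml_attr" metadata entry appears in the output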

("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
).toDF("id", "a", "b", "features", "label")
// assert(result.schema.toString == resultSchema.toString)
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Contributor

nit: indent

Contributor Author

@attilapiros attilapiros Mar 2, 2018

It was at the level of the val plus 2 extra spaces. Should I align the dots in the same column?

Thanks for the help.

Contributor

I think maybe

...
).toDF("id", "a", "b", "features", "label")
 .select($"id", ...

looks beautiful.

Contributor Author

I am sorry for spending time on this issue, but I would like to be consistent and keep to the rules, so what about the following:

...
)
  .toDF("id", "a", "b", "features", "label")
  .select($"id", ...

So everything is indented by two spaces and the dots are aligned. Could you accept this?

Contributor

I am also confused about the alignment rule. @jkbradley, what do you think?

Member

I'd just keep what you have now. There isn't a great solution here, and what you have fits other code examples in MLlib.

(1, Vectors.dense(0.0, 1.0), Vectors.dense(0.0, 1.0), 1.0),
(2, Vectors.dense(1.0, 2.0), Vectors.dense(1.0, 2.0), 2.0)
).toDF("id", "vec2", "features", "label")
.select($"id", $"vec2".as("vec2", metadata), $"features", $"label")
Contributor

nit: indent

(1, "foo", "zq", Vectors.dense(0.0, 1.0), 0.0),
(2, "bar", "zq", Vectors.dense(1.0, 2.0), 0.0)
).toDF("id", "a", "b", "features", "label")
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Contributor

nit: indent

(2, "bar", "zq", Vectors.dense(1.0, 0.0, 2.0), 0.0),
(3, "bar", "zy", Vectors.dense(1.0, 0.0, 3.0), 2.0)
).toDF("id", "a", "b", "features", "label")
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Contributor

nit: indent

df,
sqlTrans,
"id1") { rows =>
assert(df.storageLevel != StorageLevel.NONE)
Contributor

Moving assert(df.storageLevel != StorageLevel.NONE) here seems meaningless, because you do not use the rows parameter.

Contributor Author

Thanks, I'll change it.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Review done. Thanks for updating so much testing code!

expectedMessagePart: String,
firstResultCol: String) {

def hasExpectedMessage(exception: Throwable): Boolean =
Contributor

I wonder whether the check here is too strict. It requires an exact message match, so when some class modifies its exception message, many test cases will fail.
Or can we just check the exception type?

Contributor Author

It uses contains. I would keep this behaviour, as the test is more expressive this way.
@jkbradley?

Member

Since most other tests check parts of the message, I'm OK with this setup. When we don't think the message will remain stable, we can pass an empty string for expectedMessagePart.

Member

Just curious: Did you have to add the getCause case because of streaming throwing wrapped exceptions?

Contributor Author

Yes, that was the reason.
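
A sketch of the resulting check (the helper name and the expectedMessagePart parameter come from the diff above; the body is assumed): streaming surfaces failures wrapped in another exception, e.g. a StreamingQueryException, so the expected message may sit one or more causes deep.

// Walks the cause chain, closing over the enclosing expectedMessagePart
// parameter shown in the diff; getMessage is null-checked defensively.
def hasExpectedMessage(exception: Throwable): Boolean =
  (exception.getMessage != null &&
    exception.getMessage.contains(expectedMessagePart)) ||
    (exception.getCause != null && hasExpectedMessage(exception.getCause))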

@SparkQA

SparkQA commented Mar 2, 2018

Test build #87904 has finished for PR 20686 at commit 4944c62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2018

Test build #88023 has finished for PR 20686 at commit 7a14154.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks for the PR @attilapiros and @WeichenXu123 for the review! I'll take a look now.

Member

@jkbradley jkbradley left a comment

Running out of time to do a complete review now, but I'll leave initial comments and continue later.

.setInputCol("inputTokens")
.setOutputCol("nGrams")
val dataset = Seq(NGramTestData(
val dataFrame = Seq(NGramTestData(
Member

These kinds of changes are not necessary and make the PR a lot longer. Would you mind reverting them?

Contributor Author

ok

@transient var data: Array[Vector] = _
@transient var dataFrame: DataFrame = _
@transient var normalizer: Normalizer = _
Member

I will say, though, that I'm happy with moving Normalizer into individual tests. It's weird how it is shared here since it's mutated within tests.

Contributor Author

done

@jkbradley
Member

I'll do a complete review now!

@SparkQA

SparkQA commented Mar 9, 2018

Test build #88137 has finished for PR 20686 at commit 80b9c8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2018

Test build #88139 has finished for PR 20686 at commit a5375bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@jkbradley jkbradley left a comment

Thanks for the updates! I finished a detailed pass.


test("input column without ML attribute") {

ignore("input column without ML attribute") {
Member

Let's keep the test but limit it to batch. People should switch to OneHotEncoderEstimator anyways.

assert(rowForSingle.getDouble(0) == rowForMultiCols.getDouble(0) &&
rowForSingle.getDouble(1) == rowForMultiCols.getDouble(1) &&
rowForSingle.getDouble(2) == rowForMultiCols.getDouble(2))
testTransformerByGlobalCheckFunc[(Double, Double, Double)](
Member

I'd remove this. Testing vs. multiCol is already testing batch vs streaming. No need to test singleCol against itself.

assert(rows === expected)
}

testTransformerByGlobalCheckFunc[(Double, Double, Double)](
Member

Is this a repeat of the test just above?

("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
).toDF("id", "a", "b", "features", "label")
// assert(result.schema.toString == resultSchema.toString)
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just keep what you have now. There isn't a great solution here, and what you have fits other code examples in MLlib.

assert(attrSkip.values.get === Array("b", "a"))
// Verify that we skip the c record
// a -> 1, b -> 0
val expectedSkip = Seq((0, 1.0), (1, 0.0)).toDF()
Member

This can be moved outside of the testTransformerByGlobalCheckFunc method.

}

test("ML attributes") {
ignore("ML attributes") {
Member

ditto: do not ignore


model.transform(densePoints1) // should work
model.transform(sparsePoints1) // should work
// should work
Member

We can remove "should work" comments : P

intercept[AssertionError] {
model.transform(densePoints2).collect()
logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
withClue("Did not found expected error message when fit, " +
Member

found -> find

Member

Or just use the original text

intercept[SparkException] {
model.transform(densePoints2.repartition(2)).collect()
logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
withClue("Did not found expected error message when fit, " +
Member

ditto


vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty)
validateResults(vectorSlicer.transform(df))
testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")(
Member

Avoid using a global check function when you don't need to. It'd be better to use testTransformer() since the test is per-row.

Contributor Author

@attilapiros attilapiros Mar 13, 2018

The reason I chose the global check function is the attribute checks:

      val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
      val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
      assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
      resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
        assert(a === b)
      }

This part is not row-based but rather result-set-based.

Contributor

ok.

Member

I see, makes sense.

@SparkQA

SparkQA commented Mar 13, 2018

Test build #88214 has finished for PR 20686 at commit bf713b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks for the updates & all the work this PR took @attilapiros and for the review @WeichenXu123 !
LGTM
Merging with master

@jkbradley
Member

Merging to branch-2.3 too

asfgit pushed a commit that referenced this pull request Mar 15, 2018

Author: “attilapiros” <[email protected]>

Closes #20686 from attilapiros/SPARK-22915.

(cherry picked from commit 279b3db)
Signed-off-by: Joseph K. Bradley <[email protected]>
@asfgit asfgit closed this in 279b3db Mar 15, 2018
@attilapiros
Contributor Author

I am checking.

@attilapiros
Contributor Author

@dongjoon-hyun It seems to me that on 2.3, during streaming, if an exception happens within an ML feature, the feature-generated message is not on the direct 'caused by' exception but one level deeper. If I am right, I can create a separate PR for 2.3 with the correction. To be continued...

@dongjoon-hyun
Member

Thanks, @attilapiros. You can test your PR if you put [BRANCH-2.3] into your title.

@attilapiros
Contributor Author

@dongjoon-hyun The JIRA is SPARK-23728 and the PR is #20852. Could you please enable Jenkins on that?

@dongjoon-hyun
Member

Thanks. For branch testing, I was confused with Maven testing. Since you created the PR against branch-2.3, it looks okay.

mstewart141 pushed a commit to mstewart141/spark that referenced this pull request Mar 24, 2018

Author: “attilapiros” <[email protected]>

Closes apache#20686 from attilapiros/SPARK-22915.
@attilapiros attilapiros deleted the SPARK-22915 branch April 26, 2018 20:07
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018

Author: “attilapiros” <[email protected]>

Closes apache#20686 from attilapiros/SPARK-22915.

(cherry picked from commit 279b3db)
Signed-off-by: Joseph K. Bradley <[email protected]>