[SPARK-22915][MLlib] Streaming tests for spark.ml.feature, from N to Z #20686
Conversation
Test build #87732 has finished for PR 20686 at commit

Test build #87734 has finished for PR 20686 at commit

Ignored tests where issues were found during streaming: from these problems, new JIRA issues can be created once my PR is accepted.

Test build #87780 has finished for PR 20686 at commit

Thanks! I will help review it later.
WeichenXu123 left a comment
Partly reviewed. Thanks!
Vectors.dense(0.6, -1.1, -3.0),
Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
Vectors.sparse(3, Seq()))
I prefer to put the initialization of the data vars into beforeAll.
First of all, thanks for the review.
Regarding this specific comment: this way it can be a 'val', and the variable name and value are close to each other. What is the advantage of separating them?
My only doubt is that when the test suite object is serialized and then deserialized, the data will be lost. But I am not sure in which cases serialization occurs.
But '@transient' is exactly about skipping serialization for this field.
ok, it's a minor issue, let's ignore it.
I'd prefer to revert these changes. As far as I know, nothing is broken, and this is a common pattern used in many parts of MLlib tests.
I think the main reason to move data around would be to have actual + expected values side-by-side for easier reading.
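For context, a minimal sketch contrasting the two patterns under discussion, assuming a ScalaTest FunSuite with BeforeAndAfterAll (the field names mirror the suite, but the scaffolding is illustrative, not the PR's code):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class NormalizerSuiteSketch extends FunSuite with BeforeAndAfterAll {

  // Pattern 1: shared mutable state, excluded from serialization and
  // initialized once in beforeAll (the pre-existing MLlib style).
  @transient var data: Array[Vector] = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    data = Array(
      Vectors.dense(0.6, -1.1, -3.0),
      Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))))
  }

  // Pattern 2: an immutable val declared next to its value, as in this PR.
  @transient val dataAsVal: Array[Vector] = Array(
    Vectors.dense(0.6, -1.1, -3.0),
    Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))))
}

With pattern 2, a deserialized copy of the suite would see the @transient val as null, which is the serialization concern raised above; the same would hold for the @transient var, though, since beforeAll runs only once.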
assert(group.size === 2)
assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0))
assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
}
I think for streaming we don't need to test the attribute-related functionality, so this part can just keep the old testing code.
I am just wondering whether it is a good idea to revert the attribute tests, as they are working and checking streaming. Are there any disadvantages to keeping them? Can you please go into detail on why they are not needed?
Discussed with @jkbradley. Agreed with you.
Thanks.
assert(group.size === 2)
assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("0").withIndex(0))
assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("1").withIndex(1))
}
ditto.
new NumericTypeWithEncoder[Float](FloatType),
new NumericTypeWithEncoder[Byte](ByteType),
new NumericTypeWithEncoder[Double](DoubleType),
new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder()))
Why not use Seq(ShortType, LongType, ...)?
The reason behind it is that we cannot pass runtime values (ShortType, LongType, ...) as a generic parameter to the testTransformer function. But luckily, context bounds are resolved to an implicit parameter: this is the t.encoder, which is passed as the last parameter.
Oh, I see. This is a syntax issue: testTransformer needs a generic parameter. When I designed the testTransformer helper function, I could not eliminate the generic parameter, which makes things difficult.
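To illustrate the mechanism, a minimal sketch of how a context bound desugars into an implicit parameter; the signatures below are simplified stand-ins for the real helper, not the actual code:

import org.apache.spark.sql.{DataFrame, Encoder, Row}

object ContextBoundSketch {
  // A context bound [A: Encoder]...
  def testTransformer[A: Encoder](df: DataFrame)(check: Row => Unit): Unit = ()

  // ...is sugar for an extra implicit parameter list:
  def testTransformerDesugared[A](df: DataFrame)(check: Row => Unit)(
      implicit enc: Encoder[A]): Unit = ()
}

Because the encoder is an ordinary value, it can be supplied explicitly at call sites where A is not statically known, which is why these tests can pass (t.encoder) as the last argument list.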
testTransformer(dfWithTypes, model, "output", "expected") {
  case Row(output: Vector, expected: Vector) =>
    assert(output === expected)
}(t.encoder)
Why is (t.encoder) needed here?
See previous comment.
new NumericTypeWithEncoder[Float](FloatType),
new NumericTypeWithEncoder[Byte](ByteType),
new NumericTypeWithEncoder[Double](DoubleType),
new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder()))
ditto.
(1.0, 4.0, 9.0),
(1.0, 4.0, 9.0)
).toDF("result1", "result2", "result3")
  .collect().toSeq
What about using:

val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()

so that we avoid hardcoding the big array?
I think having a somewhat bigger array in the test is better than checking a df.transform result against another df.transform result (as the testTransformer function uses df.transform for the DataFrame tests).
But I prefer to avoid hardcoding a big literal array so that the code is easier to maintain, and I think the following code is enough:

val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()
testTransformerByGlobalCheckFunc[(Double, Double, Double)](
  df, plForMultiCols,
  "result1", "result2", "result3") { rows =>
  assert(rows === expected)
}

There is a similar case here: #20121 (comment)
ok, I'll change it soon.
(9.0, 9.0, 9.0),
(9.0, 9.0, 9.0),
(9.0, 9.0, 9.0)
).toDF("result1", "result2", "result3")
ditto.
WeichenXu123 left a comment
Partly reviewed (reached StandardScalerSuite).
for (expectedAttributeGroup <- expectedAttributes) {
  val attributeGroup =
    AttributeGroup.fromStructField(rows.head.schema(expectedAttributeGroup.name))
  assert(attributeGroup == expectedAttributeGroup)
Should we use === instead?
expected: DataFrame,
expectedAttributes: AttributeGroup*): Unit = {
val resultSchema = formulaModel.transformSchema(dataframe.schema)
assert(resultSchema.json == expected.schema.json)
You compare schema.json instead of schema.toString. Are you sure they have the same effect?
I know they are different, but a 'schema.json'-based comparison is more restrictive: it contains the metadata as well.
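A minimal sketch of the difference (the field name and metadata key here are made up for illustration):

import org.apache.spark.sql.types._

val meta = new MetadataBuilder().putString("ml_attr", "binary").build()
val schema = StructType(Seq(StructField("label", DoubleType, nullable = false, meta)))

// toString renders only names, types and nullability, roughly:
//   StructType(StructField(label,DoubleType,false))
println(schema)

// json additionally serializes the per-field metadata, roughly:
//   {"type":"struct","fields":[{"name":"label","type":"double",
//    "nullable":false,"metadata":{"ml_attr":"binary"}}]}
println(schema.json)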
("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
).toDF("id", "a", "b", "features", "label")
// assert(result.schema.toString == resultSchema.toString)
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
nit: indent
It was at the level of the val plus 2 extra spaces. Should I indent the dots to the same column?
Thanks for the help.
I think maybe
...
).toDF("id", "a", "b", "features", "label")
.select($"id", ...
looks beautiful.
I am sorry for spending time on this issue, but I would like to be consistent and keep the rules, so what about the following:
...
)
.toDF("id", "a", "b", "features", "label")
.select($"id", ...
So everything is indented by two spaces and the dots are aligned. Could you accept this?
I am also confused about the alignment rule. @jkbradley what do you think?
I'd just keep what you have now. There isn't a great solution here, and what you have fits other code examples in MLlib.
(1, Vectors.dense(0.0, 1.0), Vectors.dense(0.0, 1.0), 1.0),
(2, Vectors.dense(1.0, 2.0), Vectors.dense(1.0, 2.0), 2.0)
).toDF("id", "vec2", "features", "label")
.select($"id", $"vec2".as("vec2", metadata), $"features", $"label")
nit: indent
(1, "foo", "zq", Vectors.dense(0.0, 1.0), 0.0),
(2, "bar", "zq", Vectors.dense(1.0, 2.0), 0.0)
).toDF("id", "a", "b", "features", "label")
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
nit: indent
(2, "bar", "zq", Vectors.dense(1.0, 0.0, 2.0), 0.0),
(3, "bar", "zy", Vectors.dense(1.0, 0.0, 3.0), 2.0)
).toDF("id", "a", "b", "features", "label")
.select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
nit: indent
df,
sqlTrans,
"id1") { rows =>
assert(df.storageLevel != StorageLevel.NONE)
Moving assert(df.storageLevel != StorageLevel.NONE) here seems meaningless, because you do not use the rows parameter.
Thanks, I'll change it.
WeichenXu123 left a comment
Review done. Thanks for updating so much testing code!
expectedMessagePart: String,
firstResultCol: String) {

def hasExpectedMessage(exception: Throwable): Boolean =
I doubt whether the check here is too strict. It requires an exact message match, so when some class modifies its exception message, many test cases will fail.
Or can we just check the exception type?
It uses contains. I would keep this behaviour, as the test is more expressive this way.
@jkbradley?
Since most other tests check parts of the message, I'm OK with this setup. When we don't think the message will remain stable, we can pass an empty string for expectedMessagePart.
Just curious: Did you have to add the getCause case because of streaming throwing wrapped exceptions?
Yes, that was the reason.
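For illustration, a minimal sketch of such a check that also walks the cause chain with a substring match; this is my reading of the discussion, not the verbatim helper:

def hasExpectedMessage(exception: Throwable, expectedMessagePart: String): Boolean =
  (exception.getMessage != null && exception.getMessage.contains(expectedMessagePart)) ||
    (exception.getCause != null && hasExpectedMessage(exception.getCause, expectedMessagePart))

Walking getCause matters because Structured Streaming surfaces failures wrapped in an outer exception, and passing an empty expectedMessagePart makes the check match any message, as suggested above.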
Test build #87904 has finished for PR 20686 at commit

Test build #88023 has finished for PR 20686 at commit

Thanks for the PR @attilapiros, and @WeichenXu123 for the review! I'll take a look now.
jkbradley left a comment
Running out of time to do a complete review now, but I'll leave initial comments and continue later.
.setInputCol("inputTokens")
.setOutputCol("nGrams")
val dataset = Seq(NGramTestData(
val dataFrame = Seq(NGramTestData(
These kinds of changes are not necessary and make the PR a lot longer. Would you mind reverting them?
ok
@transient var data: Array[Vector] = _
@transient var dataFrame: DataFrame = _
@transient var normalizer: Normalizer = _
I will say, though, that I'm happy with moving Normalizer into individual tests. It's weird how it is shared here since it's mutated within tests.
done
I'll do a complete review now!

Test build #88137 has finished for PR 20686 at commit

Test build #88139 has finished for PR 20686 at commit
jkbradley left a comment
Thanks for the updates! I finished a detailed pass.
test("input column without ML attribute") {
ignore("input column without ML attribute")
Let's keep the test but limit it to batch. People should switch to OneHotEncoderEstimator anyways.
assert(rowForSingle.getDouble(0) == rowForMultiCols.getDouble(0) &&
  rowForSingle.getDouble(1) == rowForMultiCols.getDouble(1) &&
  rowForSingle.getDouble(2) == rowForMultiCols.getDouble(2))
testTransformerByGlobalCheckFunc[(Double, Double, Double)](
I'd remove this. Testing vs. multiCol is already testing batch vs streaming. No need to test singleCol against itself.
assert(rows === expected)
}

testTransformerByGlobalCheckFunc[(Double, Double, Double)](
Is this a repeat of the test just above?
assert(attrSkip.values.get === Array("b", "a"))
// Verify that we skip the c record
// a -> 1, b -> 0
val expectedSkip = Seq((0, 1.0), (1, 0.0)).toDF()
This can be moved outside of the testTransformerByGlobalCheckFunc method.
}

test("ML attributes") {
ignore("ML attributes") {
ditto: do not ignore
model.transform(densePoints1) // should work
model.transform(sparsePoints1) // should work
// should work
We can remove the "should work" comments :P
intercept[AssertionError] {
model.transform(densePoints2).collect()
logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
withClue("Did not found expected error message when fit, " +
found -> find
Or just use the original text
intercept[SparkException] {
model.transform(densePoints2.repartition(2)).collect()
logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
withClue("Did not found expected error message when fit, " +
ditto
vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty)
validateResults(vectorSlicer.transform(df))
testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")(
Avoid using a global check function when you don't need to. It'd be better to use testTransformer() since the test is per-row.
The reason I have chosen the global check function is the checks for the attributes:

val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
  assert(a === b)
}

This part is not row-based but more like result-set-based.
ok.
I see, makes sense.
Test build #88214 has finished for PR 20686 at commit

Thanks for the updates & all the work this PR took @attilapiros, and for the review @WeichenXu123!

Merging to branch-2.3 too
# What changes were proposed in this pull request?
Adds structured streaming tests using testTransformer for these suites:
- NGramSuite
- NormalizerSuite
- OneHotEncoderEstimatorSuite
- OneHotEncoderSuite
- PCASuite
- PolynomialExpansionSuite
- QuantileDiscretizerSuite
- RFormulaSuite
- SQLTransformerSuite
- StandardScalerSuite
- StopWordsRemoverSuite
- StringIndexerSuite
- TokenizerSuite
- RegexTokenizerSuite
- VectorAssemblerSuite
- VectorIndexerSuite
- VectorSizeHintSuite
- VectorSlicerSuite
- Word2VecSuite
# How was this patch tested?
They are unit tests.
Author: “attilapiros” <[email protected]>
Closes #20686 from attilapiros/SPARK-22915.
(cherry picked from commit 279b3db)
Signed-off-by: Joseph K. Bradley <[email protected]>
Hi, @jkbradley, @WeichenXu123, @attilapiros. This seems to have broken branch-2.3 for three days. Could you take a look please?

I am checking.

@dongjoon-hyun It seems to me that on 2.3, during streaming, if an exception happens within an ML feature, then the feature-generated message is not on the direct cause exception but one level deeper. If I am right, I can create a separate PR for 2.3 with the correction. To be continued...

Thanks, @attilapiros. You can test your PR if you put

@dongjoon-hyun The JIRA is SPARK-23728 and the PR is #20852. Could you please enable Jenkins on that?

Thanks. For branch testing, I was confused with maven testing. Since you create a PR against