
Conversation

@HyukjinKwon HyukjinKwon commented Aug 16, 2019

What changes were proposed in this pull request?

This PR proposes to fix both of the tests below:

======================================================================
FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4))
AssertionError: False is not true
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 386, in __main__.GaussianMixtureModel
Failed example:
    abs(softPredicted[0] - 1.0) < 0.001
Expected:
    True
Got:
    False
**********************************************************************
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 388, in __main__.GaussianMixtureModel
Failed example:
    abs(softPredicted[1] - 0.0) < 0.001
Expected:
    True
Got:
    False

so that they pass on JDK 11.

The root cause seems to be that float values are understood differently when passed via Py4J. This issue was also found earlier in #25132.

When floats are transferred from Python to the JVM, the values are sent as they are. Python floats are not "precise" due to their own representation limits - https://docs.python.org/3/tutorial/floatingpoint.html.
For some reason, the resulting floats differ between JDK 8 and JDK 11, which is something that is already explicitly not guaranteed.

This seems to be why only some PySpark tests involving floats are failing.

So, this PR fixes it by increasing the tolerance in the identified PySpark test cases.
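
As a minimal illustration of the idea (not the actual patch), the fix boils down to comparing floating-point results with an explicit tolerance instead of expecting exact decimal values; `math.isclose` below is just for demonstration, the affected tests themselves use `numpy.allclose` and doctest bounds:

```python
import math

# Python floats are binary floating point, so simple decimal arithmetic
# is not exact:
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Comparing with an explicit tolerance is robust to such representation
# differences (and, here, to small cross-JDK differences as well):
print(math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9))  # True
```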

Why are the changes needed?

To fully support JDK 11. See, for instance, #25443 and #25423 for ongoing efforts.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested as described in the JIRAs:

$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.mllib.clustering' --python-executables python

@HyukjinKwon
Member Author

cc @WeichenXu123, @srowen, @dongjoon-hyun, this fixes PySpark tests on JDK 11.

@dongjoon-hyun
Member

Wow. Thank you, @HyukjinKwon !

True
>>> model.predict([-0.1,-0.05])
0
>>> softPredicted = model.predictSoft([-0.1,-0.05])
Member Author

For instance, the weights within the Gaussian mixture model:

JDK 8

weights: WrappedArray(0.49520257460263445, 0.33813075873069875, 0.16666666666666685)

JDK 11

weights: WrappedArray(0.5000000000000001, 0.33333333333333326, 0.16666666666666666)
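
For context (an illustrative check, not code in the patch), the gap between those two weight vectors is on the order of 5e-3, far larger than ordinary double rounding error, so an exact doctest expectation cannot hold across the two JDKs:

```python
import numpy as np

# GaussianMixtureModel weights reported above on the two JDKs.
jdk8_weights = np.array([0.49520257460263445, 0.33813075873069875, 0.16666666666666685])
jdk11_weights = np.array([0.5000000000000001, 0.33333333333333326, 0.16666666666666666])

print(np.abs(jdk8_weights - jdk11_weights).max())           # ~0.0048
print(np.allclose(jdk8_weights, jdk11_weights, atol=1e-2))  # True within a loose bound
```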

Member

Also probably OK for the same reason. The test was too specific.

SparkQA commented Aug 16, 2019

Test build #109210 has finished for PR 25475 at commit 0720268.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
Member

Is 1 the minimum difference?

Member Author

Yup ..

JDK 8:

[-11.19194106875243,-7.677866573997363,21.280214474039443]

JDK 11:

[-11.608192299802019,-8.158279986906651,22.177570449962918]

It seems the accumulated floating-point differences affect the results, though they are still roughly correct.
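
For reference (an illustrative check, not code in the patch), the largest gap between the two outputs above is just under 1, which is why atol=1E-4 cannot pass while atol=1 does, and the relative ordering (hence the predicted class) is unchanged:

```python
import numpy as np

# rawPrediction reported above for the same input on JDK 8 and JDK 11.
jdk8 = np.array([-11.19194106875243, -7.677866573997363, 21.280214474039443])
jdk11 = np.array([-11.608192299802019, -8.158279986906651, 22.177570449962918])

print(np.abs(jdk8 - jdk11).max())        # ~0.897, far beyond atol=1E-4
print(np.allclose(jdk8, jdk11, atol=1))  # True with the loosened tolerance
print(jdk8.argmax() == jdk11.argmax())   # True: the predicted class is the same
```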

Member

I'm not sure where the difference comes from, but it could be subtle differences in randomization or something across the JDKs. If these two tests are the only ones that vary, I think we're OK. I agree with loosening the bound here as these are log-odds, and I suspect the test values were picked just because it's what some previous run spit out (that is, it's too specific).

@dongjoon-hyun
Member

+1. This PR looks reasonable and good to me.

@HyukjinKwon
Member Author

I'm going to just merge it. This is a test-only PR and can always be fixed later. I roughly checked with @WeichenXu123 offline as well.

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon HyukjinKwon changed the title [SPARK-28736][SPARK-28735][PYTHON][ML] Fix PySpark ML tests to pass in JDK 11 [SPARK-28736][SPARK-28735][PYTHON][ML][TESTS] Fix PySpark ML tests to pass in JDK 11 Aug 16, 2019
@HyukjinKwon HyukjinKwon deleted the SPARK-28735 branch March 3, 2020 01:19