
Conversation

@HyukjinKwon HyukjinKwon commented Aug 16, 2019

What changes were proposed in this pull request?

This PR proposes to fix both of the tests below:

======================================================================
FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4))
AssertionError: False is not true
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 386, in __main__.GaussianMixtureModel
Failed example:
    abs(softPredicted[0] - 1.0) < 0.001
Expected:
    True
Got:
    False
**********************************************************************
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 388, in __main__.GaussianMixtureModel
Failed example:
    abs(softPredicted[1] - 0.0) < 0.001
Expected:
    True
Got:
    False

so that they pass on JDK 11.

The root cause seems to be that float values are understood differently when passed via Py4J. This issue was also found earlier in #25132.

When floats are transferred from Python to the JVM, the values are sent as they are. Python floats are not "precise" due to their own representation limits - https://docs.python.org/3/tutorial/floatingpoint.html.
For some reason, the resulting floats differ between JDK 8 and JDK 11, which is something that is already explicitly not guaranteed.

This seems to be why only some PySpark tests involving floats are failing.

So, this PR fixes it by increasing the tolerance in the identified PySpark test cases.
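
As a minimal illustration of the idea (not the actual patch), the fix boils down to comparing floating-point results with an explicit tolerance instead of expecting exact decimal values; `math.isclose` below is just for demonstration, the affected tests themselves use `numpy.allclose` and doctest bounds:

```python
import math

# Python floats are binary floating point, so simple decimal arithmetic
# is not exact:
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Comparing with an explicit tolerance is robust to such representation
# differences (and, here, to small cross-JDK differences as well):
print(math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9))  # True
```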

Why are the changes needed?

To fully support JDK 11. See, for instance, #25443 and #25423 for ongoing efforts.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested as described in the JIRAs:

$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.mllib.clustering' --python-executables python

@HyukjinKwon
Member Author

cc @WeichenXu123, @srowen, @dongjoon-hyun, this fixes PySpark tests on JDK 11.

@dongjoon-hyun
Member

Wow. Thank you, @HyukjinKwon !

True
>>> model.predict([-0.1,-0.05])
0
>>> softPredicted = model.predictSoft([-0.1,-0.05])
Member Author

For instance, the weights within the Gaussian mixture model:

JDK 8

weights: WrappedArray(0.49520257460263445, 0.33813075873069875, 0.16666666666666685)

JDK 11

weights: WrappedArray(0.5000000000000001, 0.33333333333333326, 0.16666666666666666)
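
For context (an illustrative check, not code in the patch), the gap between those two weight vectors is on the order of 5e-3, far larger than ordinary double rounding error, so an exact doctest expectation cannot hold across the two JDKs:

```python
import numpy as np

# GaussianMixtureModel weights reported above on the two JDKs.
jdk8_weights = np.array([0.49520257460263445, 0.33813075873069875, 0.16666666666666685])
jdk11_weights = np.array([0.5000000000000001, 0.33333333333333326, 0.16666666666666666])

print(np.abs(jdk8_weights - jdk11_weights).max())           # ~0.0048
print(np.allclose(jdk8_weights, jdk11_weights, atol=1e-2))  # True within a loose bound
```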

Member

Also probably OK for the same reason. The test was too specific.

SparkQA commented Aug 16, 2019

Test build #109210 has finished for PR 25475 at commit 0720268.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
Member

Is 1 the minimum difference?

Member Author

Yup ..

JDK 8:

[-11.19194106875243,-7.677866573997363,21.280214474039443]

JDK 11:

[-11.608192299802019,-8.158279986906651,22.177570449962918]

It seems the accumulated floating-point differences affect the results, though they are still roughly correct.
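
For reference (an illustrative check, not code in the patch), the largest gap between the two outputs above is just under 1, which is why atol=1E-4 cannot pass while atol=1 does, and the relative ordering (hence the predicted class) is unchanged:

```python
import numpy as np

# rawPrediction reported above for the same input on JDK 8 and JDK 11.
jdk8 = np.array([-11.19194106875243, -7.677866573997363, 21.280214474039443])
jdk11 = np.array([-11.608192299802019, -8.158279986906651, 22.177570449962918])

print(np.abs(jdk8 - jdk11).max())        # ~0.897, far beyond atol=1E-4
print(np.allclose(jdk8, jdk11, atol=1))  # True with the loosened tolerance
print(jdk8.argmax() == jdk11.argmax())   # True: the predicted class is the same
```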

Member

I'm not sure where the difference comes from, but it could be subtle differences in randomization or something across the JDKs. If these two tests are the only ones that vary, I think we're OK. I agree with loosening the bound here as these are log-odds, and I suspect the test values were picked just because it's what some previous run spit out (that is, it's too specific).

@dongjoon-hyun
Member

+1. This PR looks reasonable and good to me.

@HyukjinKwon
Member Author

I'm going to just merge it. This is a test-only PR and can always be fixed later. I roughly checked with @WeichenXu123 offline as well.

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon HyukjinKwon changed the title [SPARK-28736][SPARK-28735][PYTHON][ML] Fix PySpark ML tests to pass in JDK 11 [SPARK-28736][SPARK-28735][PYTHON][ML][TESTS] Fix PySpark ML tests to pass in JDK 11 Aug 16, 2019
@HyukjinKwon HyukjinKwon deleted the SPARK-28735 branch March 3, 2020 01:19