@junchaoxia
Collaborator

Closing issue #234.

@junchaoxia junchaoxia added the enhancement New feature or request label Oct 8, 2020
@junchaoxia junchaoxia self-assigned this Oct 8, 2020
@junchaoxia junchaoxia changed the base branch from develop_jx to develop October 23, 2020 21:52
@junchaoxia junchaoxia marked this pull request as ready for review October 23, 2020 21:52
@junchaoxia junchaoxia requested a review from janden as a code owner October 23, 2020 21:52
@junchaoxia
Collaborator Author

@janden This PR now includes the tox updates.

@junchaoxia junchaoxia requested a review from janden November 20, 2020 21:01
@junchaoxia
Collaborator Author

My run on my laptop also passed all tests. I asked [email protected] yesterday to enable debug mode for our repo. I got a reply this morning and will look into it today.

@garrettwrong
Collaborator

Also, that grid_3d should be getting a dtype. We're sometimes using self.dtype, but most of those computations are still default (doubles) fwiw. That yields

5.14984130859375e-05 for the energy conservation delta, which is right near the cutoff.

(please disregard the prior numbers they were the wrong test method).

I think you can diagnose this a few different ways, but if you are doing debug mode that should be effective. Better we find it now than random users.
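
To illustrate the dtype point: a grid built without an explicit dtype comes out as float64 in NumPy, and combining it with single-precision data silently promotes the result. This is only a minimal sketch; `make_grid_2d` is a hypothetical stand-in and not the project's actual `grid_3d`.

```python
import numpy as np


def make_grid_2d(n, dtype=np.float64):
    """Hypothetical helper: n-by-n mesh on [-1, 1] in the requested dtype."""
    axis = np.linspace(-1, 1, n, dtype=dtype)
    x, y = np.meshgrid(axis, axis, indexing="ij")
    return x, y


image = np.ones((4, 4), dtype=np.float32)

# Default grid is float64, so the product is promoted to float64.
x64, _ = make_grid_2d(4)
promoted = image * x64

# Passing the dtype through keeps the whole computation in float32.
x32, _ = make_grid_2d(4, dtype=image.dtype)
kept = image * x32

print(promoted.dtype, kept.dtype)
```

Threading `dtype` through every grid call would make it explicit which parts of a test run in single precision and which in double.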

@junchaoxia
Collaborator Author

Here is a short summary of the Travis CI failure for py38.

  1. Debug mode was enabled for our repo, but because the tests are queued, we cannot predict when a job will finish, so it is hard to catch the time window for debugging.
  2. PR Downsampe test only #353 was created to print the intermediate variables of the downsample test for comparison. With atol=1.0e-4, the tests pass on py36, py37, and py38. py36 and py37 generate numerically identical rotations and clean images, but the clean images from py38 differ numerically even though the rotation is identical, as shown below:

Rotation 0 for py36:
[[-0.74707854 -0.5670085 0.346951 ]
[-0.5469602 0.22772461 -0.8055905 ]
[ 0.37776738 -0.79160774 -0.48025924]]
Clean image 0 from py36:
[[-2.0351592e-08 -1.9684194e-08 -1.7325533e-08 ... -1.3675276e-08 -1.7501748e-08 -1.9674872e-08]
[-2.3241228e-08 -2.2717813e-08 -2.0721586e-08 ... -1.4828856e-08 -1.9046070e-08 -2.2039671e-08]
[-1.7560524e-08 -1.7943876e-08 -1.6655576e-08 ... -7.9992333e-09 -1.2651526e-08 -1.5853630e-08]
...
[-1.2806595e-08 -1.0838903e-08 -7.1149771e-09 ... -1.0289114e-08 -1.2330929e-08 -1.3364570e-08]
[-1.7401590e-08 -1.5734940e-08 -1.2949386e-08 ... -1.3913677e-08 -1.6121930e-08 -1.7554839e-08]
[-2.1939513e-08 -2.0568450e-08 -1.7678076e-08 ... -1.6331455e-08 -1.9656454e-08 -2.1554229e-08]]

Rotation 0 from py 37:
[[-0.74707854 -0.5670085 0.346951 ]
[-0.5469602 0.22772461 -0.8055905 ]
[ 0.37776738 -0.79160774 -0.48025924]]
Clean image 0 from py37:
[[-2.0351592e-08 -1.9684194e-08 -1.7325533e-08 ... -1.3675276e-08 -1.7501748e-08 -1.9674872e-08]
[-2.3241228e-08 -2.2717813e-08 -2.0721586e-08 ... -1.4828856e-08 -1.9046070e-08 -2.2039671e-08]
[-1.7560524e-08 -1.7943876e-08 -1.6655576e-08 ... -7.9992333e-09 -1.2651526e-08 -1.5853630e-08]
...
[-1.2806595e-08 -1.0838903e-08 -7.1149771e-09 ... -1.0289114e-08 -1.2330929e-08 -1.3364570e-08]
[-1.7401590e-08 -1.5734940e-08 -1.2949386e-08 ... -1.3913677e-08 -1.6121930e-08 -1.7554839e-08]
[-2.1939513e-08 -2.0568450e-08 -1.7678076e-08 ... -1.6331455e-08 -1.9656454e-08 -2.1554229e-08]]

Rotation 0 from py38:
[[-0.74707854 -0.5670085 0.346951 ]
[-0.5469602 0.22772461 -0.8055905 ]
[ 0.37776738 -0.79160774 -0.48025924]]
Clean image 0 from py38:
[[-5.4526481e-09 -4.8030984e-09 -2.4998599e-09 ... 1.0518306e-09 -2.6759608e-09 -4.7937760e-09]
[-6.4799224e-09 -5.6553517e-09 -3.4160621e-09 ... 7.0701844e-10 -3.0464662e-09 -5.6330691e-09]
[-4.5251909e-09 -4.3037289e-09 -2.4335804e-09 ... 3.1211584e-09 -8.7607077e-10 -3.4403911e-09]
...
[-3.4924597e-09 -2.1764208e-09 9.6270014e-10 ... 1.3878889e-09 -1.5038495e-09 -3.3278411e-09]
[-2.5033842e-09 -1.7068942e-09 1.8394530e-10 ... 3.3230663e-09 3.9608494e-10 -1.8221726e-09]
[-7.0393753e-09 -6.2132131e-09 -3.8930921e-09 ... -7.3100637e-11 -3.7888412e-09 -6.1453420e-09]]

  3. We might need to check the output of finufft.
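
A side note on the atol=1.0e-4 observation: the pixel values printed above are around 1e-8, so an absolute tolerance of 1e-4 applied directly to raw pixels could not tell the py36 and py38 images apart. The sketch below uses the first three entries of "Clean image 0" from the py36 and py38 dumps. Whether this matters depends on which quantity the test actually compares, which is an assumption here.

```python
import numpy as np

# First three entries of "Clean image 0" as printed above.
py36 = np.array([-2.0351592e-08, -1.9684194e-08, -1.7325533e-08], dtype=np.float32)
py38 = np.array([-5.4526481e-09, -4.8030984e-09, -2.4998599e-09], dtype=np.float32)

# np.allclose tests |a - b| <= atol + rtol * |b| elementwise.
loose = np.allclose(py36, py38, rtol=0.0, atol=1.0e-4)
relative = np.allclose(py36, py38, rtol=1.0e-5, atol=0.0)

# Scale-free measure of the gap between the two arrays.
rel_diff = np.linalg.norm(py36 - py38) / np.linalg.norm(py36)

print(loose, relative, rel_diff)
```

The purely absolute check passes while the purely relative one fails, so a relative measure is the more informative one for data at this scale.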

@garrettwrong
Collaborator

I don't think the issue is with Finufft, because py38-stable and py38-dev would have exactly the same version of that package (there is only one PyPI release afaik). As I understand it, the py38-dev variant succeeds. You may confirm this in the logs. If you do check them, be sure to check the inputs as well...

Other packages, like NumPy and SciPy, are not the same version between those (stable, dev) variants; they are upgraded in dev. I suggested above that changing the numpy/scipy versions changes the output. It is possible the base versions of certain packages for py38 are not agreeable.

@janden, if we want to pause apple picker I can look at this instead just to help get it out the door.

@junchaoxia
Collaborator Author

junchaoxia commented Dec 10, 2020

In the new push, I print the result after the nufft transformation. After that, a centered_ifft2 is applied to get the clean images. If nufft is OK, the difference in the clean images is probably due to centered_ifft2. Of course, this only covers the clean image before the downsample; other things in the downsample function might break the test too.

@janden
Collaborator

janden commented Dec 12, 2020

Ok I see. Yeah, it might be the `centered_ifft2` that's causing problems then. Do we know if it's calling the SciPy FFT or FFTW here? @garrettwrong if you want to pause on apple picker and take a look at this, I think it would be very helpful.
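
For readers unfamiliar with the operation being discussed, a centered inverse FFT is usually an inverse FFT wrapped in shifts so that the zero frequency sits at the array center. The sketch below uses `numpy.fft` purely for illustration; the `_sketch` functions are hypothetical and may not match the backend or the exact conventions of the project's `centered_ifft2`.

```python
import numpy as np


def centered_ifft2_sketch(x):
    """Inverse 2D FFT with the zero frequency at the array center."""
    x = np.fft.ifftshift(x, axes=(-2, -1))
    x = np.fft.ifft2(x, axes=(-2, -1))
    return np.fft.fftshift(x, axes=(-2, -1))


def centered_fft2_sketch(x):
    """Forward counterpart of centered_ifft2_sketch."""
    x = np.fft.ifftshift(x, axes=(-2, -1))
    x = np.fft.fft2(x, axes=(-2, -1))
    return np.fft.fftshift(x, axes=(-2, -1))


rng = np.random.default_rng(0)
freq = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))

# Forward after inverse should reproduce the input up to rounding.
roundtrip = centered_fft2_sketch(centered_ifft2_sketch(freq))
max_err = np.max(np.abs(roundtrip - freq))
print(max_err)
```

A round-trip check like this in each environment is one way to see whether the FFT step itself contributes differences beyond rounding.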

@garrettwrong
Collaborator

Sure thing, I'll pick it up on Monday. I'm pretty suspicious about versions of the numerical libraries because py38-dev seems okay. The only difference being pip upgrades...

@garrettwrong
Collaborator

garrettwrong commented Dec 14, 2020

py38-dev is using openblas after the pip upgrade, while py38-stable is not reporting any third party optimized library backend in the config. I have a py37-stable queued up now for comparison.

After noting the difference in BLAS I replaced the np.linalg.norm call in anorm with sqrt(sum(x*x)) for testing purposes (since we know norm and dot are sensitive from some other tests)... This changes the results enough to get under the cutoff that was failing. It is not clear to me if it is the final comparison norm or somewhere earlier in the procedure (though I should know that later).

I noticed some of the grid calls might have the wrong dtype, but that is probably a red herring.
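
The norm swap described above can be sketched as follows. The array is a hypothetical stand-in for the test data, and the remark about BLAS is an assumption about how the default norm of a real array is evaluated in the NumPy versions I am aware of, not something verified against the pinned version here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a stack of single-precision images.
x = rng.standard_normal((16, 32, 32)).astype(np.float32)

# Library norm: may be routed through a BLAS-backed dot product.
norm_lib = np.linalg.norm(x)

# Explicit elementwise form, as tried in the comment above.
norm_sum = np.sqrt(np.sum(x * x))

# Double-precision reference for judging both float32 results.
norm_ref = np.sqrt(np.sum(x.astype(np.float64) ** 2))

print(norm_lib, norm_sum, norm_ref)
```

Both float32 results should agree with the reference to roughly single precision; how far apart they land from each other can depend on the backend and the accumulation order.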

@garrettwrong
Collaborator

> I have a py37-stable queued up now for comparison.

py37-stable is also finding the system openblas. It seems there were some historical issues with Python 3.8 and older numpy/scipy (which we are using). I've tried bumping the versions just a little.

What I don't understand yet is why this particular unit test is the only one sufficiently debased.
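
One way to see which backend a given environment picked up is to capture the output of `np.show_config()`. This is a rough sketch: the string match is a crude heuristic, and the layout of the printed configuration differs between NumPy versions.

```python
import contextlib
import io

import numpy as np


def numpy_build_config():
    """Capture the text that np.show_config() prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


config_text = numpy_build_config()
# Crude check: does the build information mention OpenBLAS at all?
uses_openblas = "openblas" in config_text.lower()
print(np.__version__, "mentions openblas:", uses_openblas)
```

Running this inside each tox environment would give a side-by-side record of the detected libraries.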

@junchaoxia
Collaborator Author

  1. I checked the nufft output for py36-stable, py37-stable, and py38-stable. Most values are numerically consistent, but there are some exceptions, such as some numbers in the last row of py38-stable, as shown below:

py36-stable
[[-1.11001182e-08+1.56864814e-08j -1.37571439e-08+1.65238667e-08j
-8.51121484e-09+1.09221201e-08j ... -2.09223963e-08-7.51846141e-09j
-3.14078652e-09-2.43461029e-09j 2.17175256e-08+9.24582100e-09j]
[-1.50935353e-08+1.34608813e-08j -9.36470634e-09+1.41411451e-08j
-1.42585632e-09+1.07958442e-08j ... -1.44706345e-08-2.42284193e-09j
1.14384582e-08-6.52098464e-09j 2.08393462e-08-1.12866028e-09j]
[-8.86734330e-09+1.26912303e-08j -1.34354563e-08+1.17369732e-08j
-3.54980010e-08+2.93758373e-09j ... 3.56668495e-09-1.31948994e-08j
-7.55102558e-09-8.93524721e-09j 3.44529560e-08+6.35675068e-09j]
...
[ 2.28478303e-08-2.24653220e-08j 2.49199417e-08+6.12903461e-09j
3.17763593e-08+1.32718947e-08j ... -3.52691316e-08-8.33530134e-09j
-1.20200543e-08-1.20985577e-08j -7.26150695e-09-1.24739339e-08j]
[-1.83208950e-08-5.87044635e-10j 9.05679354e-09+1.07516271e-08j
-1.58550364e-08+2.98642361e-10j ... 2.58715249e-09-6.30401731e-09j
-1.00511599e-08-1.10683755e-08j -1.59317697e-08-1.29677300e-08j]
[-5.01651698e-09-1.02724531e-08j -1.65297056e-08+1.51897006e-09j
-2.59537636e-08+6.38636788e-09j ... -9.32439992e-09-7.26109528e-09j
-1.47572736e-08-1.46932235e-08j -1.13000969e-08-1.27981918e-08j]]

py37-stable
[[-1.11001182e-08+1.56864814e-08j -1.37571439e-08+1.65238667e-08j
-8.51121484e-09+1.09221201e-08j ... -2.09223963e-08-7.51846141e-09j
-3.14078652e-09-2.43461029e-09j 2.17175256e-08+9.24582100e-09j]
[-1.50935353e-08+1.34608813e-08j -9.36470634e-09+1.41411451e-08j
-1.42585632e-09+1.07958442e-08j ... -1.44706345e-08-2.42284193e-09j
1.14384582e-08-6.52098464e-09j 2.08393462e-08-1.12866028e-09j]
[-8.86734330e-09+1.26912303e-08j -1.34354563e-08+1.17369732e-08j
-3.54980010e-08+2.93758373e-09j ... 3.56668495e-09-1.31948994e-08j
-7.55102558e-09-8.93524721e-09j 3.44529560e-08+6.35675068e-09j]
...
[ 2.28478303e-08-2.24653220e-08j 2.49199417e-08+6.12903461e-09j
3.17763593e-08+1.32718947e-08j ... -3.52691316e-08-8.33530134e-09j
-1.20200543e-08-1.20985577e-08j -7.26150695e-09-1.24739339e-08j]
[-1.83208950e-08-5.87044635e-10j 9.05679354e-09+1.07516271e-08j
-1.58550364e-08+2.98642361e-10j ... 2.58715249e-09-6.30401731e-09j
-1.00511599e-08-1.10683755e-08j -1.59317697e-08-1.29677300e-08j]
[-5.01651698e-09-1.02724531e-08j -1.65297056e-08+1.51897006e-09j
-2.59537636e-08+6.38636788e-09j ... -9.32439992e-09-7.26109528e-09j
-1.47572736e-08-1.46932235e-08j -1.13000969e-08-1.27981918e-08j]]

py38-stable
[[-1.11001182e-08+1.56864814e-08j -1.37571439e-08+1.65238667e-08j
-8.51121484e-09+1.09221201e-08j ... -2.09223963e-08-7.51846141e-09j
-3.14078652e-09-2.43461029e-09j 2.17175256e-08+9.24582100e-09j]
[-1.50935353e-08+1.34608813e-08j -9.36471345e-09+1.41411345e-08j
-1.42585632e-09+1.07958442e-08j ... -1.44706345e-08-2.42284193e-09j
1.14384582e-08-6.52098464e-09j 2.08393462e-08-1.12866028e-09j]
[-8.86734330e-09+1.26912303e-08j -1.34354563e-08+1.17369732e-08j
-3.54980010e-08+2.93758373e-09j ... 3.56668495e-09-1.31948994e-08j
-7.55102558e-09-8.93524721e-09j 3.44529560e-08+6.35675068e-09j]
...
[ 2.28478303e-08-2.24653220e-08j 2.49199417e-08+6.12903461e-09j
3.17763593e-08+1.32718947e-08j ... -3.52691316e-08-8.33530134e-09j
[-1.83208950e-08-5.87044635e-10j 9.05679354e-09+1.07516271e-08j
-1.58550364e-08+2.98609498e-10j ... 2.58715249e-09-6.30401731e-09j
-1.00511652e-08-1.10683542e-08j -1.59317697e-08-1.29677300e-08j]
[-5.01651698e-09-1.02724531e-08j -1.65297056e-08+1.51897006e-09j
-2.59537636e-08+6.38636788e-09j ... -9.32437416e-09-7.26109217e-09j
-1.47572523e-08-1.46931960e-08j -1.13000969e-08-1.27981918e-08j]]

  2. I did not see py36-stable, py37-stable, or py38-stable uninstall any packages to meet our requirements. For py38-dev, a few packages were uninstalled and replaced, as shown below:
    ERROR: aspire 0.6.0 has requirement numpy==1.16, but you'll have numpy 1.19.4 which is incompatible.
    ERROR: aspire 0.6.0 has requirement numpydoc==0.7.0, but you'll have numpydoc 1.1.0 which is incompatible.
    ERROR: aspire 0.6.0 has requirement pandas==0.25.3, but you'll have pandas 1.1.5 which is incompatible.
    ERROR: aspire 0.6.0 has requirement scikit-image==0.14.0, but you'll have scikit-image 0.17.2 which is incompatible.
    ERROR: aspire 0.6.0 has requirement scipy==1.4.0, but you'll have scipy 1.5.4 which is incompatible.
    Installing collected packages: scipy
    Found existing installation: scipy 1.4.0
    Uninstalling scipy-1.4.0:
    Successfully uninstalled scipy-1.4.0
    Successfully installed scipy-1.5.4

The strange thing about py38-dev is that it installed scipy-1.5.4 rather than the scipy-1.4.0 we asked for, yet the unit tests passed.

  3. I will try to confirm whether any differences come from centered_fft2, which calls scipy.fftpack.

@garrettwrong
Collaborator

I'm having trouble understanding most of your comment. Some of it doesn't make much sense.

This isn't strange because that is how the tox testing environments are designed to work, as we have discussed previously. The -dev environments systematically upgrade the pip packages. This way we cover the old (pinned) versions, and also newer, potentially rolling, versions of packages. These environments are designed to mimic what some users might experience using pip (notably outside of conda). Generally I expect this to find things when using upstream (newer) packages. In this case, we found something using a newer python with our pretty old packages....

I described a source of the numerical issue already above. Under Python 3.8 on this platform the older (pinned) versions of numpy/scipy are not finding openblas, as we are in the other environments. This can easily be demonstrated. I also wrote a specific calculation that you can replace to exercise this (and pass the test..).

A lot of people had similar problems with numpy/scipy when 3.8.0 was released. Most of the time developers just told people to upgrade as bug fixes rolled out in later versions.... which is an option... but I have not been able to find any compatibility matrices to cite a minimal upgrade. However, we already know the latest stable versions work fine....

There might be some simpler things to do.

It would not surprise me if anything FFT-related returns different results using different numerical libraries.

@junchaoxia
Collaborator Author

junchaoxia commented Dec 14, 2020

@garrettwrong For my first point, I was trying to confirm that the images after nufft are the same for all Python versions, since the volume and rotation are identical and we use the same version of the nufft lib. Some numbers in the last row of py38-stable differ slightly from those of py36 and py37, but that is probably OK.
For the second point, I can see that the new versions of numpy, pandas, and scikit-image in py38 were uninstalled and our required old versions were installed. But scipy is different: 1.5.4 was installed instead of the 1.4.0 in our list.

I did not follow the anorm and np.linalg.norm discussion previously. But it can indeed accumulate error if we use a large number of images at high resolution. By default, if we do not specify the axis parameter, it flattens all images into a 1D array and computes a single norm over all of them. In our case we only need the norm of each image, not of the whole stack. I have modified the test and pushed. Let's see whether this reduces the error.
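
The difference between the two norm calls can be sketched like this. The stack shape is hypothetical, and the per-image call illustrates the idea rather than the exact change in the commit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stack of 10 images, 8 x 8 pixels each.
stack = rng.standard_normal((10, 8, 8)).astype(np.float32)

# No axis: the stack is flattened and a single scalar comes back.
whole_stack_norm = np.linalg.norm(stack)

# axis=(1, 2): one Frobenius norm per image, shape (10,).
per_image_norms = np.linalg.norm(stack, axis=(1, 2))

# The whole-stack norm is the 2-norm of the per-image norms.
recombined = np.linalg.norm(per_image_norms)

print(whole_stack_norm, per_image_norms.shape, recombined)
```

Comparing per-image norms keeps each reduction short and lets a failing assertion point at a specific image instead of one aggregate number.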

@garrettwrong
Collaborator

garrettwrong commented Dec 14, 2020

> the last row of py38-stable has slight differences from that of py36 and py37. But probably it is OK.

hrrm, maybe diffs typical of different numerical libraries...

> For the second point, I can see that the new versions of numpy, pandas, and scikit-image in py38 were uninstalled and our required old versions were installed. But scipy is different: 1.5.4 was installed instead of the 1.4.0 in our list.

No, only -dev performs such an upgrade. -stable uses 1.4.0. No attempt is made to upgrade or downgrade versions outside of the initial pip install for the -stable environments. I think you are confusing logs from different environments.

> I did not follow the anorm and np.linalg.norm discussion previously. But it can indeed accumulate error if we use a large number of images at high resolution. By default, if we do not specify the axis parameter, it flattens all images into a 1D array and computes a single norm over all of them. In our case we only need the norm of each image, not of the whole stack. I have modified the test and pushed. Let's see whether this reduces the error.

So all these tests might have implored us to catch a bug in the test code. That would be satisfying.

FWIW, installing openblas to the system before the pip install does allow the py38-stable to pass. Perhaps that norm line was tickled a bit too much by the implementation difference... Probably you can take it from here. (If your anorm change works, that would be best, since having different libraries would actually have proven useful...).

--- a/.travis.yml
+++ b/.travis.yml
@@ -24,7 +24,7 @@ jobs:
       env: TOXENV=py38-dev
 
 install:
+  - sudo apt-get install -y libopenblas-dev
   - pip install -U tox tox-travis

@janden
Collaborator

janden commented Dec 14, 2020

So if I understand correctly, the issue seems to be that we're using a different BLAS implementation in py38-stable compared to the other envs? And that implementation has numerical stability issues in its np.dot implementation. This is why the problem goes away when we 1) force it to use the same BLAS implementation, 2) swap out np.dot with sum, or 3) work with smaller sums?
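
As a generic illustration of why accumulation order and sum length matter in single precision (this does not reproduce what any particular BLAS does), compare a strictly sequential float32 running sum with NumPy's pairwise `np.sum` against a double-precision reference:

```python
import numpy as np

x = np.full(1_000_000, 0.1, dtype=np.float32)

# Reference: accumulate in double precision.
reference = np.sum(x, dtype=np.float64)

# Strictly sequential float32 accumulation (last entry of the running sum).
sequential = np.cumsum(x)[-1]

# np.sum uses pairwise summation on contiguous data, still in float32.
pairwise = np.sum(x)

err_sequential = abs(float(sequential) - reference)
err_pairwise = abs(float(pairwise) - reference)
print(err_sequential, err_pairwise)
```

The same inputs give different float32 answers depending only on how the additions are grouped, which is consistent with shorter per-image sums being less sensitive to the backend.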

@junchaoxia
Collaborator Author

@janden I submitted two new jobs: one in this PR that calculates the norm of each image, and one in PR #353 that uses the same libopenblas-dev. Both are queued. We probably need to wait until tomorrow.

@garrettwrong
Collaborator

garrettwrong commented Dec 14, 2020

Correct. As I understand it, the base Python 3.8 configuration is not installing openblas, but the other base python environments are; and this is based on various Python package defaults, not our code (unless we manually do so like I pasted above). The -dev environments also install openblas as part of their pip install upgrades. py38-stable is unique in this way.

I don't know that not using openblas is directly an issue, so much as it has the effect of tickling that norm line. But generally yes on those three points. If taking a more targeted norm works, that would be a better fix.

@janden
Collaborator

janden commented Dec 15, 2020

> If taking a more targeted norm works, that would be a better fix.

You mean looking at norms per image instead of overall norms? Then yes, I agree.

@junchaoxia
Collaborator Author

junchaoxia commented Dec 15, 2020

@janden @garrettwrong The submitted job finished last night. Here is a short summary.

  1. The test passes just by comparing each image instead of calculating anorm over the whole stack of images, as shown in commit beaea3c.

  2. Adding the installation of libopenblas-dev (PR Downsampe test only #353) gives a consistent anorm accuracy of -8.94e-07 over all images, and the downsampled images are numerically identical for all Python versions, as shown below.

Downsample image 0 from py36-stable
[[-2.8540853e-08 -2.3724340e-08 -1.2211842e-08 ... -6.1221499e-09 -1.7272157e-08 -2.6453279e-08]
[-1.5860451e-08 -1.3198132e-08 -6.6291932e-09 ... 1.0347321e-08 -1.0429630e-09 -1.0446911e-08]
[-1.1178599e-08 -1.3101271e-08 -1.0954409e-08 ... 2.0796961e-08 6.0754246e-09 -5.5406417e-09]
...
[-1.3704266e-08 -6.6102075e-09 7.2705006e-09 ... -1.2678356e-09 -8.2927727e-09 -1.2885721e-08]
[-6.0754246e-09 -9.2131813e-10 9.3664312e-09 ... -2.0513653e-09 -5.1659299e-09 -7.4678610e-09]
[-2.1472260e-08 -1.6415470e-08 -6.9172756e-09 ... -6.7773271e-09 -1.2421879e-08 -1.8898390e-08]]

Downsample image 0 from py37-stable
[[-2.8540853e-08 -2.3724340e-08 -1.2211842e-08 ... -6.1221499e-09 -1.7272157e-08 -2.6453279e-08]
[-1.5860451e-08 -1.3198132e-08 -6.6291932e-09 ... 1.0347321e-08 -1.0429630e-09 -1.0446911e-08]
[-1.1178599e-08 -1.3101271e-08 -1.0954409e-08 ... 2.0796961e-08 6.0754246e-09 -5.5406417e-09]
...
[-1.3704266e-08 -6.6102075e-09 7.2705006e-09 ... -1.2678356e-09 -8.2927727e-09 -1.2885721e-08]
[-6.0754246e-09 -9.2131813e-10 9.3664312e-09 ... -2.0513653e-09 -5.1659299e-09 -7.4678610e-09]
[-2.1472260e-08 -1.6415470e-08 -6.9172756e-09 ... -6.7773271e-09 -1.2421879e-08 -1.8898390e-08]]

Downsample image 0 from py38-stable
[[-2.8540853e-08 -2.3724340e-08 -1.2211842e-08 ... -6.1221499e-09 -1.7272157e-08 -2.6453279e-08]
[-1.5860451e-08 -1.3198132e-08 -6.6291932e-09 ... 1.0347321e-08 -1.0429630e-09 -1.0446911e-08]
[-1.1178599e-08 -1.3101271e-08 -1.0954409e-08 ... 2.0796961e-08 6.0754246e-09 -5.5406417e-09]
...
[-1.3704266e-08 -6.6102075e-09 7.2705006e-09 ... -1.2678356e-09 -8.2927727e-09 -1.2885721e-08]
[-6.0754246e-09 -9.2131813e-10 9.3664312e-09 ... -2.0513653e-09 -5.1659299e-09 -7.4678610e-09]
[-2.1472260e-08 -1.6415470e-08 -6.9172756e-09 ... -6.7773271e-09 -1.2421879e-08 -1.8898390e-08]]

@garrettwrong
Collaborator

I think the best thing to do is to just fix the norm call as you have applied it. I like that the CI platform did not admit the test code and drove the issue to be addressed. I would prefer to leave the library situation as it was in develop unless it becomes absolutely necessary to force it. (That's my vote at least, fwiw).

@junchaoxia
Collaborator Author

@garrettwrong I committed the change adding libopenblas-dev before seeing your comment. I will wait for @janden and revert it if necessary.

@janden
Collaborator

janden commented Dec 16, 2020

Great. Let's revert the library change (would be good to keep as close an environment as possible to what end users will see) and we should be good to go.

@junchaoxia
Collaborator Author

@janden I reset the commit to remove the change that added the openblas lib.

@junchaoxia junchaoxia merged commit fc4d7f7 into develop Dec 17, 2020
@garrettwrong garrettwrong deleted the preprocess_test branch February 10, 2021 12:50