On the application of VDM to the ribosome dataset #1327

garrettwrong · 2025-09-30T12:54:32Z

garrettwrong
Sep 30, 2025
Maintainer

This discussion summarizes work done to compare the performance of the class averaging pipeline component between the MATLAB and Python codes. There are multiple high-level differences, but as of this date they mainly come down to the choice not to implement VDM in Python, and the associated algorithmic and settings changes that support the VDM application. The Python code also includes a bit of a tinkerer's toolbox of infrastructure for people interested in the direct alignment problem. Both codes' approaches are summarized at a very high level, then some reconstruction results are compared.

Classification

Both codes should, in theory and in practice, have very similar RIR implementations. The Python code optionally implements some optimizations discussed in the original paper that were not implemented in the MATLAB code (e.g., true random nearest neighbor). The Python code also provides the legacy call.

They do differ in the generation and use of the covariance matrix and PCA. The legacy code uses a bespoke complex covariance matrix and complex PCA with the prolate basis. The Python code uses our cov2d-generated matrix and an extension of scikit's PCA. The Python code should accept any of the 5 supported bases in that repo, but defaults to FFB2D.

Alignment Estimation

The Python code uses a direct rotational+translational alignment. The fast implementation uses GPU-accelerated PFT cross-correlation to determine the rotations. An outer loop brute-forces over translations to an adjustable (sub-pixel) level, with the default being 1-pixel search steps in a small disc.
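To make the two nested searches concrete, here is a minimal NumPy sketch (not the ASPIRE-Python implementation). It assumes the images have already been resampled onto a polar grid of shape `(n_rad, n_theta)`, so a rotation becomes a circular shift along the angular axis and the best angle falls out of one FFT-based circular cross-correlation. The names `disc_shifts` and `best_rotation` are hypothetical, and the real code works with a polar Fourier transform on the GPU rather than this toy version.

```python
import numpy as np


def disc_shifts(radius, step=1.0):
    """Enumerate candidate (dx, dy) translations on a grid inside a disc."""
    ticks = np.arange(-radius, radius + step / 2, step)
    return [
        (float(dx), float(dy))
        for dx in ticks
        for dy in ticks
        if dx**2 + dy**2 <= radius**2
    ]


def best_rotation(ref_polar, img_polar):
    """Find the angular shift (in samples) that best aligns img to ref.

    Both inputs are polar-sampled arrays of shape (n_rad, n_theta).
    Returns (k, score), where img is approximately ref rolled by k samples
    along the angular axis.
    """
    f_ref = np.fft.fft(ref_polar, axis=1)
    f_img = np.fft.fft(img_polar, axis=1)
    # Circular cross-correlation over the angle, summed over all radii.
    corr = np.fft.ifft(f_img * np.conj(f_ref), axis=1).real.sum(axis=0)
    k = int(np.argmax(corr))
    return k, float(corr[k])


# Toy check: rotate a random polar "image" by 17 angular samples and recover it.
rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 72))
img = np.roll(ref, 17, axis=1)
k, score = best_rotation(ref, img)
```

In a full search, the outer loop would apply each candidate from `disc_shifts` to the image, resample to polar coordinates, call `best_rotation`, and keep the (shift, angle) pair with the highest score. Halving `step` gives a sub-pixel search at roughly four times the cost.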

The MATLAB code performs initial alignment using the compressed FSPCA representation. I believe the starting basis for this operation is PSWF. This alignment and stacking is used to generate seeds for VDM. The VDM procedure uses the seeds to generate its network. VDM then outputs new class member indices and rotational estimates to be used in the final alignment. Generally speaking, these estimates use different images (and angles) than those found initially by RIR.

There is an align_main code that does a similar brute force alignment in MATLAB before VDM. With a lot of effort this can be run on full images and should yield similar results to the Python approach. This would create MATLAB class averages without VDM and is mentioned again at the end.

Stacking

Python translation estimates are applied via FFT translation, and rotation alignment estimates are applied in a steerable basis (currently defaulting to FFB2D). The images are stacked as basis coefficients and transformed back to the image domain.
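For readers unfamiliar with FFT translation, the idea is that a shift in the image domain is a linear phase ramp in the Fourier domain. A minimal NumPy sketch, assuming periodic boundaries; the function name `fft_shift_image` is hypothetical and is not ASPIRE-Python's API:

```python
import numpy as np


def fft_shift_image(img, dx, dy):
    """Shift a 2D image by dx pixels along columns and dy pixels along rows.

    The shift is applied as a phase ramp in Fourier space, so it wraps
    around periodically and also accepts non-integer (sub-pixel) shifts.
    """
    ny, nx = img.shape
    ky = np.fft.fftfreq(ny)[:, None]  # cycles per pixel along rows
    kx = np.fft.fftfreq(nx)[None, :]  # cycles per pixel along columns
    phase = np.exp(-2j * np.pi * (kx * dx + ky * dy))
    return np.fft.ifft2(np.fft.fft2(img) * phase).real


# For integer shifts the result matches a circular roll of the pixels.
rng = np.random.default_rng(1)
img = rng.standard_normal((9, 9))
shifted = fft_shift_image(img, dx=3, dy=-2)
```

The same call with fractional `dx`, `dy` is what makes the sub-pixel translation search possible without a separate interpolation scheme.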

In MATLAB, the final image rotation is performed on the input images using the shearing method in the image domain, and the means are taken directly in the image domain.
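As background on shearing-based rotation: the classic approach factors a rotation into three shears (x, then y, then x), each of which only moves pixels along one axis. Whether the MATLAB routine uses exactly this factorization is an assumption on my part; the sketch below only checks the underlying matrix identity, with hypothetical helper names.

```python
import numpy as np


def shear_x(a):
    """Shear along x: x' = x + a*y, y' = y."""
    return np.array([[1.0, a], [0.0, 1.0]])


def shear_y(b):
    """Shear along y: x' = x, y' = y + b*x."""
    return np.array([[1.0, 0.0], [b, 1.0]])


def rotation(theta):
    """Standard 2D counter-clockwise rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def three_shear(theta):
    """Rotation by theta composed from three shears."""
    a = -np.tan(theta / 2.0)
    b = np.sin(theta)
    return shear_x(a) @ shear_y(b) @ shear_x(a)


theta = 0.7
difference = np.abs(three_shear(theta) - rotation(theta)).max()
```

Since tan(theta/2) blows up near 180 degrees, implementations of this idea typically handle large angles by first applying an exact multiple-of-90-degree rotation and shearing only the remainder.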

Cross Comparison

After reviewing both code bases, it became clear that it would be a ton of work to recreate the MATLAB calls correctly using external (Python) data. Instead, the MATLAB pipeline was run end to end using the workflow tool and my current understanding of the values I should be supplying it for the Ribosome dataset. Starting from the published EMPIAR 10028 dataset, this results in the following outputs, which are used for the comparisons.

i) phaseflipped_cropped_downsampled_prewhitened_group1.mrcs (Preprocessed)
ii) averages_nn50_sorted_group1.mrcs (VDM)
iii) averages_nn50_EM_group1.mrcs (VDM->EM)

(i) was fed into ASPIRE-Python to generate class averages using the defaults in LegacyClassAvgSource, yielding:
iv) 10028_var_sorted_cls_avgs_m50_179px_0_105246.mrcs

Thus all the runs should be starting from the exact same preprocessed data (i).

(ii), (iii) and (iv) were then each processed into a 3D volume using an identical default procedure in ASPIRE-Python: CommonLinesSync3N followed by MeanVolumeEstimator using the now-default Dirac basis (similar to MATLAB). FSCs were computed from the best of 3 calls using MATLAB's cryo_compare_volumes:

ii) (MATLAB denoised alignment est->VDM ) 9.9A
iii) (MATLAB denoised alignment est->VDM->EM) 9.7A
iv) (Python direct alignment) 12.7A
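For context on how numbers like these are produced, here is a minimal NumPy sketch of a Fourier shell correlation between two volumes and the conversion from a threshold crossing to a resolution in Angstroms. This is not cryo_compare_volumes (which also aligns the volumes first); the function names, the 0.143 cutoff, and the pixel size used in the toy check are all assumptions for illustration.

```python
import numpy as np


def fsc(vol_a, vol_b):
    """Fourier shell correlation between two cubic volumes of equal size.

    Returns one correlation value per integer-radius shell, from shell 0
    (the DC term) up to but not including the Nyquist shell n // 2.
    """
    n = vol_a.shape[0]
    fa = np.fft.fftshift(np.fft.fftn(vol_a))
    fb = np.fft.fftshift(np.fft.fftn(vol_b))
    grid = np.arange(n) - n // 2  # zero frequency sits at index n // 2
    x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
    shell = np.rint(np.sqrt(x**2 + y**2 + z**2)).astype(int).ravel()
    cross = np.bincount(shell, weights=(fa * np.conj(fb)).real.ravel())
    power_a = np.bincount(shell, weights=(np.abs(fa) ** 2).ravel())
    power_b = np.bincount(shell, weights=(np.abs(fb) ** 2).ravel())
    n_shells = n // 2
    return cross[:n_shells] / np.sqrt(power_a[:n_shells] * power_b[:n_shells])


def resolution_angstrom(curve, pixel_size, box, cutoff=0.143):
    """Resolution at the first shell where the FSC drops below `cutoff`.

    Shell k corresponds to spatial frequency k / (box * pixel_size), so the
    resolution is box * pixel_size / k. If the curve never crosses the
    cutoff, the Nyquist shell is reported instead.
    """
    below = np.flatnonzero(curve < cutoff)
    k = int(below[0]) if below.size else len(curve)
    return float("inf") if k == 0 else box * pixel_size / k


# Toy check: a volume compared with itself correlates perfectly in every shell.
rng = np.random.default_rng(2)
vol = rng.standard_normal((16, 16, 16))
curve = fsc(vol, vol)
res = resolution_angstrom(curve, pixel_size=1.0, box=16)
```

Because the reported resolution depends on the cutoff, on the alignment of the two volumes, and on any masking, numbers from different FSC tools are only loosely comparable, which is part of why all the figures above come from the same MATLAB call.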

End to End comparison

ASPIRE-Python as published in the experiment examples (current develop branch) achieved 12.0A end to end at the time of writing. This is slightly lower than (iv) but within run-to-run variance. The tail end of the pipeline exhibits run-to-run variability for all of the runs in both code bases. It's not entirely clear to me at this time where the Python run-to-run variation is coming from, but we can probably sort that out later.

These runs using the workflow tool pipeline implementation end to end in MATLAB code yielded 7.6A with EM, and 7.7A without EM. The paper reported 8.4A. These numbers are all within the variability of cryo_compare_volumes and can be considered basically the same. I have repeatedly seen the EM call randomly crash, and also produce different results (often worse by about 1A). One thing I can say is that we probably do not need to port the MATLAB 2D EM code. This was also hinted at in discussions with Yoel. If ASPIRE wants 2D EM, it probably merits a fresh start.

Opinion and Next Steps

Overall, it appears there are similar gains (a few A) to be had in the mean volume reconstruction area vs adding the VDM component. Given that the reconstruction area was previously attempted to be implemented in Python but is not performing well, it would be my recommendation to either repair or replace that component first before adding new recon pipeline features.

FSC scores aside, the combination of using the compressed representation and VDM is comparable (slightly slower) in time to the optimized Python (GPU) full size image alignment. Optimistically that implies that an optimized version of the VDM approach could be quite quick with good development.

It is also possible that the MATLAB code isn't actually generating a better reconstruction, but rather, the reconstruction component is written/tuned in such a way that it yields better results using its own cryo_compare_volumes. I can try to test this by bringing the 12A Python run outputs into MATLAB at different stages to see if we get better reports, but this would be considerably more work as mentioned in the opening. I'm not convinced it is worth doing this unless/until a specific challenge point is identified.

However, in that direction, I am running the MATLAB align_main starting with the rotations, reflection, etc saved from MATLAB's own classification on the MATLAB preprocessed image stack. Based on current progress, this will probably take about a week to run in parallel and requires huge amounts of RAM; essentially the entire server (as compared to 4 hours or less for the Python code...). Assuming that succeeds, it will be reported back as well. That would give the closest comparison to what is currently implemented in Python without VDM. It would also yield another datapoint regarding what VDM is bringing to the MATLAB code for this dataset.

I will attempt to attach the results mentioned, but due to GH filesize limits I will probably be limited to just the final volumes. The intermediates will be saved for future reference/use.
