gmm-rec is software for inferring initial models in cryo-electron microscopy. It is an implementation of the algorithm described in
Joubert, P., and Habeck, M. 2015. Bayesian inference of initial models in cryo-electron microscopy using pseudo-atoms. Biophysical Journal. 108:1165-1175.
gmm-rec can be executed from the command line as gmm-rec.py. Three example datasets are included.
The software consists of Python and Cython code. It was developed and tested on Ubuntu 14.04. To install, first install all the dependencies below, then follow the installation instructions.
The following list of programs should be installed first. The versions given in parentheses are recent enough; older versions may also suffice.
Required:
Optional:
Extract the code to a directory DIR of your choice. Compile and install by running from the console:
    cd DIR/gmm-rec
    python setup.py build
    sudo python setup.py install
Ignore the multiple Cython warnings.
In the directory DIR/gmm-rec/examples are three datasets for testing the algorithm.
As an example, to infer an initial model for RNA polymerase II, run:
    cd DIR/gmm-rec/examples/pol2
    ./rec-script.sh
The script calls the main command line program gmm-rec.py with appropriate algorithmic parameters. It also calls gmm-plot-stack.py and gmm-plot-stats.py to create images and graphs from the results.
The algorithm takes several minutes to complete. The following output files are produced:
- coarse.mrc: The final model evaluated on a regular three-dimensional grid. Can be viewed with Chimera.
- coarse_mirror.mrc: Mirror image of the final model, i.e. with the opposite handedness.
- coarse_mixture.csv: Parameters of the final pseudo-atomic model.
- coarse_transformations.csv: The final rotations and in-plane translations.
- coarse_stats.csv: The pseudo-atom size (std) and the log-posterior at each step.
- coarse_input.png: Data used as input to the algorithm.
- coarse_initial_projections.png: Projections of the random model used to initialise the algorithm, along the initial random projection directions.
- coarse_final_projections.png: Projections of the final model. This is just a single sample from the posterior, not the posterior mean. If the algorithm was successful, these images should be similar to coarse_input.png.
- coarse_stats_std.png: Figure using the std column from coarse_stats.csv.
- coarse_stats_log-posterior.png: Figure using the log-posterior column from coarse_stats.csv.
These are the files for the initial stage (coarse). Replace coarse by refine for the files produced during the refinement stage.
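The per-step statistics can also be inspected without the plotting script. The following is a minimal sketch, assuming that coarse_stats.csv is comma-separated and has a header row with columns named std and log-posterior. The column names are taken from the figure descriptions above, but the exact file layout is an assumption, so check your own file first. The helper name summarize_stats is hypothetical and not part of gmm-rec.

```python
import csv
import io


def summarize_stats(csv_file):
    """Return (final_std, best_step, best_log_posterior) from a stats file.

    Assumes a header row with 'std' and 'log-posterior' columns.
    """
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("stats file contains no data rows")
    log_posts = [float(row["log-posterior"]) for row in rows]
    # Index of the step with the highest log-posterior.
    best_step = max(range(len(rows)), key=lambda i: log_posts[i])
    final_std = float(rows[-1]["std"])
    return final_std, best_step, log_posts[best_step]


# Synthetic data standing in for a real coarse_stats.csv.
example = io.StringIO("std,log-posterior\n8.0,-1500.0\n6.0,-1200.0\n4.0,-1250.0\n")
final_std, best_step, best_lp = summarize_stats(example)
print(final_std, best_step, best_lp)
```

For a real run, pass an open file instead of the synthetic data, e.g. open("coarse_stats.csv", newline="").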
Modify the script to experiment with different reconstruction parameters, or run the program gmm-rec directly from the command line as gmm-rec.py.
The phantom dataset is the same one used by
Jaitly, N., et al. 2010. A Bayesian method for 3D macromolecular structure inference using class average images from single particle electron microscopy. Bioinformatics. 26:2406-2415.
It consists of 50 images of size 32x32. The algorithm takes about 2 minutes for the initial part, and 3 minutes for the refinement part.
The pol2 dataset is simulated data of RNA polymerase II. It consists of 25 class averages of size 50x50. The algorithm takes about 3 minutes for the initial part, and 2.5 minutes for the refinement.
The ribosome dataset consists of class averages of the 70S ribosome, obtained using RELION from the same raw data as used in our paper. The RELION class averaging step includes full CTF correction. There are 22 images of size 130x130. The algorithm takes about 15 minutes for the initial part, and 3 minutes for the refinement. The initial part infers multiple initial models in parallel, and then refines only the one with the highest log-posterior.
Type
gmm-rec.py --help
for a brief description of the various options. See also the rec-script.sh scripts for examples. Here we describe the different options in more depth.
The output mrc file that will contain the final reconstruction is specified using -o and is of the form NAME.mrc. This NAME is used to form the filenames of other output files as well.
The algorithm consists of an initial stage and a refinement stage. The flag -g specifies that we're using the initial stage, with global rotation optimization. If -g is set, then you should also set the number of random rotations (-r), and specify the number of mixture and transformation sampling steps as additional arguments to -s (see below).
An mrc stack with the input images is given as the first argument to gmm-rec. These images can be resized to any shape (d, d) prior to reconstruction by specifying -d in Angstrom. The input pixel size of the original images is specified in Angstrom using -a. The resized images that are used as input to the algorithm are stored in NAME_input.mrc.
The prior parameters can be specified using --prior-mean, --prior-variance and --prior-weights. The default values of these parameters should work in most cases; see the paper for details.
The number of counts per image is specified by -N. The number of components in the mixture model is -K.
After running the algorithm, the final mixture and transformations are stored in csv files (NAME_mixture.csv and NAME_transformations.csv). These can be used as initial mixtures or transformations for subsequent reconstructions using -m or -t. If not specified, the initial mixture and/or transformations will be sampled from the prior. To fix the mixture and/or transformations during reconstruction, use the flags -M and -T.
The number of steps is specified using -s. Certain parameters such as the log-posterior and component size are recorded for each step in NAME_stats.csv. When using the initial stage (i.e. when -g is specified), -s can take 3 parameters: the total number of sampling steps, followed by the number of mixture sampling steps and the number of transformation sampling steps.
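To show how the options for the initial stage fit together, here is a small sketch that only assembles and prints a command line. It does not run anything. It uses the flags described in this section (-o, -g, -r, -s) and places the input stack first. All numeric values and file names are hypothetical placeholders, not recommended settings, and the helper build_initial_stage_command is not part of gmm-rec. Whether the option parser accepts the arguments in exactly this order is an assumption; compare with rec-script.sh before relying on it.

```python
import shlex


def build_initial_stage_command(stack, output, total_steps, mixture_steps,
                                transformation_steps, random_rotations):
    """Assemble an argument list for an initial-stage (-g) run of gmm-rec.py."""
    return [
        "gmm-rec.py",
        stack,                        # input mrc stack is the first argument
        "-o", output,                 # NAME.mrc; NAME prefixes the other outputs
        "-g",                         # initial stage, global rotation optimization
        "-r", str(random_rotations),  # number of random rotations
        # -s: total steps, then mixture and transformation sampling steps
        "-s", str(total_steps), str(mixture_steps), str(transformation_steps),
    ]


# Hypothetical placeholder values, chosen only for illustration.
cmd = build_initial_stage_command("images.mrc", "coarse.mrc", 100, 5, 5, 1000)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments makes it easy to pass to subprocess.run later without extra shell quoting.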
Multiple initial models can be inferred in parallel by setting -p to a value higher than the default value of 1. The result is the model with the highest log-posterior, but the other models can be inspected as well. This is typically used during the initial stage; see the ribosome dataset for an example.
The threshold at which to truncate Gaussians for computational efficiency is specified by --std-range. The default value is 5, meaning that each Gaussian will only be evaluated at grid points within 5 times the standard deviation from the mean.
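The effect of the truncation can be illustrated in one dimension. This is a simplified sketch of the idea, not the Cython code used by gmm-rec: an unnormalised Gaussian is evaluated only at integer grid points within std_range standard deviations of its mean, and all other grid points are left at zero. The function name and the grid layout are assumptions made for the example.

```python
import math


def truncated_gaussian_on_grid(mean, std, grid_size, std_range=5.0):
    """Evaluate an unnormalised 1D Gaussian on grid points 0..grid_size-1,
    skipping points farther than std_range * std from the mean."""
    values = [0.0] * grid_size
    cutoff = std_range * std
    # Only loop over the grid points inside the cutoff window.
    lo = max(0, math.ceil(mean - cutoff))
    hi = min(grid_size - 1, math.floor(mean + cutoff))
    for x in range(lo, hi + 1):
        values[x] = math.exp(-0.5 * ((x - mean) / std) ** 2)
    return values


values = truncated_gaussian_on_grid(mean=50.0, std=2.0, grid_size=100)
evaluated = sum(1 for v in values if v > 0.0)
print(evaluated, "of", len(values), "grid points evaluated")
```

With a small standard deviation relative to the grid, most grid points are skipped, which is where the saving comes from. The neglected values are at most exp(-12.5) times the peak for a range of 5.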
After reconstruction, the final mixture is evaluated on a 3D grid with grid spacing given by -A in Angstrom, to be viewed in Chimera. Along with the final mixture NAME.mrc, a structure with the opposite handedness is stored in NAME_mirror.mrc. Both of these are equally likely given the available data.
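Why projection images cannot distinguish the two hands can be seen in a small example. The sketch below uses a few hypothetical pseudo-atom centres and shows the simplest case: mirroring through the plane perpendicular to the projection direction leaves the projection unchanged. It is an illustration only and does not reproduce how gmm-rec computes NAME_mirror.mrc.

```python
def mirror_z(points):
    """Reflect pseudo-atom centres through the xy-plane, flipping handedness."""
    return [(x, y, -z) for (x, y, z) in points]


def project_along_z(points):
    """Orthographic projection along z: keep only the in-plane coordinates."""
    return sorted((x, y) for (x, y, _z) in points)


# Hypothetical pseudo-atom centres forming a chiral arrangement.
centres = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
mirrored = mirror_z(centres)

# The two structures differ in 3D but give the same 2D projection.
print(project_along_z(centres) == project_along_z(mirrored))
```

For other projection directions the same argument applies after a change of orientation: each projection of the mirrored structure matches a projection of the original structure taken from a different direction.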
Mrc files representing stacks of images can be viewed by converting them to PNG files using gmm-plot-stack.py.
Finally, the verbosity level can be adjusted by using either -v or -vv, or by omitting both to suppress all output. Note that the time remaining will initially be overestimated, because the estimate assumes that the component standard deviation (std) stays fixed.
gmm-rec was written by Paul Joubert. It is released under the terms of the MIT License.