Statistical Analysis for Biological Image Data

Dimension Reduction and Image Clustering

Over the last decade, cryo-electron microscopy (cryo-EM) has emerged as a powerful tool for obtaining high-resolution three-dimensional (3-D) structures of biological macromolecules. Traditionally, 3-D structures have been determined efficiently by X-ray crystallography when a sufficiently large crystal can be obtained, so that signals can be recorded from the spots in the diffraction pattern. However, not every molecule can form a crystal, and sometimes it is beneficial to study a molecule as a single particle rather than in a crystal.

In contrast to X-ray crystallography, cryo-EM does not require crystals; it images dispersed biological molecules embedded in a thin layer of vitreous ice. The electron beam transmitted through the specimen generates two-dimensional (2-D) projections of these freely oriented molecules, and the 3-D structure can be obtained by back projection, provided that the angular relationships among the projections are determined.

Because biological molecules are highly sensitive to electron beams, only very limited doses can be used for imaging, which yields a barely recognizable image of any individual molecule. When the molecule has high symmetry, as an icosahedral virus does, the symmetry facilitates image processing and allows an atomic-resolution structure to be attained. Denoising a particle of low or no symmetry, however, is considerably more challenging, as it requires aligning and clustering many images of the same orientation for averaging.

This task usually requires collecting a very large number of images, because each cluster must contain enough images to attain near-atomic resolution. The images are projections of copies of the same molecule in free orientations, with random translations and rotations. To determine the angular relationships among orientations for the 3-D reconstruction, it is crucial to align the images and cluster them into classes of similar orientation. The alignment registers the images into a common molecular coordinate system, and the representative image of each cluster, obtained by averaging, achieves some level of denoising.
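
The denoising effect of averaging can be illustrated with a small simulation. The sketch below is only an illustration under simplifying assumptions: the images are already perfectly aligned, the noise is additive, independent and Gaussian, and the "molecule" is a hypothetical disc-shaped projection. The function name `class_average` and all parameter values are chosen for this example and do not come from any cryo-EM package. Under these assumptions, averaging n images should reduce the noise standard deviation by roughly a factor of sqrt(n).

```python
import numpy as np


def class_average(images):
    """Average a stack of aligned images with shape (n, height, width)."""
    return np.asarray(images, dtype=float).mean(axis=0)


rng = np.random.default_rng(0)

# Hypothetical clean projection: a bright disc on a dark background.
yy, xx = np.mgrid[-16:16, -16:16]
clean = (xx**2 + yy**2 < 64).astype(float)

# Mimic low-dose imaging: n noisy copies, noise std well above the signal.
n, sigma = 400, 5.0
noisy = clean + rng.normal(scale=sigma, size=(n, *clean.shape))

avg = class_average(noisy)

# Compare the residual noise of one image with that of the class average.
single_err = np.std(noisy[0] - clean)
avg_err = np.std(avg - clean)
print(f"noise std, single image : {single_err:.3f}")
print(f"noise std, class average: {avg_err:.3f}")
print(f"sigma / sqrt(n)         : {sigma / np.sqrt(n):.3f}")
```

In real data the images in a cluster are neither perfectly aligned nor of exactly the same orientation, so the gain is smaller than this idealized rate, which is one reason the quality of alignment and clustering matters.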

We have investigated two related statistical problems: dimension reduction and clustering. We employed multilinear principal component analysis (MPCA) to reduce the dimension of thousands of cryo-EM images simultaneously, and the clustering algorithm γ-SUP to cluster the images. This approach not only leads to satisfactory clustering accuracy but also saves a substantial amount of computation time.
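
To make the dimension-reduction step concrete, the following is a minimal sketch of multilinear principal component analysis (MPCA) for a stack of 2-D images, using the common alternating eigendecomposition scheme: each image X_i is compressed to a small core matrix U^T (X_i - mean) V, with U acting on rows and V on columns. It is not the authors' implementation. The function name `mpca_2d`, the target dimensions `p0` and `q0`, the identity-based initialization, and the fixed iteration count are all assumptions made for illustration, and the input here is random data rather than real micrographs.

```python
import numpy as np


def mpca_2d(images, p0, q0, n_iter=10):
    """Alternating-eigendecomposition sketch of MPCA for 2-D images.

    images : array of shape (n, p, q)
    Returns the mean image, U (p, p0), V (q, q0) with orthonormal columns,
    and the core scores of shape (n, p0, q0).
    """
    X = np.asarray(images, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                        # centered images
    n, p, q = Xc.shape
    V = np.eye(q)[:, :q0]                # simple initialization (assumption)
    for _ in range(n_iter):
        # Fix V, update U: top eigenvectors of sum_i Xc_i V V^T Xc_i^T.
        XV = Xc @ V                                   # (n, p, q0)
        Su = np.einsum("nik,njk->ij", XV, XV)         # (p, p)
        U = np.linalg.eigh(Su)[1][:, ::-1][:, :p0]    # eigh sorts ascending
        # Fix U, update V: top eigenvectors of sum_i Xc_i^T U U^T Xc_i.
        XU = np.einsum("nij,ik->nkj", Xc, U)          # (n, p0, q) = U^T Xc_i
        Sv = np.einsum("nki,nkj->ij", XU, XU)         # (q, q)
        V = np.linalg.eigh(Sv)[1][:, ::-1][:, :q0]
    scores = np.einsum("ia,nij,jb->nab", U, Xc, V)    # U^T Xc_i V
    return mean, U, V, scores


# Usage on synthetic data: 200 images of 32 x 32 pixels reduced to 5 x 5 cores.
rng = np.random.default_rng(0)
images = rng.normal(size=(200, 32, 32))
mean, U, V, scores = mpca_2d(images, p0=5, q0=5)
features = scores.reshape(len(images), -1)   # one 25-dim vector per image
print(features.shape)
```

Because U and V are estimated from p x p and q x q matrices rather than from a (pq) x (pq) covariance matrix, the eigenproblems stay small even for large images, which is consistent with the computational savings described above. The flattened `features` array is the kind of low-dimensional input that a clustering step would then operate on.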