γ-SUP: a clustering algorithm for cryo-electron microscopy images
of asymmetric particles

Annals of Applied Statistics (2013), in press

Cryo-electron microscopy (cryo-EM) has recently emerged as a powerful tool for obtaining three-dimensional (3D) structures of biological macromolecules in native states. A minimum cryo-EM image data set for deriving a meaningful reconstruction comprises thousands of randomly oriented projections of identical particles photographed with a small number of electrons. The computation of a 3D structure from 2D projections requires clustering, which aims to enhance the signal-to-noise ratio in each view by grouping similarly oriented images. Nevertheless, the prevailing clustering techniques are often compromised by three characteristics of cryo-EM data: high noise content, high dimensionality and a large number of clusters. Moreover, since clustering requires registering images of similar orientation into the same pixel coordinates by 2D alignment, it is desirable that the clustering algorithm can label misaligned images as outliers. Herein, we introduce a clustering algorithm, γ-SUP, which models the data with a q-Gaussian mixture, adopts the minimum γ-divergence for estimation, and uses a self-updating procedure to obtain the numerical solution. We apply γ-SUP to the cryo-EM images of two benchmark macromolecules, RNA polymerase II and ribosome. In the former case, simulated images were chosen to decouple clustering from alignment and to demonstrate that γ-SUP is more robust to misalignment outliers than the existing clustering methods used in the cryo-EM community. In the latter case, the clustering of real cryo-EM data by γ-SUP eliminates noise in many views to reveal true structural features of the ribosome at the projection level.
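
The idea of a self-updating procedure can be illustrated with a small sketch. The code below is a simplified stand-in, not the published γ-SUP update: every point repeatedly moves to a weighted mean of all points, with weights that vanish beyond a cutoff distance, so nearby points merge and an isolated point stays where it is and can be read as an outlier. The weight function `1 - d^2 / radius^2` (clipped at zero), the parameter `radius`, the helper names and the toy data are all assumptions made for illustration.

```python
import numpy as np

def self_update(points, radius, n_iter=200, tol=1e-10):
    """Simplified self-updating process (illustrative, not the published gamma-SUP rule).

    Each point moves to a weighted mean of all points. Weights have compact
    support: points farther apart than `radius` do not influence each other.
    """
    x = np.asarray(points, dtype=float).copy()
    for _ in range(n_iter):
        diff = x[:, None, :] - x[None, :, :]
        d2 = np.sum(diff ** 2, axis=-1)                  # squared pairwise distances
        w = np.clip(1.0 - d2 / radius ** 2, 0.0, None)   # zero weight beyond radius
        new_x = (w @ x) / w.sum(axis=1, keepdims=True)   # simultaneous update of all points
        converged = np.max(np.abs(new_x - x)) < tol
        x = new_x
        if converged:
            break
    return x

def label_converged(x, tol=1e-6):
    """Give the same label to points that ended up at (numerically) the same location."""
    centers, labels = [], []
    for p in x:
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < tol:
                labels.append(k)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return np.array(labels)

# Toy data (hypothetical): two tight groups and one far-away point.
data = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.4, 0.4],   # group A
    [10.0, 10.0], [10.5, 10.0], [10.0, 10.5],          # group B
    [50.0, 50.0],                                      # isolated point
])
final = self_update(data, radius=3.0)
labels = label_converged(final)
cluster_sizes = np.bincount(labels)
print(labels, cluster_sizes)
```

In this sketch the number of clusters is not fixed in advance; it emerges from the cutoff distance, and a cluster of size one marks a point that no other point pulled on.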


Robust independent component analysis via minimum
γ-divergence estimation

IEEE Journal of Selected Topics in Signal Processing (2013) 7 (4): 614-624

Independent component analysis (ICA) has been shown to be useful in many applications. However, most ICA methods are sensitive to data contamination and outliers. In this article we introduce a general minimum U-divergence framework for ICA, which covers some standard ICA methods as special cases. Within the U-family we further focus on the γ-divergence due to its desirable property of super robustness, which gives the proposed method γ-ICA. Statistical properties and technical conditions for the consistency of γ-ICA are rigorously studied. In the limiting case, it leads to a necessary and sufficient condition for the consistency of MLE-ICA. This necessary and sufficient condition is weaker than the condition known in the literature. Since the parameter of interest in ICA is an orthogonal matrix, a geometrical algorithm based on gradient flows on the special orthogonal group is introduced to implement γ-ICA. Furthermore, a data-driven selection for the γ value, which is critical to the achievement of γ-ICA, is developed. The performance, especially the robustness, of γ-ICA in comparison with standard ICA methods is demonstrated through experimental studies using simulated data and image data.
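
The robustness that motivates the γ-divergence can be seen in a much simpler setting than ICA. The sketch below is an illustration under stated assumptions, not the γ-ICA algorithm: it estimates the mean of a Gaussian with known scale `sigma` by minimum γ-divergence, which for this location model reduces to a fixed-point iteration with weights proportional to the model density raised to the power γ. The function name `gamma_location`, the starting value (the median) and the toy data are illustrative choices.

```python
import numpy as np

def gamma_location(x, gamma=0.5, sigma=1.0, n_iter=100, tol=1e-12):
    """Minimum gamma-divergence estimate of a Gaussian mean with known scale.

    Assumption for illustration: the scale sigma is known, so the estimating
    equation reduces to sum_i w_i * (x_i - mu) = 0 with
    w_i = exp(-gamma * (x_i - mu)^2 / (2 * sigma^2)).
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                       # robust starting value
    for _ in range(n_iter):
        w = np.exp(-gamma * (x - mu) ** 2 / (2.0 * sigma ** 2))
        new_mu = np.sum(w * x) / np.sum(w)  # weighted-mean fixed-point step
        done = abs(new_mu - mu) < tol
        mu = new_mu
        if done:
            break
    return mu

# Toy data (hypothetical): five inliers around 0 and two gross outliers.
sample = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 100.0, 120.0])
mu_gamma = gamma_location(sample, gamma=0.5, sigma=1.0)
mu_plain = sample.mean()
print(mu_gamma, mu_plain)
```

Observations far from the current estimate receive weights that are numerically zero, so they stop influencing the result, whereas the ordinary mean is dragged toward them. Larger γ downweights more aggressively, which is one reason a data-driven choice of γ matters.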


On multilinear principal component analysis of order-two tensors

Biometrika (2012) 99 (3): 569-583

Principal component analysis is commonly used for dimension reduction in analyzing high-dimensional data. Multilinear principal component analysis aims to serve a similar function for analyzing tensor-structured data, and has empirically been shown to be effective in reducing dimensionality. In this paper, we investigate its statistical properties and demonstrate its advantages. Conventional principal component analysis, which vectorizes the tensor data, may lead to inefficient and unstable prediction due to the often extremely large dimensionality involved. Multilinear principal component analysis, in trying to preserve the data structure, searches for low-dimensional projections and, thereby, decreases dimensionality more efficiently. Asymptotic theory of order-two multilinear principal component analysis, including asymptotic efficiency and distributions of principal components, associated projections, and the explained variance, is developed. A test of dimensionality is also proposed. Finally, multilinear principal component analysis is shown to improve on conventional principal component analysis in analyzing the Olivetti faces data set, which is achieved by extracting a more modularly-oriented basis set in reconstructing the test faces.
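
For matrix-valued (order-two) data, multilinear principal component analysis is commonly computed by alternating between a row-side and a column-side eigenproblem. The sketch below shows that standard alternating scheme under illustrative assumptions: the function names, the identity-based initialization, the fixed iteration count and the simulated data are not taken from the paper.

```python
import numpy as np

def top_eigvecs(S, k):
    """Return the k leading eigenvectors of a symmetric matrix S as columns."""
    _, vecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]

def mpca_order2(X, p0, q0, n_iter=20):
    """Alternating-eigenproblem sketch of order-two multilinear PCA.

    X has shape (n, p, q). Returns the sample mean and orthonormal bases
    A (p x p0) and B (q x q0) so that A.T @ (X_i - mean) @ B is the reduced core.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    q = Xc.shape[2]
    B = np.eye(q)[:, :q0]              # simple initial guess (an assumption)
    for _ in range(n_iter):
        XB = Xc @ B                                             # (n, p, q0)
        A = top_eigvecs(np.einsum('nik,njk->ij', XB, XB), p0)   # sum of X B B^T X^T
        AX = A.T @ Xc                                           # (n, p0, q)
        B = top_eigvecs(np.einsum('nki,nkj->ij', AX, AX), q0)   # sum of X^T A A^T X
    return mean, A, B

# Simulated matrices (hypothetical) with an exact 2 x 2 core structure.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 2)))
V, _ = np.linalg.qr(rng.standard_normal((6, 2)))
cores = rng.standard_normal((50, 2, 2))
X = U @ cores @ V.T                                             # shape (50, 8, 6)

mean, A, B = mpca_order2(X, p0=2, q0=2)
recon = mean + A @ (A.T @ (X - mean) @ B) @ B.T
rel_err = np.linalg.norm(recon - X) / np.linalg.norm(X)
print(rel_err)
```

Here each 8 x 6 observation is summarized by a 2 x 2 core plus two small basis matrices, whereas vectorizing first would require principal directions in a 48-dimensional space.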


Regulation of mammalian transcription by Gdown1 through
a novel steric crosstalk revealed by cryo-EM

EMBO Journal (2012) 31 (17): 3575-87

In mammals, a distinct RNA polymerase II form, RNAPII(G), contains a novel subunit Gdown1 (encoded by POLR2M), which represses gene activation, only to be reversed by the multisubunit Mediator co-activator. Here, we employed single-particle cryo-electron microscopy (cryo-EM) to disclose the architectures of RNAPII(G), RNAPII and RNAPII in complex with the transcription initiation factor TFIIF, all to ~19 Å. Difference analysis mapped Gdown1 mostly to the RNAPII Rpb5 shelf-Rpb1 jaw, supported by antibody labelling experiments. These structural features correlate with the moderate increase in the efficiency of RNA chain elongation by RNAPII(G). In addition, our updated RNAPII–TFIIF map showed that TFIIF tethers multiple regions surrounding the DNA-binding cleft, in agreement with cross-linking and biochemical mapping. Gdown1’s binding sites overlap extensively with those of TFIIF, with Gdown1 sterically excluding TFIIF from RNAPII, herein demonstrated by competition assays using size exclusion chromatography. In summary, our work establishes a structural basis for Gdown1 impeding initiation at promoters, by obstruction of TFIIF, accounting for an additional dependent role of Mediator in activated transcription.


Toward automated denoising of single molecular Förster
resonance energy transfer data

Journal of Biomedical Optics (2012) 17 (1): 011007

A wide-field two-channel fluorescence microscope is a powerful tool as it allows for the study of conformation dynamics of hundreds to thousands of immobilized single molecules by Förster resonance energy transfer (FRET) signals. To date, the data reduction from a movie to a final set containing meaningful single-molecule FRET (smFRET) traces involves human inspection and intervention at several critical steps, greatly hampering the efficiency at the post-imaging stage. To facilitate the data reduction from smFRET movies to smFRET traces and to address the noise-limited issues, we developed a statistical denoising system toward fully automated processing. This data reduction system embeds several novel approaches. First, for background subtraction, the high-order singular value decomposition (HOSVD) method is employed to extract spatial and temporal features. Second, to register and map the two color channels, the spots representing bleed-through from the donor channel to the acceptor channel are used. Finally, correlation analysis and a likelihood ratio statistic for change point detection (CPD) are developed to study the two channels simultaneously, resolve FRET states, and report the dwelling time of each state. The performance of our method has been checked using both simulated and real data.
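
A likelihood ratio statistic for change point detection can be sketched in its simplest form. The abstract does not specify the noise model or the exact statistic, so the code below assumes a single change in the mean of a trace with independent Gaussian noise of known standard deviation `sigma`; the function name `lr_change_point` and the synthetic trace are hypothetical.

```python
import numpy as np

def lr_change_point(y, sigma=1.0):
    """Locate a single mean shift with a Gaussian likelihood ratio statistic.

    Assumptions for illustration: independent Gaussian noise with known sigma
    and exactly one change point. For a split after k samples, twice the log
    likelihood ratio is k * (n - k) / n * (mean_left - mean_right)^2 / sigma^2.
    Returns (k, statistic), where k is the length of the first segment.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    k = np.arange(1, n)                          # candidate split positions
    csum = np.cumsum(y)
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    stat = k * (n - k) / n * (left_mean - right_mean) ** 2 / sigma ** 2
    best = int(np.argmax(stat))
    return int(k[best]), float(stat[best])

# Synthetic trace (hypothetical): a level near 0.2 for 50 frames, then near 0.8,
# with a small deterministic ripple standing in for noise.
t = np.arange(100)
trace = np.where(t < 50, 0.2, 0.8) + 0.02 * np.sin(t)
k_hat, lr_stat = lr_change_point(trace, sigma=0.05)
print(k_hat, lr_stat)
```

A trace with several states could be handled by applying the same test recursively to each segment and stopping when the statistic falls below a chosen threshold; the dwell time of a state is then the length of its segment.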


Zernike phase plate cryoelectron microscopy facilitates single
particle analysis of unstained asymmetric protein complexes

Structure (2010) 18 (1): 17-27

Single particle reconstruction from cryoelectron microscopy images, though emerging as a powerful means in structural biology, faces challenges when applied to asymmetric proteins smaller than megadaltons due to low contrast. A Zernike phase plate can improve the contrast by restoring the microscope contrast transfer function. Here, by exploiting simulated Zernike and conventional defocused cryoelectron microscope images with noise characteristics comparable to those of experimental data, we quantified the efficiencies of the steps in single particle analysis of ice-embedded RNA polymerase II (500 kDa), transferrin receptor complex (290 kDa), and T7 RNA polymerase lysozyme (100 kDa). Our results show that Zernike phase plate imaging is more effective for particle identification and for sorting of orientations, conformations, and compositions. Moreover, our analysis of image alignment indicates that a Zernike phase plate can, in principle, reduce the number of particles required to attain near-atomic resolution by 10–100 fold for proteins between 100 kDa and 500 kDa.