2SDR: Applying Kronecker Envelope PCA to denoise Cryo-EM Images

arXiv Preprint (2019)

Principal component analysis (PCA) is arguably the most widely used dimension reduction method for vector-type data. When applied to image data, PCA requires that the images be represented as vectors. The resulting computation is heavy because vectorization produces a huge covariance matrix whose eigenvalue problem must then be solved. To mitigate the computational burden, multilinear PCA (MPCA), which generates each basis vector as the Kronecker product of a column vector and a row vector, was introduced, and its success was demonstrated on face image sets.

However, when we apply MPCA to cryo-electron microscopy (cryo-EM) particle images, the results are not satisfactory compared with those of PCA. On the other hand, to compare the reduced spaces as well as the number of parameters of MPCA and PCA, Kronecker Envelope PCA (KEPCA) was proposed to provide a PCA-like basis from MPCA. Here, we apply KEPCA to denoise cryo-EM images through a two-stage dimension reduction (2SDR) algorithm. 2SDR first applies MPCA to extract the projection scores and then applies PCA to these scores to further reduce the dimension. 2SDR has two benefits: it inherits the computational advantage of MPCA, and its projection scores are uncorrelated, like those of PCA. Tests on three benchmark experimental cryo-EM datasets show that 2SDR performs better than MPCA or PCA alone in terms of computational efficiency and denoising quality. Remarkably, the denoised particles boxed out from the 2SDR-denoised micrographs allow subsequent structural analysis to reach a high-quality 3D density map. This demonstrates that high-resolution information can be well preserved by the 2SDR denoising strategy.
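The two stages described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`top_eigvecs`, `mpca_order2`, `two_stage_dr`), the alternating eigen-decomposition used for MPCA, the identity-based initialization, and the fixed iteration count are all choices made for this example.

```python
import numpy as np

def top_eigvecs(S, k):
    """Return the k leading eigenvectors of a symmetric matrix S as columns."""
    _, V = np.linalg.eigh(S)          # eigh sorts eigenvalues in ascending order
    return V[:, ::-1][:, :k]

def mpca_order2(X, p, q, n_iter=20):
    """Order-two MPCA by alternating eigen-decompositions (illustrative).

    X has shape (n, rows, cols). Returns the mean image, a row-side basis
    A (rows x p) and a column-side basis B (cols x q).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    B = np.eye(Xc.shape[2])[:, :q]    # assumed initialization
    for _ in range(n_iter):
        XB = Xc @ B                                   # (n, rows, q)
        A = top_eigvecs(np.einsum('nij,nkj->ik', XB, XB), p)
        XA = np.transpose(Xc, (0, 2, 1)) @ A          # (n, cols, p)
        B = top_eigvecs(np.einsum('nij,nkj->ik', XA, XA), q)
    return mean, A, B

def two_stage_dr(X, p, q, k):
    """Stage 1: MPCA scores. Stage 2: PCA on the vectorized scores."""
    n = len(X)
    mean, A, B = mpca_order2(X, p, q)
    U = A.T @ (X - mean) @ B                          # (n, p, q) MPCA scores
    scores = U.reshape(n, -1)                         # (n, p*q), already centered
    G = top_eigvecs(scores.T @ scores / n, k)         # PCA on a small p*q matrix
    Z = scores @ G                                    # (n, k) uncorrelated scores
    U_hat = (Z @ G.T).reshape(n, p, q)
    X_hat = mean + A @ U_hat @ B.T                    # denoised reconstruction
    return X_hat, Z
```

The second-stage eigenproblem is only of size p*q rather than rows*cols, which is where the computational saving over PCA on vectorized images comes from in this sketch.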


The generalized degrees of freedom of multilinear principal component analysis

Journal of Multivariate Analysis (2019), 173:26–37

Tensor data, such as image sets, movie data, gene–environment interactions, or gene–gene interactions, have become a popular data format in many fields. Multilinear Principal Component Analysis (MPCA) has been recognized as an efficient dimension reduction method for tensor data analysis. However, a satisfactory rank selection method for general application of MPCA is not yet available. For example, both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), arguably two of the most commonly used model selection methods, require stricter model assumptions when applied to rank selection in MPCA.

In this paper, we propose a rank selection rule for MPCA based on the minimum risk criterion and Stein’s unbiased risk estimate (SURE). We derive a neat formula under minimal model assumptions for MPCA. It is composed of a residual sum of squares for model fitting and a penalty on the model complexity, referred to as the generalized degrees of freedom (GDF). We attribute each term in the GDF to either the number of parameters used in the model or the complexity of separating the signal from the noise. Compared with AIC, BIC, and their modifications, this criterion reaches higher accuracy in a thorough simulation study. Importantly, it has potential for more general application because it makes fewer model assumptions.
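For orientation, the generic form of SURE that such a criterion builds on can be written as follows. This is the standard textbook statement for a Gaussian mean model, with notation (y, μ, σ², n) chosen here for illustration; the paper's specific GDF formula for MPCA is not reproduced.

```latex
% Assumed model: y \sim N(\mu, \sigma^2 I_n), with a weakly differentiable estimator \hat\mu(y).
\mathrm{SURE}(\hat\mu)
  = \underbrace{\lVert y - \hat\mu(y) \rVert^2}_{\text{residual sum of squares}}
  \;+\; 2\sigma^2 \underbrace{\sum_{i=1}^{n} \frac{\partial \hat\mu_i(y)}{\partial y_i}}_{\text{divergence of } \hat\mu}
  \;-\; n\sigma^2 ,
\qquad
\mathbb{E}\bigl[\mathrm{SURE}(\hat\mu)\bigr] = \mathbb{E}\,\lVert \hat\mu(y) - \mu \rVert^2 .
% The generalized degrees of freedom is the expected divergence:
\mathrm{GDF}
  = \mathbb{E}\Bigl[\sum_{i=1}^{n} \frac{\partial \hat\mu_i(y)}{\partial y_i}\Bigr]
  = \frac{1}{\sigma^2} \sum_{i=1}^{n} \operatorname{Cov}\bigl(\hat\mu_i(y),\, y_i\bigr).
```

Under this form, selecting the rank amounts to minimizing the residual sum of squares plus a penalty of 2σ² times the divergence term, since the constant nσ² does not depend on the rank.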


Deriving a sub-nanomolar affinity peptide from TAP to enable smFRET analysis of RNA polymerase II complexes

Methods (2019), 159-160:59-69

Our capability to visualize protein complexes such as RNA polymerase II (pol II) by single-molecule imaging techniques has largely been hampered by the absence of a simple bio-orthogonal approach for selective labeling with a fluorescent probe. Here, we modify the existing calmodulin-binding peptide (CBP) in the widely used Tandem Affinity Purification (TAP) tag to endow it with a high affinity for calmodulin (CaM) and use dye-CaM to conduct site-specific labeling of pol II.

To demonstrate the single molecule applicability of this approach, we labeled the C-terminus of the Rpb9 subunit of pol II with donor-CaM and a site in TFIIF with an acceptor to generate a FRET (fluorescence resonance energy transfer) pair in the pol II-TFIIF complex. We then used total internal reflection fluorescence microscopy (TIRF) with alternating excitation to measure the single molecule FRET (smFRET) efficiency between these two sites in pol II-TFIIF. We found they exhibited a proximity consistent with that observed in the transcription pre-initiation complex by cryo-electron microscopy (cryo-EM). We further compared our non-covalent labeling approach with an enzyme-enabled covalent labeling method. The virtually indistinguishable results validate our smFRET approach and show that the observed proximity between the two sites represents a hallmark of the pol II-TFIIF complex. Taken together, we present a simple and versatile bio-orthogonal method derived from TAP to enable selective labeling of a protein complex. This method is suitable for analyzing dynamic relationships among proteins involved in transcription and it can be readily extended to many other biological processes.


On the strengths of the self-updating process clustering algorithm

Journal of Statistical Computation and Simulation (2015), 86:1010-1031

The self-updating process (SUP) is a clustering algorithm that takes the viewpoint of the data points and simulates the process by which data points move and cluster themselves. It is an iterative process on the sample space and allows for both time-varying and time-invariant operators.

By simulations and comparisons, this paper shows that SUP is particularly competitive in clustering (i) data with noise, (ii) data with a large number of clusters, and (iii) unbalanced data. When noise is present in the data, SUP is able to isolate the noise data points while performing clustering simultaneously. The local-updating property enables SUP to handle data with a large number of clusters and data of various structures. We also show that the blurring mean-shift is a static SUP. Therefore, our discussion of the strengths of SUP also applies to the blurring mean-shift.
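Since the abstract identifies the blurring mean-shift as a static SUP, a minimal sketch of that special case may help make the "data points move" idea concrete. The Gaussian kernel, the `bandwidth` parameter, the stopping rule, and the helper `labels_from_positions` are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def blurring_mean_shift(X, bandwidth, n_iter=50, tol=1e-8):
    """Move every point to the kernel-weighted mean of all current points.

    The same (time-invariant) Gaussian kernel is used at every iteration,
    and the data set itself is updated, which is what makes it "blurring".
    """
    Y = np.asarray(X, dtype=float).copy()
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        W = np.exp(-d2 / (2.0 * bandwidth ** 2))      # pairwise influence
        Y_new = (W @ Y) / W.sum(axis=1, keepdims=True)
        moved = np.abs(Y_new - Y).max()
        Y = Y_new
        if moved < tol:                               # points have stopped moving
            break
    return Y

def labels_from_positions(Y, merge_tol=1e-3):
    """Assign one label per group of points that ended at the same location."""
    labels = np.empty(len(Y), dtype=int)
    centers = []
    for i, y in enumerate(Y):
        for j, c in enumerate(centers):
            if np.linalg.norm(y - c) < merge_tol:
                labels[i] = j
                break
        else:
            centers.append(y)
            labels[i] = len(centers) - 1
    return labels
```

In this sketch, a point far from every other point receives negligible weight from the rest and stays where it is, so it ends up as its own singleton cluster; this mirrors the noise-isolation behaviour described above.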


γ-SUP: a clustering algorithm for cryo-electron microscopy images of asymmetric particles

Annals of Applied Statistics (2014), 8(1):259-285

Cryo-electron microscopy (cryo-EM) has recently emerged as a powerful tool for obtaining three-dimensional (3D) structures of biological macromolecules in native states. A minimum cryo-EM image data set for deriving a meaningful reconstruction comprises thousands of randomly orientated projections of identical particles photographed with a small number of electrons. The computation of a 3D structure from 2D projections requires clustering, which aims to enhance the signal-to-noise ratio in each view by grouping similarly oriented images. Nevertheless, the prevailing clustering techniques are often compromised by three characteristics of cryo-EM data: high noise content, high dimensionality, and a large number of clusters. Moreover, since clustering requires registering images of similar orientation into the same pixel coordinates by 2D alignment, it is desirable that the clustering algorithm label misaligned images as outliers. Herein, we introduce a clustering algorithm, γ-SUP, which models the data with a q-Gaussian mixture, adopts the minimum γ-divergence for estimation, and uses a self-updating procedure to obtain the numerical solution. We apply γ-SUP to the cryo-EM images of two benchmark macromolecules, RNA polymerase II and the ribosome. In the former case, simulated images were chosen to decouple clustering from alignment, demonstrating that γ-SUP is more robust to misalignment outliers than the existing clustering methods used in the cryo-EM community. In the latter case, the clustering of real cryo-EM data by our γ-SUP method eliminates noise in many views to reveal true structural features of the ribosome at the projection level.


Robust independent component analysis via minimum divergence estimation

IEEE Journal of Selected Topics in Signal Processing (2013) 7 (4): 614-624

Independent component analysis (ICA) has been shown to be useful in many applications. However, most ICA methods are sensitive to data contamination and outliers. In this article we introduce a general minimum U-divergence framework for ICA, which covers some standard ICA methods as special cases.

Within the U-family we further focus on the γ-divergence due to its desirable property of super robustness, which gives the proposed method γ-ICA. Statistical properties and technical conditions for the consistency of γ-ICA are rigorously studied. In the limiting case, it leads to a necessary and sufficient condition for the consistency of MLE-ICA. This necessary and sufficient condition is weaker than the condition known in the literature. Since the parameter of interest in ICA is an orthogonal matrix, a geometrical algorithm based on gradient flows on the special orthogonal group is introduced to implement γ-ICA. Furthermore, a data-driven selection of the γ value, which is critical to the performance of γ-ICA, is developed. The performance, especially the robustness, of γ-ICA in comparison with standard ICA methods is demonstrated through experimental studies using simulated data and image data.


On multilinear principal component analysis of order-two tensors

Biometrika (2012) 99 (3): 569-583

Principal component analysis is commonly used for dimension reduction in analyzing high dimensional data. Multilinear principal component analysis aims to serve a similar function for analyzing tensor structure data, and has empirically been shown effective in reducing dimensionality. In this paper, we investigate its statistical properties and demonstrate its advantages. Conventional principal component analysis, which vectorizes the tensor data, may lead to inefficient and unstable prediction due to the often extremely large dimensionality involved. Multilinear principal component analysis, in trying to preserve the data structure, searches for low-dimensional projections and, thereby, decreases dimensionality more efficiently. Asymptotic theory of order-two multilinear principal component analysis, including asymptotic efficiency and distributions of principal components, associated projections, and the explained variance, is developed. A test of dimensionality is also proposed. Finally, multilinear principal component analysis is shown to improve conventional principal component analysis in analyzing the Olivetti faces data set, which is achieved by extracting a more modularly-oriented basis set in reconstructing the test faces.


Regulation of mammalian transcription by Gdown1 through a novel steric crosstalk revealed by cryo-EM

EMBO Journal (2012) 31 (17): 3575-87

In mammals, a distinct RNA polymerase II form, RNAPII(G), contains a novel subunit Gdown1 (encoded by POLR2M), which represses gene activation, only to be reversed by the multisubunit Mediator co-activator. Here, we employed single-particle cryo-electron microscopy (cryo-EM) to disclose the architectures of RNAPII(G), RNAPII and RNAPII in complex with the transcription initiation factor TFIIF, all to ~19 Å. Difference analysis mapped Gdown1 mostly to the RNAPII Rpb5 shelf-Rpb1 jaw, supported by antibody labelling experiments. These structural features correlate with the moderate increase in the efficiency of RNA chain elongation by RNAPII(G). In addition, our updated RNAPII–TFIIF map showed that TFIIF tethers multiple regions surrounding the DNA-binding cleft, in agreement with cross-linking and biochemical mapping. Gdown1’s binding sites overlap extensively with those of TFIIF, with Gdown1 sterically excluding TFIIF from RNAPII, herein demonstrated by competition assays using size exclusion chromatography. In summary, our work establishes a structural basis for Gdown1 impeding initiation at promoters, by obstruction of TFIIF, accounting for an additional dependent role of Mediator in activated transcription.


Toward automated denoising of single molecular Förster resonance energy transfer data

Journal of Biomedical Optics (2012) 17 (1): 011007

A wide-field two-channel fluorescence microscope is a powerful tool as it allows for the study of conformation dynamics of hundreds to thousands of immobilized single molecules by Förster resonance energy transfer (FRET) signals. To date, the data reduction from a movie to a final set containing meaningful single-molecule FRET (smFRET) traces involves human inspection and intervention at several critical steps, greatly hampering the efficiency at the post-imaging stage. To facilitate the data reduction from smFRET movies to smFRET traces and to address the noise-limited issues, we developed a statistical denoising system toward fully automated processing. This data reduction system embeds several novel approaches. First, for background subtraction, the high-order singular value decomposition (HOSVD) method is employed to extract spatial and temporal features. Second, to register and map the two color channels, the spots representing bleed-through from the donor channel to the acceptor channel are used. Finally, correlation analysis and a likelihood ratio statistic for change point detection (CPD) are developed to study the two channels simultaneously, resolve FRET states, and report the dwell time of each state. The performance of our method has been checked using both simulated and real data.
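To illustrate the change point detection step in its simplest form, the sketch below locates a single change in the mean of a one-channel trace with a Gaussian likelihood ratio statistic. It is a simplified stand-in under stated assumptions (one channel, one change point, Gaussian noise with a common unknown variance), not the two-channel method developed in the paper; the function name `single_change_point` and the `min_seg` parameter are hypothetical.

```python
import numpy as np

def single_change_point(x, min_seg=2):
    """Locate one change in mean by maximizing a likelihood ratio statistic.

    For Gaussian noise with a common unknown variance, twice the log
    likelihood ratio of "two means split at k" versus "one mean" equals
    n * log(SSE_one_mean / SSE_two_means). Assumes a noisy trace, so the
    two-segment residual sum of squares is never exactly zero.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    sse_null = ((x - x.mean()) ** 2).sum()
    best_k, best_stat = None, -np.inf
    for k in range(min_seg, n - min_seg + 1):
        left, right = x[:k], x[k:]
        sse_alt = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
        stat = n * np.log(sse_null / sse_alt)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat    # first state occupies frames 0 .. best_k - 1
```

With a known frame time, the dwell time of the first state in this sketch is `best_k` multiplied by that frame time; handling several states would require applying the search recursively to each segment and a threshold on the statistic.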


Zernike phase plate cryoelectron microscopy facilitates single particle analysis of unstained asymmetric protein complexes

Structure (2010) 18 (1): 17-27

Single particle reconstruction from cryoelectron microscopy images, though emerging as a powerful means in structural biology, faces challenges when applied to asymmetric proteins smaller than megadaltons due to low contrast. A Zernike phase plate can improve the contrast by restoring the microscope contrast transfer function. Here, by exploiting simulated Zernike and conventional defocused cryoelectron microscope images with noise characteristics comparable to those of experimental data, we quantified the efficiencies of the steps in single particle analysis of ice-embedded RNA polymerase II (500 kDa), transferrin receptor complex (290 kDa), and T7 RNA polymerase lysozyme (100 kDa). Our results show that Zernike phase plate imaging is more effective for particle identification and for sorting of orientations, conformations, and compositions. Moreover, our analysis of image alignment indicates that a Zernike phase plate can, in principle, reduce the number of particles required to attain near-atomic resolution by 10–100 fold for proteins between 100 kDa and 500 kDa.