Statistical Simulation and Analysis of Single-cell RNA-seq Data

Author: Tianyi Sun

Published: 2023

The recent development of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies by revealing the genome-wide gene expression levels within individual cells. In contrast to bulk RNA sequencing, scRNA-seq technology captures cell-specific transcriptome landscapes, which can reveal crucial information about cell-to-cell heterogeneity across different tissues, organs, and systems and enable the discovery of novel cell types and new transient cell states. According to search results from PubMed, from 2009 to 2023, over 5,000 published studies have generated datasets using this technology. Such large volumes of data call for high-quality statistical methods for their analysis. In the three projects of this dissertation, I have explored and developed statistical methods to model the marginal and joint gene expression distributions and determine the latent structure type for scRNA-seq data. In all three projects, synthetic data simulation plays a crucial role. My first project focuses on the exploration of the Beta-Poisson hierarchical model for the marginal gene expression distribution of scRNA-seq data. This model is a simplified mechanistic model with biological interpretations. Through data simulation, I demonstrate three typical behaviors of this model under different parameter combinations, one of which can be interpreted as one source of the sparsity and zero inflation that is often observed in scRNA-seq datasets. Further, I discuss parameter estimation methods for this model and its other applications in the analysis of scRNA-seq data. My second project focuses on the development of a statistical simulator, scDesign2, to generate realistic synthetic scRNA-seq data. Although dozens of simulators have been developed before, they lack the capacity to simultaneously achieve the following three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths.
To fill in this gap, scDesign2 is developed as a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copula. We verify that scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do. Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in single-cell RNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs. My third project focuses on deciding latent structure types for scRNA-seq datasets. Clustering and trajectory inference are two important data analysis tasks that can be performed for scRNA-seq datasets and will lead to different interpretations. However, as of now, there is no principled way to tell which one of these two types of analysis results is more suitable to describe a given dataset. In this project, we propose two computational approaches that aim to distinguish cluster-type vs. trajectory-type scRNA-seq datasets. The first approach is based on building a classifier using eigenvalue features of the gene expression covariance matrix, drawing inspiration from random matrix theory (RMT). 
The second approach is based on comparing the similarity of real data and simulated data generated by assuming the cell latent structure as clusters or a trajectory. While both approaches have limitations, we show that the second approach gives more promising results and has room for further improvements.
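
As a rough illustration of the Beta-Poisson hierarchical model mentioned in the first project, the sketch below simulates counts for a single gene. The parameterization (a Beta-distributed probability multiplied by a constant before the Poisson draw), the function names, and the parameter values are assumptions chosen for illustration; they are not taken from the dissertation.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) count with Knuth's multiplication method.

    Suitable for the small rates used here; not meant for very large rates.
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_beta_poisson(alpha, beta, scale, n_cells, seed=0):
    """Simulate one gene: p ~ Beta(alpha, beta), count | p ~ Poisson(scale * p).

    This parameterization is an assumption made for the example.
    """
    rng = random.Random(seed)
    return [
        sample_poisson(scale * rng.betavariate(alpha, beta), rng)
        for _ in range(n_cells)
    ]


def zero_fraction(counts):
    """Fraction of cells with a zero count."""
    return sum(c == 0 for c in counts) / len(counts)


# Both settings have the same expected count, scale * alpha / (alpha + beta) = 10,
# but the shape of the distribution differs sharply.
bursty = simulate_beta_poisson(alpha=0.1, beta=0.1, scale=20.0, n_cells=2000)
steady = simulate_beta_poisson(alpha=10.0, beta=10.0, scale=20.0, n_cells=2000)

print("zero fraction, alpha = beta = 0.1:", zero_fraction(bursty))
print("zero fraction, alpha = beta = 10: ", zero_fraction(steady))
```

With both shape parameters well below 1, the Beta draw concentrates near 0 and 1, so many cells receive a near-zero Poisson rate. This is one way a model of this kind can produce the excess zeros described in the abstract.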


Statistical Methods for Bulk and Single-cell RNA Sequencing Data

Author: Wei Li

Published: 2019

Total Pages: 207

Since their invention, next-generation RNA sequencing (RNA-seq) technologies have become a powerful tool to study the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies on bulk tissues. Recently, the emerging single-cell RNA sequencing (scRNA-seq) technologies have enabled the investigation of transcriptomic landscapes at single-cell resolution, providing a chance to characterize stochastic heterogeneity within a cell population. The analysis of bulk and single-cell RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date. The first part of this dissertation focuses on the statistical challenges in the transcript-level analysis of bulk RNA-seq data. The next-generation RNA-seq technologies have been widely used to assess full-length RNA isoform structure and abundance in a high-throughput manner, enabling us to better understand the alternative splicing process and transcriptional regulation mechanism. However, accurate isoform identification and quantification from RNA-seq data are challenging due to the information loss in sequencing experiments. In Chapter 2, given the fast accumulation of multiple RNA-seq datasets from the same biological condition, we develop a statistical method, MSIQ, to achieve more accurate isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. The MSIQ method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples and allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy of MSIQ compared with alternative methods through both simulation and real data studies.
In Chapter 3, we introduce a novel method, AIDE, the first approach that directly controls false isoform discoveries by implementing the statistical model selection principle. Solving the isoform discovery problem in a stepwise manner, AIDE prioritizes the annotated isoforms and precisely identifies novel isoforms whose addition significantly improves the explanation of observed RNA-seq reads. Our results demonstrate that AIDE has the highest precision compared to the state-of-the-art methods, and it is able to identify isoforms with biological functions in pathological conditions. The second part of this dissertation discusses two statistical methods to improve scRNA-seq data analysis, which is complicated by the excess missing values, the so-called dropouts due to low amounts of mRNA sequenced within individual cells. In Chapter 5, we introduce scImpute, a statistical method to accurately and robustly impute the dropouts in scRNA-seq data. The scImpute method automatically identifies likely dropouts, and only performs imputation on these values by borrowing information across similar cells. Evaluation based on both simulated and real scRNA-seq data suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropouts, enhance the clustering of cell subpopulations, and improve the accuracy of differential expression analysis. In Chapter 6, we propose a flexible and robust simulator, scDesign, to optimize the choices of sequencing depth and cell number in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. It is the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings.


Statistical Methods for RNA-sequencing Data

Author: Rhonda Bacher

Published: 2017

Major methodological and technological advances in sequencing have inspired ambitious biological questions that were previously elusive. Addressing such questions with novel and complex data requires statistically rigorous tools. In this dissertation, I develop, evaluate, and apply statistical and computational methods for analysis of high-throughput sequencing data. A unifying theme of this work is that all these methods are aimed at RNA-seq data. The first method focuses on characterizing gene expression in RNA-seq experiments with ordered conditions. The second focuses on single-cell RNA-seq data, where we develop a method for normalization to account for a previously unknown technical artifact in the data. Finally, we develop a simulation in order to recapitulate the source of the artifact in silico.


Statistical Methods for the Analysis of Genomic Data

Author: Hui Jiang

Publisher: MDPI

Published: 2020-12-29

Total Pages: 136

ISBN-13: 3039361406

In recent years, technological breakthroughs have greatly enhanced our ability to understand the complex world of molecular biology. Rapid developments in genomic profiling techniques, such as high-throughput sequencing, have brought new opportunities and challenges to the fields of computational biology and bioinformatics. Furthermore, by combining genomic profiling techniques with other experimental techniques, many powerful approaches (e.g., RNA-Seq, ChIP-Seq, single-cell assays, and Hi-C) have been developed in order to help explore complex biological systems. As a result of the increasing availability of genomic datasets, in terms of both volume and variety, the analysis of such data has become a critical challenge as well as a topic of great interest. Therefore, statistical methods that address the problems associated with these newly developed techniques are in high demand. This book includes a number of studies that highlight the state-of-the-art statistical methods for the analysis of genomic data and explore future directions for improvement.


Statistical Methods for Whole Transcriptome Sequencing

Author: Cheng Jia

Published: 2017

RNA-Sequencing (RNA-Seq) has enabled detailed unbiased profiling of whole transcriptomes with incredible throughput. Recent technological breakthroughs have pushed back the frontiers of RNA expression measurement to the single-cell level (scRNA-Seq). With both bulk and single-cell RNA-Seq analyses, modeling of the noise structure embedded in the data is crucial for drawing correct inference. In this dissertation, I developed a series of statistical methods to account for the technical variations specific to RNA-Seq experiments in the context of isoform- or gene-level differential expression analyses. In the first part of my dissertation, I developed MetaDiff (https://github.com/jiach/MetaDiff), a random-effects meta-regression model that allows the incorporation of uncertainty in isoform expression estimation in isoform differential expression analysis. This framework was further extended to detect splicing quantitative trait loci with RNA-Seq data. In the second part of my dissertation, I developed TASC (Toolkit for Analysis of Single-Cell data; https://github.com/scrna-seq/TASC), a hierarchical mixture model, to explicitly adjust for cell-to-cell technical differences in scRNA-Seq analysis using an empirical Bayes approach. This framework can be adapted to perform differential gene expression analysis. In the third part of my dissertation, I developed TASC-B, a method extended from TASC to model transcriptional bursting-induced zero-inflation. This model can identify and test for the difference in the level of transcriptional bursting. Compared to existing methods, these new tools have been shown to better control the false discovery rate in situations where technical noise cannot be ignored. They also display superior power in both our simulation studies and real-world applications.


Statistical Modeling in Biomedical Research

Author: Yichuan Zhao

Publisher: Springer Nature

Published: 2020-03-19

Total Pages: 495

ISBN-13: 3030334163

This edited collection discusses the emerging topics in statistical modeling for biomedical research. Leading experts in the frontiers of biostatistics and biomedical research discuss the statistical procedures, useful methods, and their novel applications in biostatistics research. Interdisciplinary in scope, the volume as a whole reflects the latest advances in statistical modeling in biomedical research, identifies impactful new directions, and seeks to drive the field forward. It also fosters the interaction of scholars in the arena, offering great opportunities to stimulate further collaborations. This book will appeal to industry data scientists and statisticians, researchers, and graduate students in biostatistics and biomedical science. It covers the following topics: next-generation sequence data analysis; deep learning, precision medicine, and their applications; large-scale data analysis and its applications; biomedical research and modeling; and survival analysis with complex data structure and its applications.


Statistical Analysis of RNA Sequencing Count Data

Author: Gu Mi

Published: 2014

Total Pages: 141

RNA-Sequencing (RNA-Seq) has rapidly become the de facto technique in transcriptome studies. However, established statistical methods for analyzing experimental and observational microarray studies need to be revised or completely re-invented to accommodate RNA-Seq data's unique characteristics. In this dissertation, we focus on statistical analyses performed at two particular stages in the RNA-Seq pipeline, namely, regression analysis of gene expression levels including tests for differential expression (DE) and the downstream Gene Ontology (GO) enrichment analysis. The negative binomial (NB) distribution has been widely adopted to model RNA-Seq read counts for its flexibility in accounting for any extra-Poisson variability. Because of the relatively small number of samples in a typical RNA-Seq experiment, power-saving strategies include assuming some commonalities of the NB dispersion parameters across genes, via simple models relating them to mean expression rates. Many such NB dispersion models have been proposed, but there is limited research on evaluating model adequacy. We propose a simulation-based goodness-of-fit (GOF) test with diagnostic graphics to assess the NB assumption for a single gene via parametric bootstrap and empirical probability plots, and assess the adequacy of NB dispersion models by combining individual GOF test p-values from all genes. Our simulation studies and real data analyses suggest the NB assumption is valid for modeling a gene's read counts, and provide evidence on how NB dispersion models differ in capturing the variation in the dispersion. It is not well understood to what degree a dispersion-modeling approach can still be useful when a fitted dispersion model captures a significant part, but not all, of the variation in the dispersion. As a further step towards understanding the power-robustness trade-offs of NB dispersion models, we propose a simple statistic to quantify the inadequacy of a fitted NB dispersion model.
Subsequent power-robustness analyses are guided by this estimated residual dispersion variation and other controlling factors estimated from real RNA-Seq datasets. The proposed measure for quantifying residual dispersion variation gives hints on whether we can gain statistical power by a dispersion-modeling approach. Our real-data-based simulations also provide benchmarking investigations into the power and robustness properties of the many NB dispersion methods in the current RNA-Seq community. For statistical tests of enriched GO categories, which aim to relate the outcome of DE analysis to biological functions, the transcript length becomes a confounding factor as it correlates with both the GO membership and the significance of the DE test. We propose to adjust for such bias using logistic regression with the transcript length incorporated as a covariate. The use of continuous measures of differential expression via transformations of DE test p-values also avoids the subjective specification of a p-value threshold adopted by contingency-table-based approaches. Simulation and real data examples indicate that enriched categories no longer favor longer transcripts after the adjustment, which supports the effectiveness of our proposed method.
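
To make the role of the NB dispersion parameter concrete, the sketch below draws NB counts as a gamma-Poisson mixture, so that the variance equals mean + dispersion × mean², and then recovers the dispersion with a simple method-of-moments estimate. The estimator is a stand-in for illustration only; it is not the goodness-of-fit test or the residual-dispersion statistic proposed in the dissertation, and all function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) count with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mean, dispersion, rng):
    """Gamma-Poisson mixture with Var = mean + dispersion * mean**2.

    The Poisson rate is Gamma(shape=1/dispersion, scale=mean*dispersion),
    which has expectation `mean` and variance dispersion * mean**2.
    """
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return sample_poisson(rate, rng)


def moment_dispersion(counts):
    """Method-of-moments dispersion: (sample variance - sample mean) / mean**2, floored at 0."""
    m = statistics.fmean(counts)
    v = statistics.variance(counts)
    return max(0.0, (v - m) / (m * m))


rng = random.Random(1)
counts = [sample_negative_binomial(mean=5.0, dispersion=0.5, rng=rng) for _ in range(5000)]
estimate = moment_dispersion(counts)

print("sample mean:", statistics.fmean(counts))
print("estimated dispersion:", estimate)
```

A dispersion of 0 would reduce the model to a Poisson distribution, which is why a positive estimate indicates extra-Poisson variability. With only a handful of samples per gene, an estimate like this becomes very noisy, which is the motivation for sharing information about dispersion across genes.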


Statistical Methods in Single Cell and Spatial Transcriptomics Data

Author: Roopali Singh

Published: 2021

Single-cell RNA-sequencing (scRNA-seq) allows one to study the transcriptomics of different cell types in heterogeneous samples (e.g., tissues) at the single-cell level. Most scRNA-seq protocols experience high levels of dropout due to the small amount of starting material, leading to a majority of reported expression levels being zero. Though missing data contain information about reproducibility, they are often excluded in the reproducibility assessment, potentially generating misleading assessments. In the first part of my dissertation, we develop a copula-based regression model to assess how the reproducibility of high-throughput experiments is affected by the choices of operational factors (e.g., platform or sequencing depth) when a large number of measurements are missing. Simulations show that our method is more accurate in detecting differences in reproducibility than existing measures of reproducibility. We illustrate the usefulness of our method by comparing the reproducibility of different library preparation platforms and studying the effect of sequencing depth on reproducibility, thereby determining the cost-effective sequencing depth that is required to achieve sufficient reproducibility. The spatial locations of these single cells are lost in scRNA-seq data. A recently emerging technology, Spatial Transcriptomics (ST), measures the gene expression in a tissue slice in situ, maintaining cells' spatial information in the tissue. However, ST does not have single-cell resolution; each spot instead captures a group of potentially heterogeneous cells, which needs to be deconvolved to learn the cell composition at that spot. In the second part of my dissertation, we develop a reference-free deconvolution method, based on Bayesian non-negative matrix factorization, to infer the cell type composition of each spot. Unlike the existing deconvolution methods, which all take reference-based approaches, our approach does not rely on scRNA-seq references.
Simulations show that our method is more accurate in detecting the cell-type compositions than existing deconvolution techniques under varying spot size, heterogeneity, and imperfect single-cell references. We illustrate the usefulness of our method using Mouse Brain Cerebellum data and Human Intestine Developmental data.


Benchmarking Statistical and Machine-Learning Methods for Single-cell RNA Sequencing Data

Author: Nan Xi

Published: 2021

Total Pages: 203

The large-scale, high-dimensional, and sparse single-cell RNA sequencing (scRNA-seq) data have raised great challenges in the pipeline of data analysis. A large number of statistical and machine learning methods have been developed to analyze scRNA-seq data and answer related scientific questions. Although different methods claim advantages in certain circumstances, it is difficult for users to select appropriate methods for their analysis tasks. Benchmark studies aim to provide recommendations for method selection based on an objective, accurate, and comprehensive comparison among cutting-edge methods. They can also offer suggestions for further methodological development through massive evaluations conducted on real data. In Chapter 2, we conduct the first systematic benchmark study of nine cutting-edge computational doublet-detection methods. In scRNA-seq, doublets form when two cells are encapsulated into one reaction volume by chance. The existence of doublets, which appear to be real cells but are not, is a key confounder in scRNA-seq data analysis. Computational methods have been developed to detect doublets in scRNA-seq data; however, the scRNA-seq field lacks a comprehensive benchmarking of these methods, making it difficult for researchers to choose an appropriate method for their specific analysis needs. Our benchmark study compares doublet-detection methods in terms of their detection accuracy under various experimental settings, impacts on downstream analyses, and computational efficiency. Our results show that existing methods exhibit diverse performance and distinct advantages in different aspects. In Chapter 3, we develop an R package, DoubletCollection, to integrate the installation and execution of different doublet-detection methods. Traditional benchmark studies can quickly become out of date due to their static design and the rapid growth of available methods.
DoubletCollection addresses this issue in benchmarking doublet-detection methods for scRNA-seq data. DoubletCollection provides a unified interface to perform and visualize downstream analysis after doublet detection. Additionally, we created a protocol using DoubletCollection to execute and benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field. In Chapter 4, we conduct the first comprehensive empirical study to explore the best modeling strategy for autoencoder-based imputation methods specific to scRNA-seq data. Autoencoder-based imputation methods are a promising family of methods for denoising sparse scRNA-seq data; however, the design of autoencoders has not been formally discussed in the literature. Current autoencoder-based imputation methods either borrow the practice from other fields or design the model on an ad hoc basis. We find that method performance is sensitive to the key hyperparameters of autoencoders, including architecture, activation function, and regularization. Their optimal settings on scRNA-seq data are largely different from those on other data types. Our results emphasize the importance of exploring hyperparameter space in such complex and flexible methods. Our work also points out future directions for improving current methods.