Statistical Analysis of RNA Sequencing Count Data

Author: Gu Mi

Publisher:

Published: 2014

Total Pages: 141

ISBN-13:

RNA-Sequencing (RNA-Seq) has rapidly become the de facto technique in transcriptome studies. However, established statistical methods for analyzing experimental and observational microarray studies need to be revised or completely re-invented to accommodate RNA-Seq data's unique characteristics. In this dissertation, we focus on statistical analyses performed at two particular stages in the RNA-Seq pipeline, namely, regression analysis of gene expression levels including tests for differential expression (DE) and the downstream Gene Ontology (GO) enrichment analysis. The negative binomial (NB) distribution has been widely adopted to model RNA-Seq read counts for its flexibility in accounting for any extra-Poisson variability. Because of the relatively small number of samples in a typical RNA-Seq experiment, power-saving strategies include assuming some commonalities of the NB dispersion parameters across genes, via simple models relating them to mean expression rates. Many such NB dispersion models have been proposed, but there is limited research on evaluating model adequacy. We propose a simulation-based goodness-of-fit (GOF) test with diagnostic graphics to assess the NB assumption for a single gene via parametric bootstrap and empirical probability plots, and assess the adequacy of NB dispersion models by combining individual GOF test p-values from all genes. Our simulation studies and real data analyses suggest the NB assumption is valid for modeling a gene's read counts, and provide evidence on how NB dispersion models differ in capturing the variation in the dispersion. It is not well understood to what degree a dispersion-modeling approach can still be useful when a fitted dispersion model captures a significant part, but not all, of the variation in the dispersion. As a further step towards understanding the power-robustness trade-offs of NB dispersion models, we propose a simple statistic to quantify the inadequacy of a fitted NB dispersion model.
Subsequent power-robustness analyses are guided by this estimated residual dispersion variation and other controlling factors estimated from real RNA-Seq datasets. The proposed measure for quantifying residual dispersion variation gives hints on whether we can gain statistical power by a dispersion-modeling approach. Our real-data-based simulations also provide benchmarking investigations into the power and robustness properties of the many NB dispersion methods in the current RNA-Seq community. For statistical tests of enriched GO categories, which aim to relate the outcome of DE analysis to biological functions, the transcript length becomes a confounding factor as it correlates with both the GO membership and the significance of the DE test. We propose to adjust for such bias using logistic regression, incorporating the length as a covariate. The use of continuous measures of differential expression via transformations of DE test p-values also avoids the subjective specification of a p-value threshold adopted by contingency-table-based approaches. Simulation and real data examples indicate that enriched categories no longer favor longer transcripts after the adjustment, which supports the effectiveness of our proposed method.
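
To make the "extra-Poisson variability" mentioned in the abstract above concrete, the following Python sketch uses the common mean-dispersion parameterization of the NB distribution, Var(Y) = mu + phi * mu^2, together with a crude method-of-moments dispersion estimate. This is not the dissertation's method; the function names and the moment estimator are assumptions chosen purely for illustration.

```python
from statistics import mean, variance


def nb_variance(mu, phi):
    """Variance of an NB count with mean mu and dispersion phi.

    With phi = 0 this reduces to the Poisson variance (equal to mu).
    """
    return mu + phi * mu ** 2


def moment_dispersion(counts):
    """Crude method-of-moments estimate of phi for one gene (illustrative only).

    Solves s^2 = m + phi * m^2 for phi, truncating at zero when the
    sample variance does not exceed the sample mean.
    """
    m = mean(counts)
    s2 = variance(counts)  # unbiased sample variance
    if m == 0:
        return 0.0
    return max((s2 - m) / m ** 2, 0.0)


if __name__ == "__main__":
    reads = [2, 4, 6, 8, 30]  # hypothetical read counts for a single gene
    phi_hat = moment_dispersion(reads)
    print("estimated dispersion:", phi_hat)
    print("implied NB variance:", nb_variance(mean(reads), phi_hat))
```

An estimate near zero suggests the counts are roughly Poisson, while larger values indicate over-dispersion. With the small sample sizes the abstract describes, per-gene estimates like this one are noisy, which is the motivation it gives for modeling dispersion across genes.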


Statistical Analysis of Next Generation Sequencing Data

Author: Somnath Datta

Publisher: Springer

Published: 2014-07-03

Total Pages: 438

ISBN-13: 3319072129

Next Generation Sequencing (NGS) is the latest high throughput technology to revolutionize genomic research. NGS generates massive genomic datasets that play a key role in the big data phenomenon that surrounds us today. To extract signals from high-dimensional NGS data and make valid statistical inferences and predictions, novel data analytic and statistical techniques are needed. This book contains 20 chapters written by prominent statisticians working with NGS data. The topics range from basic preprocessing and analysis with NGS data to more complex genomic applications such as copy number variation and isoform expression detection. Research statisticians who want to learn about this growing and exciting area will find this book useful. In addition, many chapters from this book could be included in graduate-level classes in statistical bioinformatics for training future biostatisticians who will be expected to deal with genomic data in basic biomedical research, genomic clinical trials and personalized medicine. About the editors: Somnath Datta is Professor and Vice Chair of Bioinformatics and Biostatistics at the University of Louisville. He is Fellow of the American Statistical Association, Fellow of the Institute of Mathematical Statistics and Elected Member of the International Statistical Institute. He has contributed to numerous research areas in Statistics, Biostatistics and Bioinformatics. Dan Nettleton is Professor and Laurence H. Baker Endowed Chair of Biological Statistics in the Department of Statistics at Iowa State University. He is Fellow of the American Statistical Association and has published research on a variety of topics in statistics, biology and bioinformatics.


RNA-seq Data Analysis

Author: Eija Korpelainen

Publisher: CRC Press

Published: 2014-09-19

Total Pages: 322

ISBN-13: 1466595019

The State of the Art in Transcriptome Analysis: RNA sequencing (RNA-seq) data offers unprecedented information about the transcriptome, but harnessing this information with bioinformatics tools is typically a bottleneck. RNA-seq Data Analysis: A Practical Approach enables researchers to examine differential expression at gene, exon, and transcript le


Gene Expression Data Analysis

Author: Pankaj Barah

Publisher: CRC Press

Published: 2021-11-08

Total Pages: 276

ISBN-13: 1000425754

Development of high-throughput technologies in molecular biology during the last two decades has contributed to the production of tremendous amounts of data. Microarray and RNA sequencing are two such widely used high-throughput technologies for simultaneously monitoring the expression patterns of thousands of genes. Data produced from such experiments are voluminous (both in dimensionality and numbers of instances) and evolving in nature. Analysis of huge amounts of data toward the identification of interesting patterns that are relevant for a given biological question requires high-performance computational infrastructure as well as efficient machine learning algorithms. Cross-communication of ideas between biologists and computer scientists remains a big challenge. Gene Expression Data Analysis: A Statistical and Machine Learning Perspective has been written with a multidisciplinary audience in mind. The book discusses gene expression data analysis from molecular biology, machine learning, and statistical perspectives. Readers will be able to acquire both theoretical and practical knowledge of methods for identifying novel patterns of high biological significance. To measure the effectiveness of such algorithms, we discuss statistical and biological performance metrics that can be used in real life or in a simulated environment. This book discusses a large number of benchmark algorithms, tools, systems, and repositories that are commonly used in analyzing gene expression data and validating results. This book will benefit students, researchers, and practitioners in biology, medicine, and computer science by enabling them to acquire in-depth knowledge in statistical and machine-learning-based methods for analyzing gene expression data. 
Key Features: an introduction to the Central Dogma of molecular biology and information flow in biological systems; a systematic overview of the methods for generating gene expression data; background knowledge on statistical modeling and machine learning techniques; detailed methodology of analyzing gene expression data with an example case study; clustering methods for finding co-expression patterns from microarray, bulk RNA, and scRNA data; a large number of practical tools, systems, and repositories that are useful for computational biologists to create, analyze, and validate biologically relevant gene expression patterns; suitable for multidisciplinary researchers and practitioners in computer science and the biological sciences.


Statistical Analysis of Microbiome Data with R

Author: Yinglin Xia

Publisher: Springer

Published: 2018-10-06

Total Pages: 518

ISBN-13: 9811315345

This unique book addresses the statistical modelling and analysis of microbiome data using cutting-edge R software. It includes real-world data from the authors’ research and from the public domain, and discusses the implementation of R for data analysis step by step. The data and R computer programs are publicly available, allowing readers to replicate the model development and data analysis presented in each chapter, so that these new methods can be readily applied in their own research. The book also discusses recent developments in statistical modelling and data analysis in microbiome research, as well as the latest advances in next-generation sequencing and big data in methodological development and applications. This timely book will greatly benefit all readers involved in microbiome, ecology and microarray data analyses, as well as other fields of research.


Computational Genomics with R

Author: Altuna Akalin

Publisher: CRC Press

Published: 2020-12-16

Total Pages: 462

ISBN-13: 1498781861

Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background. The book also contains practical and well-documented examples in R so readers can analyze their data by simply reusing the code presented. As the field of computational genomics is interdisciplinary, it requires different starting points for people with different backgrounds. For example, a biologist might skip sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology. After reading: You will have the basics of R and be able to dive right into specialized uses of R for computational genomics such as using Bioconductor packages. You will be familiar with statistics, supervised and unsupervised learning techniques that are important in data modeling, and exploratory analysis of high-dimensional data. You will understand genomic intervals and operations on them that are used for tasks such as aligned read counting and genomic feature annotation. You will know the basics of processing and quality checking high-throughput sequencing data. You will be able to do sequence analysis, such as calculating GC content for parts of a genome or finding transcription factor binding sites. You will know about visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization. You will be familiar with analysis of different high-throughput sequencing data sets, such as RNA-seq, ChIP-seq, and BS-seq. You will know basic techniques for integrating and interpreting multi-omics datasets.
Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.
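
One of the sequence-analysis tasks named in the description above, calculating GC content, can be sketched in a few lines. The book itself works in R; the Python version below is only an illustrative stand-in, and the function name and the choice to ignore ambiguous bases such as N are assumptions rather than anything taken from the book.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence.

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    Returns 0.0 when the sequence contains no unambiguous bases.
    """
    seq = sequence.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


if __name__ == "__main__":
    print(gc_content("ATGCGCNNAT"))  # hypothetical sequence fragment
```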


A Comparison of Statistical Models for Correlated Over-dispersed Count Data

Author: Elizabeth Anne Wynn

Publisher:

Published: 2018

Total Pages: 68

ISBN-13:

As the cost of RNA-sequencing (RNA-Seq) decreases, it becomes increasingly feasible to collect RNA-Seq data under complex study designs, including paired, longitudinal, and other correlated designs. Commonly used RNA-Seq analysis tools do not allow for correlation between observations, which is common in these types of studies. When applying statistical methods with mechanisms to account for correlated data to RNA-Seq experiments, extra considerations must be made because RNA-Seq experiments include data on 10,000 to 20,000 genes, resulting in a large number of statistical models and tests. Thus, in this setting achieving model convergence for all genes and maintaining nominal type 1 error and false discovery rates can be problematic. Furthermore, RNA-Seq data are over-dispersed counts, and so analysis methods must also account for the non-normality of the data. In this study we evaluate the utility of several common statistical methods for correlated, over-dispersed count data in the context of RNA-Seq experiments via a simulation study and application to a longitudinal RNA-Seq dataset. The methods compared include generalized estimating equations, generalized linear mixed models, and linear mixed models after taking a normalizing transformation of the count data. We also compare these methods to popular approaches for analyzing RNA-Seq data using the edgeR, DESeq2 and limma packages in R. Additionally, for each method we explore the use of several degrees of freedom approximations used in significance testing. Finally, recommendations as to which methods are most appropriate under various circumstances are provided.
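
The abstract above mentions fitting linear mixed models after a normalizing transformation of the count data but does not say which transformation is used. As a rough sketch of what such a step can look like, the Python snippet below computes log2 counts per million with a pseudocount. The function name, the pseudocount of 1, and the choice of log-CPM itself are assumptions made for illustration.

```python
import math


def log2_cpm(counts, pseudocount=1.0):
    """Log2 counts-per-million for one sample (illustrative transformation).

    counts: raw read counts per gene for a single sample.
    pseudocount: added to the CPM value so that zero counts map to a
    finite number (an assumption of this sketch).
    """
    library_size = sum(counts)
    if library_size == 0:
        raise ValueError("library size is zero; cannot scale counts")
    return [math.log2(c / library_size * 1e6 + pseudocount) for c in counts]


if __name__ == "__main__":
    sample = [0, 5, 15, 980]  # hypothetical counts for four genes
    print(log2_cpm(sample))
```

Scaling by library size makes samples with different sequencing depths comparable, and the log compresses the right tail of the counts so that methods assuming approximate normality become more plausible.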


Bioinformatics and Computational Biology Solutions Using R and Bioconductor

Author: Robert Gentleman

Publisher: Springer Science & Business Media

Published: 2005-12-29

Total Pages: 478

ISBN-13: 0387293620

Full four-color book. Some of the editors created the Bioconductor project and Robert Gentleman is one of the two originators of R. All methods are illustrated with publicly available data, and a major section of the book is devoted to fully worked case studies. Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.


Statistical Methods for the Analysis of RNA Sequencing Data

Author: Man-Kee Maggie Chu

Publisher:

Published: 2014

Total Pages: 340

ISBN-13:

The next generation sequencing technology, RNA-sequencing (RNA-seq), has gained popularity over traditional microarrays in transcriptome analyses. Statistical methods used for gene expression analyses with these two technologies are different because the array-based technology measures intensities using continuous distributions, whereas RNA-seq provides absolute quantification of gene expression using counts of reads. There is a need for reliable statistical methods to exploit the information from the rapidly evolving sequencing technologies, and limited work has been done on expression analysis of time-course RNA-seq data. Functional clustering is an important method for examining gene expression patterns and thus discovering co-expressed genes to better understand the biological systems. Clustering-based approaches to analyze repeated digital gene expression measures are in demand. In this dissertation, we propose a model-based clustering method for identifying gene expression patterns in time-course RNA-seq data. Our approach employs a longitudinal negative binomial mixture model to model the over-dispersed time-course gene count data. The effectiveness of the proposed clustering method is assessed using simulated data and is illustrated by real data from time-course genomic experiments. Due to the complexity and size of genomic data, the choice of good starting values is an important issue for the proposed clustering algorithm. There is a need for a reliable initialization strategy for cluster-wise regression specifically for time-course discrete count data. We modify existing common initialization procedures to suit our model-based clustering algorithm; the procedures are evaluated through a simulation study on artificial datasets and are applied to real genomic examples to identify the optimal initialization method. Another common issue in gene expression analysis is the presence of missing values in the datasets.
Various treatments for missing values in genomic datasets have been developed, but limited work has been done on RNA-seq data. In the current work, we examine the performance of various imputation methods and their impact on the clustering of time-course RNA-seq data. We develop a cluster-based imputation method which is specifically suitable for dealing with missing values in RNA-seq datasets. Simulation studies are provided to assess the performance of the proposed imputation approach.


Algorithms for Minimization Without Derivatives

Author: Richard P. Brent

Publisher: Courier Corporation

Published: 2013-06-10

Total Pages: 210

ISBN-13: 0486143686

Outstanding text for graduate students and research workers proposes improvements to existing algorithms, extends their related mathematical theories, and offers details on new algorithms for approximating local and global minima.