Development and Benchmarking of Imputation Methods for Microbiome and Single-cell Sequencing Data

Author: Ruochen Jiang

Publisher:

Published: 2021

Total Pages: 175

ISBN-13:

Next-generation sequencing (NGS) has revolutionized biomedical research and has found broad application. Since its advent around 15 years ago, this highly scalable DNA sequencing technology has generated vast amounts of biological data with new features and has brought new challenges to data analysis. For example, researchers use RNA sequencing (RNA-seq) to quantify gene expression levels more accurately. However, NGS involves many processing steps and technical variation when expression values are measured in biological samples. In other words, the NGS data that researchers observe could be biased by the randomness and constraints of the technology. This dissertation focuses mainly on microbiome sequencing data and single-cell RNA-seq (scRNA-seq) data. Both are highly sparse count data in matrix form. The zeros can be either biological or non-biological, and the high sparsity of the data has brought challenges to data analysis.
The missing data imputation problem has been studied in statistics and social science, because survey data often contain non-responses to some survey questions, and the unanswered questions are marked as "NA" or missing values in the data. Imputation methods provide an informed guess for the missing values. Their purpose is to avoid discarding collected samples and to make it easier to apply state-of-the-art statistical methods. In machine learning, the famous Netflix data challenge on film recommendation systems also falls into the missing data imputation category. Netflix wants to predict how much users would like the movies they have not watched. The scores these users would give to the unwatched films are regarded as missing values in the data. The NGS data imputation problem differs from these two cases in that the missing values in NGS data are not as well defined. The zeros in NGS data could come either from a biological origin (and should not be regarded as missing values) or from a non-biological origin (due to the limitations of the sequencing technology, and should be regarded as missing values). The size (number of samples and features) of an NGS data matrix is usually larger than that of survey data but smaller than that of recommendation system data. In addition, in most cases the percentage of missing values in survey data is lower than the percentage of zeros in NGS data, and film recommendation system data have the highest percentage of missing values (> 99.9%). As a result, the missing data imputation methods commonly used in statistics and machine learning are not directly applicable to NGS data.
In recent years, numerous imputation methods have been proposed to deal with the highly sparse scRNA-seq data. In light of this, this dissertation aims to address two issues. First, microbiome sequencing data, which carry additional information compared to scRNA-seq data, lack an imputation method. Second, whether to use imputation in scRNA-seq data analysis is still controversial. The first part of this dissertation focuses on the first imputation method developed for microbiome sequencing data: mbImpute. Microbiome studies have gained increased attention since many discoveries revealed connections between human microbiome compositions and diseases. A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data, mbImpute, which identifies and recovers likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. Comprehensive simulations verify that mbImpute achieves better imputation accuracy under multiple metrics than five state-of-the-art imputation methods designed for non-microbiome data. In real data applications, we demonstrate that mbImpute improves the power to identify disease-related taxa from microbiome data on type 2 diabetes and colorectal cancer, and that mbImpute preserves the non-zero distributions of taxon abundances.
The second part of this dissertation focuses on how to deal with the high sparsity of scRNA-seq data. ScRNA-seq technologies have revolutionized biomedical sciences by enabling genome-wide profiling of gene expression levels at an unprecedented single-cell resolution. A distinct characteristic of scRNA-seq data is the vast proportion of zeros, which is not seen in bulk RNA-seq data. Researchers view these zeros differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as false signals or missing data to be corrected. As a result, the scRNA-seq field faces much controversy over how to handle zeros in data analysis. First, we discuss the sources of biological and non-biological zeros in scRNA-seq data. Second, we evaluate the impact of non-biological zeros on cell clustering and differential gene expression analysis. Third, we summarize the advantages, disadvantages, and suitable users of three input data types (observed counts, imputed counts, and binarized counts) and evaluate the performance of downstream analysis on each of them. Finally, we discuss the open questions regarding non-biological zeros, the need for benchmarking, and the importance of transparent analysis.
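The idea of recovering zeros by borrowing information from similar samples, and the three input data types named above, can be illustrated with a toy example. The sketch below is not mbImpute and does not reproduce any method from the dissertation. It is a deliberately simple neighbor-based heuristic, and the function names, the majority rule, the log(1 + x) scale, and the data are all assumptions made for illustration.

```python
import math


def log_transform(counts):
    """Apply log(1 + x) to every count so large taxa do not dominate distances."""
    return [[math.log1p(x) for x in row] for row in counts]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def impute_zeros(counts, k=2):
    """Toy neighbor-based imputation on a samples-by-taxa count matrix.

    A zero is filled in only when a strict majority of the k most similar
    samples have a non-zero value for the same taxon. Otherwise it is kept
    as a presumed biological zero. Values are returned on the log(1 + x) scale.
    This rule is an illustrative assumption, not the mbImpute model.
    """
    logged = log_transform(counts)
    imputed = [list(row) for row in logged]
    for i, row in enumerate(logged):
        others = [j for j in range(len(logged)) if j != i]
        neighbors = sorted(others, key=lambda j: euclidean(row, logged[j]))[:k]
        for t, value in enumerate(row):
            if value > 0:
                continue  # only zeros are candidates for imputation
            neighbor_values = [logged[j][t] for j in neighbors]
            n_nonzero = sum(1 for v in neighbor_values if v > 0)
            if 2 * n_nonzero > len(neighbor_values):
                imputed[i][t] = sum(neighbor_values) / len(neighbor_values)
    return imputed


def binarize(counts):
    """Binarized counts: 1 where a taxon or gene was detected, 0 otherwise."""
    return [[1 if x > 0 else 0 for x in row] for row in counts]


# Hypothetical data: rows are samples, columns are taxa.
observed = [
    [10, 5, 0],
    [12, 6, 3],
    [11, 4, 4],
    [0, 0, 50],
    [0, 0, 45],
]
imputed = impute_zeros(observed, k=2)
binary = binarize(observed)
```

In this toy matrix the zero in the first sample is filled in because both of its nearest neighbors carry that taxon, while the zeros in the last two samples are kept because their closest neighbor shares the same zeros.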


Compositional Data Analysis

Author: Vera Pawlowsky-Glahn

Publisher: John Wiley & Sons

Published: 2011-09-19

Total Pages: 405

ISBN-13: 0470711353

It is difficult to imagine that the statistical analysis of compositional data has been a major issue of concern for more than 100 years. It is even more difficult to realize that so many statisticians and users of statistics are unaware of the particular problems affecting compositional data, as well as their solutions. The issue of "spurious correlation", as the situation was phrased by Karl Pearson back in 1897, affects all data that measure parts of some whole, such as percentages, proportions, ppm and ppb. Such measurements are present in all fields of science, including geology, biology, environmental sciences, forensic sciences, medicine and hydrology. This book presents the history and development of compositional data analysis along with Aitchison's log-ratio approach. Compositional Data Analysis describes the state of the art both in theory and in applications across the different fields of science. Key Features: Reflects the state of the art in compositional data analysis. Gives an overview of the historical development of compositional data analysis, as well as basic concepts and procedures. Looks at advances in algebra and calculus on the simplex. Presents applications in different fields of science, including genomics, ecology, biology, geochemistry, planetology, chemistry and economics. Explores connections to correspondence analysis and the Dirichlet distribution. Presents a summary of three available software packages for compositional data analysis. Supported by an accompanying website featuring R code. Applied scientists working on compositional data analysis in any field of science, both academics and professionals, will benefit from this book, along with graduate students in any field of science working with compositional data.
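As a small illustration of the log-ratio approach mentioned above, the sketch below computes the centered log-ratio (clr) transform, one of the standard log-ratio transforms associated with Aitchison. It is not taken from the book or its accompanying R code. The function names and the example composition are assumptions made for illustration.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1 (proportions of a whole)."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs.

    Subtracting the mean log is the same as dividing each part by the
    geometric mean of the composition before taking logs.
    """
    if any(p <= 0 for p in parts):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical three-part composition, expressed as percentages.
percentages = [10.0, 20.0, 70.0]
proportions = closure(percentages)

clr_percent = clr(percentages)
clr_prop = clr(proportions)
```

The clr coordinates are the same whether the composition is given in percentages or in proportions, and they sum to zero. This is why log-ratios are a convenient way to work with parts of a whole: only the ratios between parts carry information, not the arbitrary total.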


Environmental Chemicals, the Human Microbiome, and Health Risk

Author: National Academies of Sciences, Engineering, and Medicine

Publisher: National Academies Press

Published: 2018-03-01

Total Pages: 123

ISBN-13: 0309468698

A great number of diverse microorganisms inhabit the human body and are collectively referred to as the human microbiome. Until recently, the role of the human microbiome in maintaining human health was not fully appreciated. Today, however, research is beginning to elucidate associations between perturbations in the human microbiome and human disease and the factors that might be responsible for the perturbations. Studies have indicated that the human microbiome could be affected by environmental chemicals or could modulate exposure to environmental chemicals. Environmental Chemicals, the Human Microbiome, and Health Risk presents a research strategy to improve our understanding of the interactions between environmental chemicals and the human microbiome and the implications of those interactions for human health risk. This report identifies barriers to such research and opportunities for collaboration, highlights key aspects of the human microbiome and its relation to health, describes potential interactions between environmental chemicals and the human microbiome, and reviews the risk-assessment framework and reasons for incorporating chemical-microbiome interactions.


Single-Cell Genomics

Author: Parwinder Kaur

Publisher: Springer

Published: 2025-06-13

Total Pages: 0

ISBN-13: 9783030409500

Cells, the basic units of biological structure and function, vary broadly in type and state. Individual cells are the building blocks of tissues, organs, and organisms. Each tissue contains cells of many types, and cells of each type can switch among biological states. Single-cell genomics, transcriptomics and epigenomics open a new era in which it is possible to interrogate every cell of an organism in order to decipher the important biological processes that occur within it. These approaches have emerged as a ground-breaking technology that has greatly enhanced our understanding of the complexity of gene expression dynamics at a microscopic resolution. It is anticipated that in the next 5-10 years, the wider research community will be routinely employing this powerful technology as a laboratory staple. Single-cell genomics, transcriptomics and epigenomics hold the potential to revolutionize the way we characterize complex cell assemblies and study their spatial organization, dynamics, clonal distribution, pathways, function, and crosstalk. These advances have opened up a new field of cell population genomics. Single-cell genomics, transcriptomics and epigenomics research is providing new insights into inter-cellular population genomic diversity, heterogeneity, specialization, taxonomy, spatial and temporal gene regulation, and cellular and organismal development and evolution. It is facilitating plant breeding, the understanding of human disease conditions, and personalized medicine. This book discusses the perspectives, progress, and promises of single-cell genomics, transcriptomics and epigenomics research and applications in addressing the above and other key biological aspects in all organisms. It establishes the current state of the field and serves as the foundation for future developments in single-cell genomics, transcriptomics, and epigenomics.


Single-cell Sequencing and Methylation

Author: Buwei Yu

Publisher: Springer Nature

Published: 2020-09-19

Total Pages: 247

ISBN-13: 9811544948

With the rapid development of biotechnologies, single-cell sequencing has become an important tool for understanding the molecular mechanisms of diseases, defining cellular heterogeneities and characteristics, and identifying intercellular communications and single-cell-based biomarkers. Providing a clear overview of the clinical applications, the book presents state-of-the-art information on immune cell function, cancer progression, infection, and inflammation gained from single-cell DNA or RNA sequencing. Furthermore, it explores the role of target gene methylation in the pathogenesis of diseases, with a focus on respiratory cancer, infection and chronic diseases. As such it is a valuable resource for clinical researchers and physicians, allowing them to refresh their knowledge and improve early diagnosis and therapy for patients.


Statistical and Computational Methods for Single-cell Transcriptome Sequencing and Metagenomics

Author: Fanny Perraudeau

Publisher:

Published: 2018

Total Pages: 246

ISBN-13:

I propose statistical methods and software for the analysis of single-cell transcriptome sequencing (scRNA-seq) and metagenomics data. Specifically, I present a general and flexible zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE) method, which extracts low-dimensional signal from scRNA-seq read counts, accounting for zero inflation (dropouts), over-dispersion, and the discrete nature of the data. Additionally, I introduce an application of the ZINB-WaVE method that identifies excess zero counts and generates gene and cell-specific weights to unlock bulk RNA-seq differential expression pipelines for zero-inflated data, boosting performance for scRNA-seq analysis. Finally, I present a method to estimate bacterial abundances in human metagenomes using full-length 16S sequencing reads.
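To make the zero-inflated negative binomial idea concrete, the sketch below evaluates a ZINB probability mass function and the posterior probability that an observed count came from the negative binomial component rather than from the excess-zero component. It is not the ZINB-WaVE implementation. The mean/dispersion parameterization, the function names, the parameter values, and the use of this posterior probability as an observation weight are assumptions made for illustration.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter theta > 0.

    Under this (assumed) parameterization the variance is mu + mu**2 / theta.
    """
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, theta, pi):
    """ZINB pmf: with probability pi the count is an excess zero,
    otherwise it is drawn from the negative binomial."""
    point_mass = pi if y == 0 else 0.0
    return point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)


def count_component_weight(y, mu, theta, pi):
    """Posterior probability that y was generated by the NB component.

    Equals 1 for any positive count and lies strictly between 0 and 1
    for a zero when 0 < pi < 1.
    """
    return (1.0 - pi) * nb_pmf(y, mu, theta) / zinb_pmf(y, mu, theta, pi)


# Hypothetical parameters for one gene in one cell.
mu, theta, pi = 2.0, 1.0, 0.3
weight_for_zero = count_component_weight(0, mu, theta, pi)
weight_for_three = count_component_weight(3, mu, theta, pi)
```

With these values, a zero is down-weighted because it is partly explained by the excess-zero component, while a positive count keeps a weight of 1. Weights of this kind are one way a zero-inflated model can be combined with a differential expression method that assumes a plain negative binomial.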


The New Science of Metagenomics

Author: National Research Council

Publisher: National Academies Press

Published: 2007-06-24

Total Pages: 170

ISBN-13: 0309106761

Although we can't usually see them, microbes are essential for every part of human life, and indeed for all life on Earth. The emerging field of metagenomics offers a new way of exploring the microbial world that will transform modern microbiology and lead to practical applications in medicine, agriculture, alternative energy, environmental remediation, and many other areas. Metagenomics allows researchers to look at the genomes of all of the microbes in an environment at once, providing a "meta" view of the whole microbial community and the complex interactions within it. It's a quantum leap beyond traditional research techniques that rely on studying, one at a time, the few microbes that can be grown in the laboratory. At the request of the National Science Foundation, five Institutes of the National Institutes of Health, and the Department of Energy, the National Research Council organized a committee to address the current state of metagenomics and identify the obstacles researchers currently face, in order to determine how best to support the field and encourage its success. The New Science of Metagenomics recommends the establishment of a "Global Metagenomics Initiative" comprising a small number of large-scale metagenomics projects as well as many medium- and small-scale projects to advance the technology and develop the standard practices needed to advance the field. The report also addresses database needs, methodological challenges, and the importance of interdisciplinary collaboration in supporting this new field.