High-dimensional Discriminant Analysis and Covariance Matrix Estimation

Author: Yilei Wu

Publisher:

Published: 2017

Total Pages: 169

ISBN-13:

Statistical analysis in high-dimensional settings, where the data dimension p is close to or larger than the sample size n, has been an intriguing area of research. Applications include gene expression data analysis, financial economics, text mining, and many others. Estimating large covariance matrices is an essential part of high-dimensional data analysis because of the ubiquity of covariance matrices in statistical procedures. The estimation is also challenging, since the sample covariance matrix is no longer an accurate estimator of the population covariance matrix in high dimensions. In this thesis, a series of matrix structures that facilitate covariance matrix estimation is studied.

First, we develop a set of innovative quadratic discriminant rules by applying the compound symmetry structure. For each class, we construct an estimator by pooling the diagonal elements as well as the off-diagonal elements of the sample covariance matrix, and substitute this estimator for the covariance matrix in the normal quadratic discriminant rule. Furthermore, we develop a more general rule that handles nonnormal data by incorporating an additional data transformation. Theoretically, as long as the population covariance matrices loosely conform to the compound symmetry structure, our specialized quadratic discriminant rules enjoy low asymptotic classification error. Computationally, they are easy to implement and do not require large-scale mathematical programming.

Next, we generalize the compound symmetry structure by assuming that the population covariance matrix (or, equivalently, its inverse, the precision matrix) can be decomposed into a diagonal component and a low-rank component. The rank of the low-rank component governs to what extent the decomposition can simplify the covariance/precision matrix and reduce the number of unknown parameters. In the estimation, this rank can either be pre-selected to be small or controlled by a penalty function. Under moderate conditions on the population covariance/precision matrix itself and on the penalty function, we prove consistency results for our estimator. A blockwise coordinate descent algorithm, which iteratively updates the diagonal component and the low-rank component, is then proposed to compute the estimator in practice.

Finally, we consider jointly estimating large covariance matrices of multiple categories. In addition to the aforementioned diagonal and low-rank matrix decomposition, it is further assumed that some common matrix structure is shared across the categories. We assume that the population precision matrix of category k can be decomposed into a diagonal matrix D, a shared low-rank matrix L, and a category-specific low-rank matrix Lk. The assumption can be understood in the framework of factor models: some latent factors affect all categories alike while others are specific to only one category. We propose a method that jointly estimates the precision matrices (and therefore the covariance matrices): D and L are estimated with the entire dataset, whereas Lk is estimated solely with the data of category k. An AIC-type penalty is applied to encourage the decomposition, especially the shared component. Under certain conditions on the population covariance matrices, consistency results are developed for the estimators. Finite-sample performance is demonstrated through numerical experiments.
Using simulated data, we demonstrate certain advantages of our methods over existing ones, in terms of classification error for the discriminant rules and Kullback-Leibler loss for the covariance matrix estimators. The proposed methods are also applied to real-life datasets, including microarray data, stock return data, and text data, to perform tasks such as distinguishing normal from diseased tissues, portfolio selection, and classifying webpages.
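As a concrete illustration of the first contribution, the sketch below pools the diagonal and off-diagonal entries of the sample covariance matrix into a compound-symmetry-style estimate and plugs it into the normal quadratic discriminant score. This is a minimal sketch under those stated assumptions, not the thesis's exact estimator; the function names and the plug-in prior are illustrative.

```python
import numpy as np

def pooled_cs_covariance(X):
    """Compound-symmetry-style estimate: pool the diagonal and the
    off-diagonal entries of the sample covariance matrix.
    A minimal illustration, not the thesis's exact estimator."""
    S = np.cov(X, rowvar=False)            # p x p sample covariance
    p = S.shape[0]
    v = np.diag(S).mean()                  # pooled variance
    c = S[~np.eye(p, dtype=bool)].mean()   # pooled covariance
    return c * np.ones((p, p)) + (v - c) * np.eye(p)

def qda_score(x, mean_k, cov_k, prior_k):
    """Normal quadratic discriminant score with a plug-in covariance."""
    diff = x - mean_k
    _, logdet = np.linalg.slogdet(cov_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov_k, diff) + np.log(prior_k)
```

Classification would assign x to the class with the largest score, with pooled_cs_covariance(X_k) substituted for each class covariance; no large-scale mathematical programming is involved.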


High-Dimensional Covariance Matrix Estimation

Author: Aygul Zagidullina

Publisher: Springer

Published: 2021-10-30

Total Pages: 115

ISBN-13: 9783030800642

This book presents covariance matrix estimation and related aspects of random matrix theory. It focuses on the sample covariance matrix estimator and provides a holistic description of its properties under two asymptotic regimes: the traditional one, and the high-dimensional regime that better fits the big data context. It draws attention to the deficiencies of standard statistical tools when used in the high-dimensional setting, and introduces the basic concepts and major results related to spectral statistics and random matrix theory under high-dimensional asymptotics in an understandable and reader-friendly way. The aim of this book is to inspire applied statisticians, econometricians, and machine learning practitioners who analyze high-dimensional data to apply the recent developments in their work.
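The deficiency the book highlights can be seen in a few lines of simulation (not taken from the book): with identity population covariance and concentration p/n = 0.5, the sample covariance eigenvalues spread over roughly the Marchenko-Pastur support instead of concentrating near the true value 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 100                      # concentration c = p/n = 0.5
X = rng.standard_normal((n, p))      # true covariance = identity
S = (X.T @ X) / n
eigs = np.linalg.eigvalsh(S)
c = p / n
print(f"sample eigenvalue range: [{eigs.min():.2f}, {eigs.max():.2f}]")
print(f"MP support:              [{(1 - np.sqrt(c))**2:.2f}, {(1 + np.sqrt(c))**2:.2f}]")
```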


Estimation of Discriminant Analysis Error Rate for High Dimensional Data

Author: Patricia K. Lebow

Publisher:

Published: 1992

Total Pages: 192

ISBN-13:

Methodologies for data reduction, modeling, and classification of grouped response curves are explored. In particular, the thesis focuses on the analysis of a collection of highly correlated, high-dimensional response-curve data: spectral reflectance curves of wood surface features. In the analysis, questions arise about applying cross-validation to estimate discriminant function error rates for data that have previously been transformed by principal component analysis. Performing cross-validation requires recalculating the principal component transformation and discriminant functions of the training sets, a very lengthy process. To address these questions, two alternatives are studied: a more efficient way of carrying out the cross-validation calculations, and estimating error rates without recomputing the principal component decomposition. If the populations are assumed to have a common covariance structure, the pooled covariance matrix can be decomposed for the principal component transformation. The leave-one-out cross-validation procedure then results in a rank-one update of the pooled covariance matrix for each observation left out. Algorithms have been developed for calculating the updated eigenstructure under rank-one updates, and they can be applied to the orthogonal decomposition of the pooled covariance matrix. Using these algorithms yields much faster computation of error rates, especially when the number of variables is large. The bias and variance of an estimator that performs leave-one-out cross-validation directly on the principal component scores (without recomputing the principal component transformation for each observation) are also investigated.
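The key computational point is that deleting one observation changes the pooled within-group scatter matrix by a rank-one term, so the eigenstructure can be updated rather than recomputed from scratch. A minimal sketch of that downdate, assuming observation x_i is removed from group k (the function name is illustrative; the thesis applies analogous rank-one updates directly to the orthogonal decomposition):

```python
import numpy as np

def scatter_downdate(A, x_i, xbar_k, n_k):
    """Rank-one downdate of the within-group scatter matrix A when
    observation x_i is removed from group k (mean xbar_k, size n_k).
    Standard identity; eigendecomposition updates build on this term."""
    d = x_i - xbar_k
    return A - (n_k / (n_k - 1.0)) * np.outer(d, d)
```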


Large Sample Covariance Matrices and High-Dimensional Data Analysis

Author: Jianfeng Yao

Publisher: Cambridge University Press

Published: 2015-03-26

Total Pages: 0

ISBN-13: 9781107065178

High-dimensional data appear in many fields, and their analysis has become increasingly important in modern statistics. However, it has long been observed that several well-known methods in multivariate analysis become inefficient, or even misleading, when the data dimension p is larger than, say, several tens. A seminal example is the well-known inefficiency of Hotelling's T² test in such cases. This example shows that classical large-sample limits may no longer hold for high-dimensional data; statisticians must seek new limiting theorems in these instances. Thus, the theory of random matrices (RMT) serves as a much-needed and welcome alternative framework. Based on the authors' own research, this book provides a first-hand introduction to new high-dimensional statistical methods derived from RMT. The book begins with a detailed introduction to useful tools from RMT, and then presents a series of high-dimensional problems with solutions provided by RMT methods.
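A small simulation (not from the book) makes the Hotelling T² example concrete: for fixed p the statistic remains exactly F-calibrated, but with p close to n the sample covariance is nearly singular and the test loses power, and once p >= n the statistic cannot be computed at all.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 60, 50                                # p close to n
X = rng.standard_normal((n, p))              # H0 true: zero mean, identity cov
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
T2 = n * xbar @ np.linalg.solve(S, xbar)     # Hotelling's T^2 statistic
F = T2 * (n - p) / (p * (n - 1))             # ~ F(p, n-p) under H0
print(f"T2 = {T2:.1f}, p-value = {stats.f.sf(F, p, n - p):.3f}")
```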


High-Dimensional Covariance Matrix Estimation

Author: Aygul Zagidullina

Publisher: Springer Nature

Published: 2021-10-29

Total Pages: 123

ISBN-13: 3030800652

This book presents covariance matrix estimation and related aspects of random matrix theory. It focuses on the sample covariance matrix estimator and provides a holistic description of its properties under two asymptotic regimes: the traditional one, and the high-dimensional regime that better fits the big data context. It draws attention to the deficiencies of standard statistical tools when used in the high-dimensional setting, and introduces the basic concepts and major results related to spectral statistics and random matrix theory under high-dimensional asymptotics in an understandable and reader-friendly way. The aim of this book is to inspire applied statisticians, econometricians, and machine learning practitioners who analyze high-dimensional data to apply the recent developments in their work.


High-Dimensional Covariance Matrix Estimation: Shrinkage Toward a Diagonal Target

Author: Mr. Sakai Ando

Publisher: International Monetary Fund

Published: 2023-12-08

Total Pages: 32

ISBN-13:

This paper proposes a novel shrinkage estimator for high-dimensional covariance matrices by extending the Oracle Approximating Shrinkage (OAS) of Chen et al. (2009) to target the diagonal elements of the sample covariance matrix. We derive the closed-form solution of the shrinkage parameter and show by simulation that, when the diagonal elements of the true covariance matrix exhibit substantial variation, our method reduces the Mean Squared Error, compared with the OAS that targets an average variance. The improvement is larger when the true covariance matrix is sparser. Our method also reduces the Mean Squared Error for the inverse of the covariance matrix.
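In code, the estimator has the generic linear-shrinkage form sketched below. The paper's contribution is the closed-form, OAS-style choice of the shrinkage parameter; that formula is not reproduced here, so rho is left as a user-supplied placeholder.

```python
import numpy as np

def shrink_to_diagonal(X, rho=0.5):
    """Linear shrinkage of the sample covariance toward its own diagonal:
        Sigma_hat = (1 - rho) * S + rho * diag(S).
    The paper derives a closed-form, OAS-style rho; that formula is NOT
    reproduced here, so the default rho = 0.5 is purely illustrative."""
    S = np.cov(X, rowvar=False)
    return (1.0 - rho) * S + rho * np.diag(np.diag(S))
```

Because the target keeps the sample variances, this form can outperform shrinkage toward an average-variance target when the true variances vary substantially, which is the regime the paper studies.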


Contributions to linear discriminant analysis with applications to growth curves

Author: Edward Kanuti Ngailo

Publisher: Linköping University Electronic Press

Published: 2020-05-06

Total Pages: 47

ISBN-13: 9179298567

This thesis concerns contributions to linear discriminant analysis with applications to growth curves. Firstly, we present the linear discriminant function coefficients in a stochastic representation using random variables from standard univariate distributions. We apply the characterized distribution in the classification function to approximate the classification error rate. The results are then extended to large-dimension asymptotics under the assumption that the dimension p of the parameter space increases to infinity together with the sample size n, such that the ratio p/n converges to a positive constant c ∈ (0, 1). Secondly, the thesis treats repeated measures data, which correspond to multiple measurements taken on the same subject at different time points. We develop a linear classification function to classify an individual into one of two populations on the basis of repeated measures data when the means follow a growth curve structure. The growth curve structure we first consider assumes that all treatments (groups) follow the same growth profile. However, this is not necessarily true in general, and the problem is extended to linear classification where the means follow an extended growth curve structure, i.e., the treatments under the experimental design follow different growth profiles. Finally, a function of the inverse Wishart matrix and a normal distribution finds application in portfolio theory, where the vector of optimal portfolio weights is proportional to the product of the inverse sample covariance matrix and a sample mean vector. Analytical expressions for higher-order moments and non-central moments of the portfolio weights are derived when the returns are assumed to be independently multivariate normally distributed. Moreover, expressions for the mean, variance, skewness, and kurtosis of specific estimated weights are obtained. The results are complemented by a Monte Carlo simulation study, where data from the multivariate normal and t-distributions are discussed.
This thesis studies discriminant analysis, classification of growth curves, and portfolio theory. Discriminant analysis and classification are multivariate techniques used to separate distinct sets of objects and to assign new objects to already defined groups (so-called classes). A classical method is to use Fisher's linear discriminant function, and when all parameters are known, the misclassification probabilities are easy to compute. Unfortunately, this is seldom the case; the parameters must be estimated from data, and Fisher's linear discriminant function then becomes a function of a Wishart matrix and multivariate normally distributed vectors. In this thesis we study how the misclassification probability can be computed approximately, under the assumption that the dimension of the parameter space grows together with the number of observations, by using a particular stochastic representation of the discriminant function. Repeated measurements over time on the same individual or object can be modeled with so-called growth curves. When classifying growth curves, or rather the repeated measurements of a new individual, one should make use of both the spatial and the temporal information contained in these observations. We extend Fisher's linear discriminant function to suit repeated measurements and compute asymptotic misclassification probabilities.
Finally, it can be noted that similar functions of Wishart matrices and multivariate normally distributed vectors appear when computing the optimal weights in portfolio theory. Through a stochastic representation we study the properties of the portfolio weights and also carry out a simulation study to understand what happens when the normality assumption is not satisfied.
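For the portfolio part, the estimated weight vector studied in the thesis is, up to normalization, the inverse sample covariance matrix times the sample mean vector. A minimal sketch (the function name and the sum-to-one normalization are illustrative choices):

```python
import numpy as np

def portfolio_weights(returns):
    """Estimated optimal portfolio weights: proportional to the inverse
    sample covariance matrix times the sample mean vector, here
    normalized to sum to one (an illustrative normalization)."""
    xbar = returns.mean(axis=0)          # sample mean vector
    S = np.cov(returns, rowvar=False)    # sample covariance matrix
    w = np.linalg.solve(S, xbar)         # S^{-1} xbar
    return w / w.sum()
```

Since S follows a Wishart distribution under multivariate normal returns, the moments of these weights can be derived analytically, which is what the thesis exploits.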


Shrinkage Estimation for Mean and Covariance Matrices

Author: Hisayuki Tsukuma

Publisher: Springer Nature

Published: 2020-04-16

Total Pages: 119

ISBN-13: 9811515964

This book provides a self-contained introduction to shrinkage estimation for matrix-variate normal distribution models. More specifically, it presents recent techniques and results in the estimation of mean and covariance matrices in a high-dimensional setting that implies singularity of the sample covariance matrix. Such high-dimensional models can be analyzed using the same arguments as for low-dimensional models, thus yielding a unified approach to both high- and low-dimensional shrinkage estimation. The unified shrinkage approach not only integrates modern and classical shrinkage estimation but is also required for further development of the field. Beginning with the notion of decision-theoretic estimation, this book explains matrix theory, group invariance, and other mathematical tools for finding better estimators. It also includes examples of shrinkage estimators for improving standard estimators, such as least squares, maximum likelihood, and minimum risk invariant estimators, and discusses the historical background and related topics in decision-theoretic estimation of parameter matrices. This book is useful for researchers and graduate students in various fields requiring data analysis skills, as well as in mathematical statistics.
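For readers new to the area, the classical James-Stein estimator of a normal mean vector is perhaps the simplest instance of the decision-theoretic shrinkage idea the book develops for matrix-variate models. A standard positive-part version (not taken from the book):

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimator of a p-dimensional normal
    mean (p >= 3): shrinks the observation x toward the origin and
    dominates the maximum likelihood estimator under squared error loss."""
    p = x.size
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / float(x @ x))
    return factor * x
```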


Large Dimensional Covariance Matrix Estimation with Decomposition-based Regularization

Author:

Publisher:

Published: 2014

Total Pages: 129

ISBN-13:

Estimation of population covariance matrices from samples of multivariate data is of great importance. When the dimension of a covariance matrix is large but the sample size is limited, it is well known that the sample covariance matrix is unsatisfactory. However, improving covariance matrix estimation is not straightforward, mainly because of the constraint of positive definiteness. This thesis considers decomposition-based methods to circumvent this primary difficulty. Two approaches to covariance matrix estimation that regularize the factor matrices of a decomposition are included: one relies on the modified Cholesky decomposition of Pourahmadi, and the other, based on the matrix exponential and matrix logarithm, is closely related to the spectral decomposition.

We explore covariance matrix estimation by imposing L1 regularization on the entries of the Cholesky factor matrices, and find that the resulting estimates are not sensitive to the order of the variables. A given order of the variables is a prerequisite for applying the modified Cholesky decomposition, while in practice such ordering information is often unavailable. We take advantage of this insensitivity to remove the requirement of ordering information, and propose an order-invariant covariance matrix estimate obtained by refining estimates corresponding to different orders of the variables. The refinement not only guarantees the positive definiteness of the estimated covariance matrix but is also applicable in general situations where the order of the variables is not pre-specified. The refined estimate can be approximated by combining only a moderate number of representative estimates. Numerical simulations are conducted to evaluate the performance of the proposed method in comparison with several other estimates.

By applying the matrix exponential technique, the problem of estimating positive definite covariance matrices is transformed into a problem of estimating symmetric matrices. There are close connections between covariance matrices and their logarithm matrices, and thus pursuing a matrix logarithm with certain properties helps recover the original covariance matrix. The covariance matrix estimate obtained by applying L1 regularization to the entries of the matrix logarithm is compared with other estimates in simulation studies and real data analysis.
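A minimal sketch of the matrix-logarithm route, with elementwise soft-thresholding standing in for the thesis's L1-penalized fit (the ridge term, the unpenalized diagonal, and the function name are illustrative choices, not the thesis's exact procedure):

```python
import numpy as np
from scipy.linalg import expm, logm

def matrix_log_threshold_cov(X, lam):
    """Sketch: map the sample covariance to its matrix logarithm, apply
    elementwise soft-thresholding (a stand-in for an L1-penalized fit),
    and map back with the matrix exponential, which guarantees a
    positive definite result."""
    S = np.cov(X, rowvar=False)
    S = S + 1e-6 * np.eye(S.shape[0])          # keep logm well defined
    L = np.real(logm(S))                       # symmetric log-matrix
    L_soft = np.sign(L) * np.maximum(np.abs(L) - lam, 0.0)
    np.fill_diagonal(L_soft, np.diag(L))       # leave the diagonal unpenalized
    return expm((L_soft + L_soft.T) / 2.0)     # symmetrize, map back
```

The exponential map is what circumvents the positive definiteness constraint: any symmetric estimate of the log-matrix maps back to a valid covariance matrix.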