Talker Variability in Speech Processing

Author: Keith Johnson

Published: 1997

Total Pages: 264

In this text, the editors aim to address how speech patterns are mapped onto mental representations. The chapters cover theories of perception and cognition, issues in clinical speech pathology, and the practical concerns of speech technology.


Dynamic Speech Models

Author: Li Deng

Publisher: Morgan & Claypool Publishers

Published: 2006-12-01

Total Pages: 118

ISBN-13: 1598290657

Speech dynamics refer to the temporal characteristics in all stages of the human speech communication process. This speech “chain” starts with the formation of a linguistic message in a speaker's brain and ends with the arrival of the message in a listener's brain. Given the intricacy of the dynamic speech process and its fundamental importance in human communication, this monograph is intended to provide comprehensive material on mathematical models of speech dynamics and to address the following issues: How do we make sense of the complex speech process in terms of its functional role in speech communication? How do we quantify the special role of speech timing? How do the dynamics relate to the variability of speech that has often been said to seriously hamper automatic speech recognition? How do we put the dynamic process of speech into a quantitative form to enable detailed analyses? And finally, how can we incorporate the knowledge of speech dynamics into computerized speech analysis and recognition algorithms? The answers to all these questions require building and applying computational models for the dynamic speech process.

What are the compelling reasons for carrying out dynamic speech modeling? We provide the answer in two related aspects. First, scientific inquiry into the human speech code has been relentlessly pursued for several decades. As an essential carrier of human intelligence and knowledge, speech is the most natural form of human communication. Embedded in the speech code are linguistic (as well as para-linguistic) messages, which are conveyed through four levels of the speech chain. Underlying the robust encoding and transmission of the linguistic messages are the speech dynamics at all four levels. Mathematical modeling of speech dynamics provides an effective tool for the scientific study of the speech chain. Such scientific studies help us understand why humans speak as they do and how humans exploit redundancy and variability by way of multitiered dynamic processes to enhance the efficiency and effectiveness of human speech communication.

Second, advancement of human language technology, especially in automatic recognition of natural-style human speech, is also expected to benefit from comprehensive computational modeling of speech dynamics. The limitations of current speech recognition technology are serious and well known. A commonly acknowledged and frequently discussed weakness of the statistical model underlying current speech recognition technology is the lack of adequate dynamic modeling schemes to provide correlation structure across the temporal speech observation sequence. Unfortunately, for a variety of reasons, the majority of current research activities in this area favor only incremental modifications and improvements to the existing HMM-based state of the art. For example, while dynamic and correlation modeling is known to be an important topic, most systems nevertheless employ only an ultra-weak form of speech dynamics, e.g., differential or delta parameters. Strong-form dynamic speech modeling, which is the focus of this monograph, may serve as an ultimate solution to this problem.

After the introductory chapter, the main body of this monograph consists of four chapters. They cover various aspects of theory, algorithms, and applications of dynamic speech models, and provide a comprehensive survey of the research work in this area spanning the past 20 years. This monograph is intended as advanced material on speech and signal processing for graduate-level teaching, for professionals and engineering practitioners, as well as for seasoned researchers and engineers specializing in speech processing.
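
The delta parameters that the description above calls an ultra-weak form of speech dynamics are commonly computed as a short linear regression over neighboring feature frames. The sketch below illustrates that general idea only. The function name `delta`, the `window` parameter, and the edge handling (repeating the first and last frame) are assumptions made for this example and are not taken from the monograph.

```python
def delta(frames, window=2):
    """Regression-based delta features for a list of feature vectors.

    frames: list of equal-length lists of floats (one list per time frame).
    window: number of frames on each side used in the regression (assumed default).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    num_frames = len(frames)
    if num_frames == 0:
        return []
    dim = len(frames[0])
    # Normalizer of the regression: 2 * sum of n^2 for n = 1..window.
    denom = 2 * sum(n * n for n in range(1, window + 1))

    def frame_at(t):
        # Assumed edge handling: repeat the first or last frame past the ends.
        return frames[min(max(t, 0), num_frames - 1)]

    deltas = []
    for t in range(num_frames):
        row = []
        for k in range(dim):
            acc = sum(
                n * (frame_at(t + n)[k] - frame_at(t - n)[k])
                for n in range(1, window + 1)
            )
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Usage: a one-dimensional feature that rises by 1 per frame.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(delta(ramp, window=2)[2])  # [1.0], the slope at the middle frame
```

Each delta value depends only on a few neighboring frames, which is why such features capture local slope but not the longer-range correlation structure that the description says strong-form dynamic models target.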


Efficient Models of Intrinsic Variability in Speech Recognition and Speech Therapy

Author: Shou-Chun Yin

Published: 2014

"The objective of this thesis study is to develop statistical modeling techniques for characterizing phonetic variation in automatic speech recognition (ASR). One issue addressed in this domain is to reliably detect the phoneme level mispronunciations in speech utterances that arise from speech therapy applications. Another issue addressed in this work is to study the ability of ASR systems to model the phonetic variation that often exists in speaker-independent recognition tasks. Both issues will be treated as examples of the same basic problem in robustly modeling phonetic variability in ASR.The technical contributions involved in this thesis which address these issues are presented as follows. First, a phoneme level pronunciation verification (PV) scenario is investigated for detecting the mispronunciation occurrences in speech utterances recorded from a population of impaired children with neuromuscular disorders. The well known continuous density hidden Markov model (CDHMM) is used as a phoneme decoder which generates a finite state network of phoneme string hypotheses for input speech utterances.The phoneme level confidence measures can be constructed from this network, and PV decision can be made by comparing the confidence measures with a pre-selected threshold. Some well known state-of-the-art ASR techniques are incorporated in this PV scenario, and the experimental studies show how these techniques can impact the verification accuracy.Second, the subspace Gaussian mixture model (SGMM) formalism is investigated. This acoustic model is shown to provide an efficient model of phonetic variability in speech. In the experimental studies, it can be shown that a 18.74% relative reduction in word error rate with respect to the well known CDHMM acoustic model can be achieved on a medium vocabulary ASR task. 
Furthermore, it is demonstrated that a 24.79% relative reduction in phone error rate with respect to the CDHMM can be achieved for an unimpaired children speech corpus.Finally, the SGMM is incorporated into a new PV scenario. A new kind of pronunciation confidence measure used for making mispronunciation verification decisions is extracted directly from the state level model parameters. Both session level and utterance level PV scenarios based on the SGMM based confidence measures are proposed. In the session level PV task, the equal error rate can be reduced by 15.35% when combining the SGMM based confidence measures with the above phoneme decoder based confidence measures. In the utterance level PV task, the equal error rate can be reduced by 12.94%. This equal error rate reduction is believed to result from an efficient characterization of pronunciation variation for each phoneme by the SGMM." --
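
The abstract above relies on three quantities: a threshold decision on a confidence measure, the equal error rate (EER), and a relative reduction of an error rate. The sketch below shows one simple way each could be computed. All function names and the example scores are hypothetical, and the thesis may define its confidence measures and EER estimate differently.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


def pv_decision(confidence, threshold):
    """Accept a phoneme as correctly pronounced if its confidence reaches the threshold."""
    return confidence >= threshold


def equal_error_rate(correct_scores, mispronounced_scores):
    """Return (eer, threshold) where false-reject and false-accept rates are closest.

    Assumes higher scores mean 'more likely correctly pronounced' and that
    both score lists are non-empty.
    """
    best = None
    for th in sorted(set(correct_scores) | set(mispronounced_scores)):
        # Correct pronunciations rejected at this threshold.
        false_reject = sum(s < th for s in correct_scores) / len(correct_scores)
        # Mispronunciations accepted at this threshold.
        false_accept = sum(s >= th for s in mispronounced_scores) / len(mispronounced_scores)
        gap = abs(false_reject - false_accept)
        if best is None or gap < best[0]:
            best = (gap, (false_reject + false_accept) / 2, th)
    return best[1], best[2]


# Usage with made-up confidence scores (not data from the thesis).
correct = [0.9, 0.8, 0.7, 0.4]
mispronounced = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(correct, mispronounced)
print(eer, threshold)                  # 0.25 0.6
print(pv_decision(0.7, threshold))     # True
print(relative_reduction(20.0, 15.0))  # 0.25
```

Sweeping the threshold over the observed scores is the simplest estimate of the EER. With few scores the two error rates may never be exactly equal, so the sketch averages them at the closest point.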


Computational Models of Speech Pattern Processing

Author: Keith Ponting

Publisher: Springer Science & Business Media

Published: 2012-12-06

Total Pages: 478

ISBN-13: 3642600875

Proceedings of the NATO Advanced Study Institute on Computational Models of Speech Pattern Processing, held in St. Helier, Jersey, UK, July 7-18, 1997


New Paradigms for Modeling Acoustic Variation in Speech Processing

Author: Sina Hamidi Ghalehjegh

Published: 2016

"A speech signal consists of several sources of information including that associated with the sequence of phonemes in the spoken language and physiological characteristics of the speaker. Depending on the application of the speech processing system, some of this information is considered relevant to the task and some information is considered to be unwanted variability. A desired system should effectively characterize the relevant information sources, while eliminating the irrelevant variability. Fulfilling this, however, is not an easy task, because there are so many variations in acoustic conditions, speaker populations and channel conditions. As a result, there will always be an unseen context, a new speaker or an unseen environment whose characteristics are poorly represented by the system. The objective of this dissertation is to address these issues. There are four major contributions in this work.First, a technique for reducing speaker and channel variabilities is investigated in the subspace Gaussian mixture model (SGMM) framework for automatic speech recognition (ASR). The SGMM differs from the more well-known Gaussian mixture model (GMM) in that the majority of its parameters are shared across all the hidden Markov model (HMM) states and a relatively small number of parameters are state-specific. The sharing mechanism allows training ASR systems for speech datasets with limited amount of data using out-of-domain data. However, it can be problematic if the sources of data are from differing acoustic and channel conditions. An acoustic normalization technique is proposed for compensating for these sources of mismatch.Second, a two-stage speaker adaptation technique is investigated in the context of the SGMM for ASR. In the first stage, an efficient approach is presented for adapting the state-specific parameters in the SGMM. 
This is motivated by the study that shows state-specific parameters provide a compact and well-behaved characterization of phonetic information in the speech. In the second stage, an efficient approach is presented for a feature-space adaptation in the SGMM. Third, the use of a graph embedding framework is investigated as a regularization technique in the speaker adaptation formalism for the GMM. The technique is motivated by the fact that graph embeddings of feature vectors provide useful characterizations of the underlying manifolds on which these features lie. Incorporating these characteristics in the optimization criteria for the speaker adaptation algorithm has the effect of constraining the solution space in a way that preserves the local structure of the data. This is important, since graph embedding is generally done offline in an unsupervised manner. Therefore, large amounts of unlabeled data could potentially be used to improve the performance of the speaker adaptation technique.Finally, a technique for reducing phonetic variability is investigated for speaker verification systems. A deep neural network (DNN), trained to discriminate among speakers, is applied to improve performance in speaker verification. Features obtained from the DNN are used in an i-vector-based speaker verification system. The features derived from this network are thought to be more robust with respect to phonetic variability, which is generally considered to have a negative impact on the performance. It is found that improved performance can be obtained by appending these features to the more widely used Mel-frequency cepstrum coefficients (MFCCs)." --
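
The parameter sharing described in the abstract above can be written compactly. The equations below follow a commonly cited basic form of the SGMM (without sub-states or speaker subspaces). The notation is an assumption made for illustration, since the listing does not give the dissertation's own symbols.

```latex
% Assumed notation: j = HMM state, i = index of a shared Gaussian (1..I),
% x = acoustic feature vector, v_j = low-dimensional state-specific vector,
% M_i, w_i, Sigma_i = parameters shared across all states.
\begin{align}
p(\mathbf{x} \mid j) &= \sum_{i=1}^{I} w_{ji}\,
    \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_{ji},\, \boldsymbol{\Sigma}_i\right), \\
\boldsymbol{\mu}_{ji} &= \mathbf{M}_i \mathbf{v}_j, \\
w_{ji} &= \frac{\exp\!\left(\mathbf{w}_i^{\top} \mathbf{v}_j\right)}
               {\sum_{i'=1}^{I} \exp\!\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_j\right)}.
\end{align}
```

In this form only the vector v_j is specific to state j, while the projection matrices, weight vectors and covariances are shared. This matches the abstract's statement that most parameters are shared and only a small number are state-specific.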


Learning Models of Speaker Variation

Author: Carnegie-Mellon University. Computer Science Dept

Published: 1996

Total Pages: 194

Abstract: "Speaker based variability is an important component of the speech signal, whether it is regarded as a nuisance, impeding speech recognition, or a goal, improving speech synthesis. Although many speech recognisers attempt to avoid errors caused by speaker variation, and a few synthesisers attempt to produce a wide range of voices, these efforts tend to be narrowly focused on the task at hand, rather than based on a general model of the variation. What work has been done on modelling variability itself, on the other hand, has mainly aimed at understanding specific linguistic events, rather than at providing an implementation that is practical. This thesis attempts to bridge the gap between these two approaches, by using statistical and connectionist techniques to separate out, and to model, the speaker variability component of the speech signal. A number of these models are built and examined for speaker specificity and speed of convergence. Two applications for speaker models are studied with mixed results: speaker adaptation without parameter reestimation for recognition, and mimicry by transforming the voice personality of synthetic speech."