Efficient Models of Intrinsic Variability in Speech Recognition and Speech Therapy

Author: Shou-Chun Yin

Publisher:

Published: 2014

Total Pages:

ISBN-13:

"The objective of this thesis study is to develop statistical modeling techniques for characterizing phonetic variation in automatic speech recognition (ASR). One issue addressed in this domain is to reliably detect the phoneme-level mispronunciations in speech utterances that arise from speech therapy applications. Another issue addressed in this work is to study the ability of ASR systems to model the phonetic variation that often exists in speaker-independent recognition tasks. Both issues will be treated as examples of the same basic problem in robustly modeling phonetic variability in ASR. The technical contributions of this thesis that address these issues are as follows. First, a phoneme-level pronunciation verification (PV) scenario is investigated for detecting mispronunciations in speech utterances recorded from a population of impaired children with neuromuscular disorders. The well-known continuous density hidden Markov model (CDHMM) is used as a phoneme decoder which generates a finite state network of phoneme string hypotheses for input speech utterances. Phoneme-level confidence measures can be constructed from this network, and a PV decision can be made by comparing the confidence measures with a pre-selected threshold. Some well-known state-of-the-art ASR techniques are incorporated in this PV scenario, and the experimental studies show how these techniques can impact the verification accuracy. Second, the subspace Gaussian mixture model (SGMM) formalism is investigated. This acoustic model is shown to provide an efficient model of phonetic variability in speech. In the experimental studies, it can be shown that an 18.74% relative reduction in word error rate with respect to the well-known CDHMM acoustic model can be achieved on a medium vocabulary ASR task.
Furthermore, it is demonstrated that a 24.79% relative reduction in phone error rate with respect to the CDHMM can be achieved for an unimpaired children's speech corpus. Finally, the SGMM is incorporated into a new PV scenario. A new kind of pronunciation confidence measure used for making mispronunciation verification decisions is extracted directly from the state-level model parameters. Both session-level and utterance-level PV scenarios based on the SGMM-based confidence measures are proposed. In the session-level PV task, the equal error rate can be reduced by 15.35% when combining the SGMM-based confidence measures with the above phoneme decoder based confidence measures. In the utterance-level PV task, the equal error rate can be reduced by 12.94%. This equal error rate reduction is believed to result from an efficient characterization of pronunciation variation for each phoneme by the SGMM."
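The threshold-based verification decision and the error measures mentioned in this abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the example scores, and the simple threshold sweep used to approximate the equal error rate are all assumptions made for illustration, and the sketch assumes both correctly pronounced and mispronounced examples are present.

```python
def verify(confidence, threshold):
    """Accept a phoneme as correctly pronounced if its confidence meets the threshold."""
    return confidence >= threshold


def error_rates(scores, labels, threshold):
    """Return (false acceptance rate, false rejection rate) at one threshold.

    labels: True = correctly pronounced, False = mispronounced.
    """
    pairs = list(zip(scores, labels))
    false_accepts = sum(1 for s, ok in pairs if not ok and verify(s, threshold))
    false_rejects = sum(1 for s, ok in pairs if ok and not verify(s, threshold))
    n_bad = sum(1 for ok in labels if not ok)
    n_good = sum(1 for ok in labels if ok)
    return false_accepts / n_bad, false_rejects / n_good


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping the threshold over the observed scores.

    Returns (eer, threshold) at the point where the two error rates are closest.
    """
    best = None
    for t in sorted(set(scores)):
        far, frr = error_rates(scores, labels, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical confidence scores and ground-truth labels (not data from the thesis).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, True, False, False, False]
eer, threshold = equal_error_rate(scores, labels)
```

A relative reduction is computed against the baseline error rate, so a hypothetical drop from 20.0% to 17.0% corresponds to `relative_reduction(20.0, 17.0)`, a 15% relative reduction.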


Dynamic Speech Models

Author: Li Deng

Publisher: Morgan & Claypool Publishers

Published: 2006-12-01

Total Pages: 118

ISBN-13: 1598290657

Speech dynamics refer to the temporal characteristics in all stages of the human speech communication process. This speech “chain” starts with the formation of a linguistic message in a speaker's brain and ends with the arrival of the message in a listener's brain. Given the intricacy of the dynamic speech process and its fundamental importance in human communication, this monograph is intended to provide comprehensive material on mathematical models of speech dynamics and to address the following issues: How do we make sense of the complex speech process in terms of its functional role of speech communication? How do we quantify the special role of speech timing? How do the dynamics relate to the variability of speech that has often been said to seriously hamper automatic speech recognition? How do we put the dynamic process of speech into a quantitative form to enable detailed analyses? And finally, how can we incorporate the knowledge of speech dynamics into computerized speech analysis and recognition algorithms? The answers to all these questions require building and applying computational models for the dynamic speech process. What are the compelling reasons for carrying out dynamic speech modeling? We provide the answer in two related aspects. First, scientific inquiry into the human speech code has been relentlessly pursued for several decades. As an essential carrier of human intelligence and knowledge, speech is the most natural form of human communication. Embedded in the speech code are linguistic (as well as para-linguistic) messages, which are conveyed through four levels of the speech chain. Underlying the robust encoding and transmission of the linguistic messages are the speech dynamics at all four levels. Mathematical modeling of speech dynamics provides an effective tool in the scientific methods of studying the speech chain.
Such scientific studies help understand why humans speak as they do and how humans exploit redundancy and variability by way of multitiered dynamic processes to enhance the efficiency and effectiveness of human speech communication. Second, advancement of human language technology, especially in automatic recognition of natural-style human speech, is also expected to benefit from comprehensive computational modeling of speech dynamics. The limitations of current speech recognition technology are serious and are well known. A commonly acknowledged and frequently discussed weakness of the statistical model underlying current speech recognition technology is the lack of adequate dynamic modeling schemes to provide correlation structure across the temporal speech observation sequence. Unfortunately, due to a variety of reasons, the majority of current research activities in this area favor only incremental modifications and improvements to the existing HMM-based state-of-the-art. For example, while dynamic and correlation modeling is known to be an important topic, most systems nevertheless employ only an ultra-weak form of speech dynamics, e.g., differential or delta parameters. Strong-form dynamic speech modeling, which is the focus of this monograph, may serve as an ultimate solution to this problem. After the introduction chapter, the main body of this monograph consists of four chapters. They cover various aspects of theory, algorithms, and applications of dynamic speech models, and provide a comprehensive survey of the research work in this area spanning the past 20 years. This monograph is intended as advanced material on speech and signal processing for graduate-level teaching, for professionals and engineering practitioners, as well as for seasoned researchers and engineers specialized in speech processing.
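The "differential or delta parameters" mentioned above are usually computed as a short regression over neighbouring feature frames. The Python sketch below illustrates one commonly used regression formula. It is an assumption for illustration rather than the monograph's own definition, and the function name, window size, and edge handling (repeating the first and last frame) are hypothetical choices.

```python
def delta_features(frames, window=2):
    """Regression-based delta coefficients for a sequence of feature vectors.

    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)

    Frames requested beyond either end of the sequence are replaced by the
    first or last frame (an assumed edge-handling choice).
    """
    if not frames:
        return []
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    dim = len(frames[0])
    deltas = []
    for t in range(len(frames)):
        d = [0.0] * dim
        for n in range(1, window + 1):
            ahead = frames[min(t + n, last)]
            behind = frames[max(t - n, 0)]
            for k in range(dim):
                d[k] += n * (ahead[k] - behind[k])
        deltas.append([value / denom for value in d])
    return deltas


# Hypothetical two-dimensional features that change linearly over time.
ramp = [[float(t), 2.0 * t] for t in range(6)]
ramp_deltas = delta_features(ramp)
```

For a linear ramp, the interior frames recover the slope of each dimension, which shows why deltas capture only a local, first-order view of the dynamics.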


Talker Variability in Speech Processing

Author: Keith Johnson

Publisher:

Published: 1997

Total Pages: 264

ISBN-13:

In this text, the editors aim to convert the mapping of speech patterns into mental representations. They cover theories of perception and cognition, issues in clinical speech pathology, and the practical concerns of speech technology.


Invariance and Variability in Speech Processes

Author: J. S. Perkell

Publisher: Psychology Press

Published: 2014-01-14

Total Pages: 699

ISBN-13: 1317768280

First published in 1986. The important implications of speech variability for the future of speech-related technology, in combination with the multifaceted debate about invariance among speech scientists, make this a most appropriate time to evaluate the state of our knowledge in this area. On October 8-10, 1983, researchers from the fields of production, perception, acoustics, pathology, psychology, linguistics, language acquisition, synthesis and recognition met at a symposium at M.I.T. on invariance and variability of speech processes. This volume is the proceedings of the symposium. Each chapter of the book consists of a focus paper followed by some comments.


New Paradigms for Modeling Acoustic Variation in Speech Processing

Author: Sina Hamidi Ghalehjegh

Publisher:

Published: 2016

Total Pages:

ISBN-13:

"A speech signal consists of several sources of information including that associated with the sequence of phonemes in the spoken language and physiological characteristics of the speaker. Depending on the application of the speech processing system, some of this information is considered relevant to the task and some information is considered to be unwanted variability. A desired system should effectively characterize the relevant information sources, while eliminating the irrelevant variability. Fulfilling this, however, is not an easy task, because there are so many variations in acoustic conditions, speaker populations and channel conditions. As a result, there will always be an unseen context, a new speaker or an unseen environment whose characteristics are poorly represented by the system. The objective of this dissertation is to address these issues. There are four major contributions in this work. First, a technique for reducing speaker and channel variabilities is investigated in the subspace Gaussian mixture model (SGMM) framework for automatic speech recognition (ASR). The SGMM differs from the better-known Gaussian mixture model (GMM) in that the majority of its parameters are shared across all the hidden Markov model (HMM) states and a relatively small number of parameters are state-specific. The sharing mechanism allows training ASR systems for speech datasets with limited amounts of data using out-of-domain data. However, it can be problematic if the sources of data are from differing acoustic and channel conditions. An acoustic normalization technique is proposed for compensating for these sources of mismatch. Second, a two-stage speaker adaptation technique is investigated in the context of the SGMM for ASR. In the first stage, an efficient approach is presented for adapting the state-specific parameters in the SGMM.
This is motivated by a study showing that state-specific parameters provide a compact and well-behaved characterization of phonetic information in the speech. In the second stage, an efficient approach is presented for feature-space adaptation in the SGMM. Third, the use of a graph embedding framework is investigated as a regularization technique in the speaker adaptation formalism for the GMM. The technique is motivated by the fact that graph embeddings of feature vectors provide useful characterizations of the underlying manifolds on which these features lie. Incorporating these characteristics in the optimization criteria for the speaker adaptation algorithm has the effect of constraining the solution space in a way that preserves the local structure of the data. This is important, since graph embedding is generally done offline in an unsupervised manner. Therefore, large amounts of unlabeled data could potentially be used to improve the performance of the speaker adaptation technique. Finally, a technique for reducing phonetic variability is investigated for speaker verification systems. A deep neural network (DNN), trained to discriminate among speakers, is applied to improve performance in speaker verification. Features obtained from the DNN are used in an i-vector-based speaker verification system. The features derived from this network are thought to be more robust with respect to phonetic variability, which is generally considered to have a negative impact on the performance. It is found that improved performance can be obtained by appending these features to the more widely used Mel-frequency cepstrum coefficients (MFCCs)."
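The parameter sharing described in this abstract can be sketched in a few lines of Python. The sketch assumes the basic published SGMM formulation, in which each state j has only a low-dimensional vector v_j, the component means are M_i v_j, and the mixture weights are a softmax over w_i . v_j, with the projections M_i and w_i shared by all states. The dissertation's exact variant is not given in the abstract, so the function name, argument names, and toy dimensions below are hypothetical.

```python
import math


def state_gmm(shared_M, shared_w, v):
    """Derive one state's component means and mixture weights from shared parameters.

    shared_M: list of I projection matrices, each D x S (nested lists), shared by all states
    shared_w: list of I weight-projection vectors, each of length S, shared by all states
    v:        the state-specific vector of length S (the only per-state parameters here)
    """
    size = len(v)
    # Mean of component i: mu_i = M_i v
    means = [[sum(row[s] * v[s] for s in range(size)) for row in M] for M in shared_M]
    # Weight of component i: softmax over w_i . v (shifted by the max for numerical stability)
    logits = [sum(w[s] * v[s] for s in range(size)) for w in shared_w]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return means, weights


# Toy example: I = 2 shared components, feature dimension D = 3, subspace dimension S = 2.
shared_M = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
]
shared_w = [[0.0, 0.0], [0.0, 0.0]]
means, weights = state_gmm(shared_M, shared_w, [1.0, 2.0])
```

Under these assumptions, adding a state costs only S extra numbers, whereas a conventional GMM state would need its own I x D means (plus covariances and weights). This is the sense in which most parameters are shared.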


Nonlinear Speech Modeling and Applications

Author: Gerard Chollet

Publisher: Springer Science & Business Media

Published: 2005-07-04

Total Pages: 444

ISBN-13: 3540274413

This book presents the revised tutorial lectures given at the International Summer School on Nonlinear Speech Processing-Algorithms and Analysis held in Vietri sul Mare, Salerno, Italy in September 2004. The 14 revised tutorial lectures by leading international researchers are organized in topical sections on dealing with nonlinearities in speech signals, acoustic-to-articulatory modeling of speech phenomena, data driven and speech processing algorithms, and algorithms and models based on speech perception mechanisms. Besides the tutorial lectures, 15 revised reviewed papers are included presenting original research results on task oriented speech applications.


Advances in Non-Linear Modeling for Speech Processing

Author: Raghunath S. Holambe

Publisher: Springer Science & Business Media

Published: 2012-02-21

Total Pages: 109

ISBN-13: 1461415047

Advances in Non-Linear Modeling for Speech Processing includes advanced topics in non-linear estimation and modeling techniques along with their applications to speaker recognition. A non-linear aeroacoustic modeling approach is used to estimate the important fine-structure speech events, which are not revealed by the short time Fourier transform (STFT). This aeroacoustic modeling approach provides the impetus for the high resolution Teager energy operator (TEO). This operator is characterized by a time resolution that can track rapid signal energy changes within a glottal cycle. Cepstral features such as linear prediction cepstral coefficients (LPCC) and mel frequency cepstral coefficients (MFCC) are computed from the magnitude spectrum of the speech frame, and the phase spectrum is neglected. To overcome the problem of neglecting the phase spectrum, the speech production system can be represented as an amplitude modulation-frequency modulation (AM-FM) model. To demodulate the speech signal and estimate the amplitude envelope and instantaneous frequency components, the energy separation algorithm (ESA) and the Hilbert transform demodulation (HTD) algorithm are discussed. Different features derived using the above non-linear modeling techniques are used to develop a speaker identification system. Finally, it is shown that the fusion of speech production and speech perception mechanisms can lead to a robust feature set.
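The Teager energy operator mentioned above has a compact discrete-time form, psi[n] = x[n]^2 - x[n-1] * x[n+1], which needs only three neighbouring samples. That is why it can follow rapid energy changes. The Python sketch below illustrates this standard definition. The function name and the test tone are hypothetical and are not taken from the book.

```python
import math


def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1] * x[n+1] for interior samples.

    The output has len(x) - 2 values because the first and last samples
    lack one of the two neighbours the operator needs.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Hypothetical test tone: amplitude 2.0, digital frequency 0.3 rad/sample.
amplitude, omega, phase = 2.0, 0.3, 0.5
tone = [amplitude * math.cos(omega * n + phase) for n in range(50)]
psi = teager_energy(tone)
```

For a pure tone A cos(omega n + phi), the operator returns the constant A^2 sin^2(omega), so the output depends on both amplitude and frequency. Energy separation methods exploit this property to recover the two separately.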


Cognitive Models of Speech Processing

Author: Gerry T. M. Altmann

Publisher: MIT Press

Published: 1995

Total Pages: 560

ISBN-13: 9780262510844

Cognitive Models of Speech Processing presents extensive reviews of current thinking on psycholinguistic and computational topics in speech recognition and natural-language processing, along with a substantial body of new experimental data and computational simulations. Topics range from lexical access and the recognition of words in continuous speech to syntactic processing and the relationship between syntactic and intonational structure. A Bradford Book. ACL-MIT Press Series in Natural Language Processing