Efficient Models of Intrinsic Variability in Speech Recognition and Speech Therapy
Author: Shou-Chun Yin
Published: 2014
"The objective of this thesis study is to develop statistical modeling techniques for characterizing phonetic variation in automatic speech recognition (ASR). One issue addressed in this domain is the reliable detection of phoneme-level mispronunciations in speech utterances that arise in speech therapy applications. Another issue addressed in this work is the ability of ASR systems to model the phonetic variation that often exists in speaker-independent recognition tasks. Both issues are treated as examples of the same basic problem: robustly modeling phonetic variability in ASR.

The technical contributions of this thesis that address these issues are as follows. First, a phoneme-level pronunciation verification (PV) scenario is investigated for detecting mispronunciations in speech utterances recorded from a population of impaired children with neuromuscular disorders. The well-known continuous density hidden Markov model (CDHMM) is used as a phoneme decoder, which generates a finite state network of phoneme string hypotheses for input speech utterances. Phoneme-level confidence measures can be constructed from this network, and a PV decision can be made by comparing the confidence measures with a pre-selected threshold. Some well-known state-of-the-art ASR techniques are incorporated in this PV scenario, and the experimental studies show how these techniques can impact the verification accuracy.

Second, the subspace Gaussian mixture model (SGMM) formalism is investigated. This acoustic model is shown to provide an efficient model of phonetic variability in speech. The experimental studies show that an 18.74% relative reduction in word error rate with respect to the well-known CDHMM acoustic model can be achieved on a medium vocabulary ASR task. Furthermore, it is demonstrated that a 24.79% relative reduction in phone error rate with respect to the CDHMM can be achieved for an unimpaired children's speech corpus.

Finally, the SGMM is incorporated into a new PV scenario. A new kind of pronunciation confidence measure, used for making mispronunciation verification decisions, is extracted directly from the state-level model parameters. Both session-level and utterance-level PV scenarios based on the SGMM-based confidence measures are proposed. In the session-level PV task, the equal error rate can be reduced by 15.35% when combining the SGMM-based confidence measures with the above phoneme decoder based confidence measures. In the utterance-level PV task, the equal error rate can be reduced by 12.94%. This equal error rate reduction is believed to result from an efficient characterization of pronunciation variation for each phoneme by the SGMM."
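The abstract describes a PV decision made by comparing a confidence measure with a pre-selected threshold, and it reports results as equal error rates and relative reductions. The following is a minimal sketch of those evaluation quantities only, not the thesis's implementation. The function names are hypothetical, and it assumes that higher confidence means "pronounced correctly" and that the equal error rate is approximated by sweeping the observed scores as candidate thresholds.

```python
def pv_decisions(confidences, threshold):
    """Accept a phoneme as correctly pronounced when its confidence
    reaches the threshold (assumed convention: higher = more confident)."""
    return [c >= threshold for c in confidences]


def error_rates(confidences, is_correct, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold.

    False acceptance: a mispronounced phoneme is accepted.
    False rejection: a correctly pronounced phoneme is rejected.
    Requires at least one correct and one mispronounced example.
    """
    decisions = pv_decisions(confidences, threshold)
    n_mis = sum(1 for ok in is_correct if not ok)
    n_ok = sum(1 for ok in is_correct if ok)
    false_accepts = sum(1 for d, ok in zip(decisions, is_correct) if d and not ok)
    false_rejects = sum(1 for d, ok in zip(decisions, is_correct) if not d and ok)
    return false_accepts / n_mis, false_rejects / n_ok


def approximate_eer(confidences, is_correct):
    """Approximate the equal error rate by trying each observed score as
    a threshold and keeping the one where the two error rates are closest.
    Returns (eer_estimate, threshold)."""
    best = None
    for t in sorted(set(confidences)):
        far, frr = error_rates(confidences, is_correct, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def relative_reduction(baseline, improved):
    """Relative reduction in percent, e.g. of a word error rate."""
    return 100.0 * (baseline - improved) / baseline
```

With a labeled development set, `approximate_eer(scores, labels)` gives one way to pick the pre-selected threshold. `relative_reduction` is the formula behind figures that the abstract states as relative, such as the word and phone error rate reductions.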