Dynamic Speech Models

Dynamic Speech Models

Author: Li Deng

Publisher: Morgan & Claypool Publishers

Published: 2006-12-01

Total Pages: 118

ISBN-13: 1598290657

DOWNLOAD EBOOK

Speech dynamics refer to the temporal characteristics in all stages of the human speech communication process. This speech “chain” starts with the formation of a linguistic message in a speaker's brain and ends with the arrival of the message in a listener's brain. Given the intricacy of the dynamic speech process and its fundamental importance in human communication, this monograph is intended to provide a comprehensive material on mathematical models of speech dynamics and to address the following issues: How do we make sense of the complex speech process in terms of its functional role of speech communication? How do we quantify the special role of speech timing? How do the dynamics relate to the variability of speech that has often been said to seriously hamper automatic speech recognition? How do we put the dynamic process of speech into a quantitative form to enable detailed analyses? And finally, how can we incorporate the knowledge of speech dynamics into computerized speech analysis and recognition algorithms? The answers to all these questions require building and applying computational models for the dynamic speech process. What are the compelling reasons for carrying out dynamic speech modeling? We provide the answer in two related aspects. First, scientific inquiry into the human speech code has been relentlessly pursued for several decades. As an essential carrier of human intelligence and knowledge, speech is the most natural form of human communication. Embedded in the speech code are linguistic (as well as para-linguistic) messages, which are conveyed through four levels of the speech chain. Underlying the robust encoding and transmission of the linguistic messages are the speech dynamics at all the four levels. Mathematical modeling of speech dynamics provides an effective tool in the scientific methods of studying the speech chain. Such scientific studies help understand why humans speak as they do and how humans exploit redundancy and variability by way of multitiered dynamic processes to enhance the efficiency and effectiveness of human speech communication. Second, advancement of human language technology, especially that in automatic recognition of natural-style human speech is also expected to benefit from comprehensive computational modeling of speech dynamics. The limitations of current speech recognition technology are serious and are well known. A commonly acknowledged and frequently discussed weakness of the statistical model underlying current speech recognition technology is the lack of adequate dynamic modeling schemes to provide correlation structure across the temporal speech observation sequence. Unfortunately, due to a variety of reasons, the majority of current research activities in this area favor only incremental modifications and improvements to the existing HMM-based state-of-the-art. For example, while the dynamic and correlation modeling is known to be an important topic, most of the systems nevertheless employ only an ultra-weak form of speech dynamics; e.g., differential or delta parameters. Strong-form dynamic speech modeling, which is the focus of this monograph, may serve as an ultimate solution to this problem. After the introduction chapter, the main body of this monograph consists of four chapters. They cover various aspects of theory, algorithms, and applications of dynamic speech models, and provide a comprehensive survey of the research work in this area spanning over past 20~years. This monograph is intended as advanced materials of speech and signal processing for graudate-level teaching, for professionals and engineering practioners, as well as for seasoned researchers and engineers specialized in speech processing


Segmental Models with an Exploration of Acoustic and Lexical Grouping in Automatic Speech Recognition

Segmental Models with an Exploration of Acoustic and Lexical Grouping in Automatic Speech Recognition

Author: Yanzhang He

Publisher:

Published: 2015

Total Pages: 150

ISBN-13:

DOWNLOAD EBOOK

fDeveloping automatic speech recognition (ASR) technologies is of significant importance to facilitate human-machine interactions. The main challenges for ASR development have been centered around designing appropriate statistical models and the modeling targets associated with them. These challenges exist throughout the ASR probabilistic transduction pipeline that aggregrates information from the bottom up: in acoustic modeling, hidden Markov models (HMMs) are used to map the observed speech signals to phoneme targets in a frame-by-frame fashion, suffering from the well-known frame conditional independence assumption and the inability to integrate long-span features; in lexical modeling, phonemes are grouped into a sequence of vocabulary words as a meaningful sentence, where out-of-vocabulary (OOV) words cannot be easily accounted for. The main goal of this dissertation is to apply segmental models - a family of structured prediction models for sequence segmentation and labeling - to tackle these problems by introducing innovative intermediate-level structures into the ASR pipeline via acoustic and lexical grouping.


Mathematical Foundations of Speech and Language Processing

Mathematical Foundations of Speech and Language Processing

Author: Mark Johnson

Publisher: Springer Science & Business Media

Published: 2012-12-06

Total Pages: 292

ISBN-13: 1441990178

DOWNLOAD EBOOK

Speech and language technologies continue to grow in importance as they are used to create natural and efficient interfaces between people and machines, and to automatically transcribe, extract, analyze, and route information from high-volume streams of spoken and written information. The workshops on Mathematical Foundations of Speech Processing and Natural Language Modeling were held in the Fall of 2000 at the University of Minnesota's NSF-sponsored Institute for Mathematics and Its Applications, as part of a "Mathematics in Multimedia" year-long program. Each workshop brought together researchers in the respective technologies on the one hand, and mathematicians and statisticians on the other hand, for an intensive week of cross-fertilization. There is a long history of benefit from introducing mathematical techniques and ideas to speech and language technologies. Examples include the source-channel paradigm, hidden Markov models, decision trees, exponential models and formal languages theory. It is likely that new mathematical techniques, or novel applications of existing techniques, will once again prove pivotal for moving the field forward. This volume consists of original contributions presented by participants during the two workshops. Topics include language modeling, prosody, acoustic-phonetic modeling, and statistical methodology.


Efficient Models of Intrinsic Variability in Speech Recognition and Speech Therapy

Efficient Models of Intrinsic Variability in Speech Recognition and Speech Therapy

Author: Shou-Chun Yin

Publisher:

Published: 2014

Total Pages:

ISBN-13:

DOWNLOAD EBOOK

"The objective of this thesis study is to develop statistical modeling techniques for characterizing phonetic variation in automatic speech recognition (ASR). One issue addressed in this domain is to reliably detect the phoneme level mispronunciations in speech utterances that arise from speech therapy applications. Another issue addressed in this work is to study the ability of ASR systems to model the phonetic variation that often exists in speaker-independent recognition tasks. Both issues will be treated as examples of the same basic problem in robustly modeling phonetic variability in ASR.The technical contributions involved in this thesis which address these issues are presented as follows. First, a phoneme level pronunciation verification (PV) scenario is investigated for detecting the mispronunciation occurrences in speech utterances recorded from a population of impaired children with neuromuscular disorders. The well known continuous density hidden Markov model (CDHMM) is used as a phoneme decoder which generates a finite state network of phoneme string hypotheses for input speech utterances.The phoneme level confidence measures can be constructed from this network, and PV decision can be made by comparing the confidence measures with a pre-selected threshold. Some well known state-of-the-art ASR techniques are incorporated in this PV scenario, and the experimental studies show how these techniques can impact the verification accuracy.Second, the subspace Gaussian mixture model (SGMM) formalism is investigated. This acoustic model is shown to provide an efficient model of phonetic variability in speech. In the experimental studies, it can be shown that a 18.74% relative reduction in word error rate with respect to the well known CDHMM acoustic model can be achieved on a medium vocabulary ASR task. Furthermore, it is demonstrated that a 24.79% relative reduction in phone error rate with respect to the CDHMM can be achieved for an unimpaired children speech corpus.Finally, the SGMM is incorporated into a new PV scenario. A new kind of pronunciation confidence measure used for making mispronunciation verification decisions is extracted directly from the state level model parameters. Both session level and utterance level PV scenarios based on the SGMM based confidence measures are proposed. In the session level PV task, the equal error rate can be reduced by 15.35% when combining the SGMM based confidence measures with the above phoneme decoder based confidence measures. In the utterance level PV task, the equal error rate can be reduced by 12.94%. This equal error rate reduction is believed to result from an efficient characterization of pronunciation variation for each phoneme by the SGMM." --


Computing PROSODY

Computing PROSODY

Author: Yoshinori Sagisaka

Publisher: Springer Science & Business Media

Published: 2012-12-06

Total Pages: 405

ISBN-13: 1461222583

DOWNLOAD EBOOK

This book presents a collection of papers from the Spring 1995 Work shop on Computational Approaches to Processing the Prosody of Spon taneous Speech, hosted by the ATR Interpreting Telecommunications Re search Laboratories in Kyoto, Japan. The workshop brought together lead ing researchers in the fields of speech and signal processing, electrical en gineering, psychology, and linguistics, to discuss aspects of spontaneous speech prosody and to suggest approaches to its computational analysis and modelling. The book is divided into four sections. Part I gives an overview and theoretical background to the nature of spontaneous speech, differentiating it from the lab-speech that has been the focus of so many earlier analyses. Part II focuses on the prosodic features of discourse and the structure of the spoken message, Part ilIon the generation and modelling of prosody for computer speech synthesis. Part IV discusses how prosodic information can be used in the context of automatic speech recognition. Each section of the book starts with an invited overview paper to situate the chapters in the context of current research. We feel that this collection of papers offers interesting insights into the scope and nature of the problems concerned with the computational analysis and modelling of real spontaneous speech, and expect that these works will not only form the basis of further developments in each field but also merge to form an integrated computational model of prosody for a better understanding of human processing of the complex interactions of the speech chain.