Multimodal Scene Understanding

Author: Michael Yang

Publisher: Academic Press

Published: 2019-07-16

Total Pages: 422

ISBN-13: 0128173599

Multimodal Scene Understanding: Algorithms, Applications and Deep Learning presents recent advances in multi-modal computing, with a focus on computer vision and photogrammetry. It provides the latest algorithms and applications that combine multiple sources of information, and describes the role and approaches of multi-sensory data and multi-modal deep learning. The book is ideal for researchers in computer vision, remote sensing, robotics, and photogrammetry, helping foster interdisciplinary interaction and collaboration between these fields. Researchers collecting and analyzing multi-sensory data (for example, the KITTI stereo-plus-laser benchmark) from platforms such as autonomous vehicles, surveillance cameras, UAVs, planes, and satellites will find this book very useful.

- Contains state-of-the-art developments in multi-modal computing
- Focuses on algorithms and applications
- Presents novel deep learning topics on multi-sensor fusion and multi-modal deep learning


Multimodal Learning toward Micro-Video Understanding

Author: Liqiang Nie

Publisher: Springer Nature

Published: 2022-05-31

Total Pages: 170

ISBN-13: 3031022556

Micro-videos, a new form of user-generated content, have been spreading widely across various social platforms, such as Vine, Kuaishou, and TikTok. Different from traditional long videos, micro-videos are usually recorded by smart mobile devices at any place within a few seconds. Due to their brevity and low bandwidth cost, micro-videos are gaining increasing user enthusiasm. The blossoming of micro-videos opens the door to many promising applications, ranging from network content caching to online advertising. Thus, it is highly desirable to develop an effective scheme for high-order micro-video understanding. Micro-video understanding is, however, non-trivial due to the following challenges: (1) how to represent micro-videos that convey only one or a few high-level themes or concepts; (2) how to utilize the hierarchical structure of the venue categories to guide micro-video analysis; (3) how to alleviate the influence of low quality caused by complex surrounding environments and camera shake; (4) how to model the multimodal sequential data, i.e., textual, acoustic, visual, and social modalities, to enhance micro-video understanding; and (5) how to construct large-scale benchmark datasets for the analysis. These challenges have been largely unexplored to date. In this book, we focus on addressing these challenges by proposing several state-of-the-art multimodal learning theories. To demonstrate the effectiveness of these models, we apply them to three practical tasks of micro-video understanding: popularity prediction, venue category estimation, and micro-video routing. In particular, we first build three large-scale real-world micro-video datasets for these practical tasks. We then present a multimodal transductive learning framework for micro-video popularity prediction.
Furthermore, we introduce several multimodal cooperative learning approaches and a multimodal transfer learning scheme for micro-video venue category estimation. Meanwhile, we develop a multimodal sequential learning approach for micro-video recommendation. Finally, we conclude the book and outline future research directions in multimodal learning toward micro-video understanding.


Multi-modal Deep Learning to Understand Vision and Language

Author: Shagan Sah

Publisher:

Published: 2018

Total Pages: 138

ISBN-13:

"Developing intelligent agents that can perceive and understand the rich visual world around us has been a long-standing goal in the field of artificial intelligence. In the last few years, significant progress has been made towards this goal, and recent remarkable advances in general visual and language understanding have been attributed to deep learning. Convolutional neural networks have been used to learn image representations, while recurrent neural networks have demonstrated the ability to generate text from visual stimuli. In this thesis, we develop methods and techniques using hybrid convolutional and recurrent neural network architectures that connect visual data and natural language utterances. To present these methods, this work is divided into two broad parts. First, we introduce a general-purpose attention mechanism modeled using a continuous function for video understanding. The use of an attention-based hierarchical approach along with automatic boundary detection advances state-of-the-art video captioning results. We also develop techniques for summarizing and annotating long videos. In the second part, we introduce architectures along with training techniques to produce a common connection space where natural language sentences are efficiently and accurately connected with visual modalities. In this connection space, similar concepts lie close, while dissimilar concepts lie far apart, irrespective of their modality. We discuss four modality transformations: visual to text, text to visual, visual to visual, and text to text. We introduce a novel attention mechanism to align multi-modal embeddings, which are learned through a multi-modal metric loss function. The common vector space is shown to enable bidirectional generation of images and text. The learned common vector space is evaluated on multiple image-text datasets for cross-modal retrieval and zero-shot retrieval.
The models are shown to advance the state-of-the-art on tasks that require joint processing of images and natural language."--Abstract.
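A common space where matching image-text pairs lie close and mismatched pairs lie far apart is typically trained with a margin-based metric loss. The following is a minimal NumPy sketch of that idea, not the thesis's actual model; the 4-d toy embeddings, function name, and margin value are all illustrative assumptions:

```python
import numpy as np

def triplet_metric_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style metric loss: pull a matching pair together and push a
    mismatched pair apart by at least `margin` in the shared space."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to the matching caption
    d_neg = np.linalg.norm(anchor - negative)  # distance to a mismatched caption
    return max(0.0, d_pos - d_neg + margin)

# Toy 4-d embeddings, assumed already projected into the common space.
img = np.array([1.0, 0.0, 0.0, 0.0])
txt_match = np.array([0.9, 0.1, 0.0, 0.0])  # similar concept -> small distance
txt_other = np.array([0.0, 0.0, 1.0, 0.0])  # dissimilar concept -> large distance

loss = triplet_metric_loss(img, txt_match, txt_other)
```

When the matching caption is already closer than the mismatched one by more than the margin, the loss is zero; otherwise the gradient of the loss pulls the pair into the desired arrangement, irrespective of modality.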


Multimodal Deep Learning Methods for Person Annotation in Video Sequences

Author: David Rodríguez Navarro

Publisher:

Published: 2017

Total Pages:

ISBN-13:

Unsupervised identity recognition in video sequences is a very active field of research in computer vision, and the use of convolutional neural networks (CNNs) is currently gaining interest due to the strong results these techniques have shown on face recognition and verification problems in recent years. In this thesis, a CNN used for face verification is improved in the context of an unsupervised identity annotation system developed for the MediaEval 2016 task. This improvement is achieved by training the 2016 CNN architecture with images from the task database, which is now possible since the outputs of the last system version can be used, along with a data augmentation method applied to the previously extracted samples. In addition, a new multimodal verification system is implemented that merges visual and audio feature vectors. The margin of improvement that these techniques introduce in the whole system is evaluated against the state of the art. Finally, conclusions are drawn from the obtained results, along with possible future lines of work.
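One simple way to merge visual and audio feature vectors for verification is late fusion: normalize each modality's vector, concatenate them, and compare fused vectors by cosine similarity. This is a minimal NumPy sketch of that general pattern, assuming toy 2-d features and an arbitrary threshold, not the system described in the thesis:

```python
import numpy as np

def l2_normalize(v):
    """Scale a feature vector to unit length so no modality dominates."""
    return v / np.linalg.norm(v)

def fuse(visual, audio):
    """Late fusion: concatenate the normalized per-modality features."""
    return np.concatenate([l2_normalize(visual), l2_normalize(audio)])

def same_identity(a, b, threshold=0.8):
    """Verify two fused vectors by cosine similarity against a threshold."""
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sim >= threshold

# Toy fused descriptors for two video tracks.
person_a = fuse(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
person_b = fuse(np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

Normalizing each modality before concatenation keeps the two feature spaces on a comparable scale, so neither the face descriptor nor the audio descriptor dominates the similarity score.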


Multimodal Video Characterization and Summarization

Author: Michael A. Smith

Publisher: Springer Science & Business Media

Published: 2005-12-17

Total Pages: 214

ISBN-13: 0387230084

Multimodal Video Characterization and Summarization is a valuable research tool for both professionals and academicians working in the video field. This book describes the methodology for using multimodal audio, image, and text technology to characterize video content. This new and groundbreaking science has led to many advances in video understanding, such as the development of a video summary. Applications and methodology for creating video summaries are described, as well as user-studies for evaluation and testing.


Secure System Design and Trustable Computing

Author: Chip-Hong Chang

Publisher: Springer

Published: 2015-09-17

Total Pages: 537

ISBN-13: 3319149717

This book provides the foundations for understanding hardware security and trust, which have become major concerns for national security over the past decade. Coverage includes issues related to security and trust in a variety of electronic devices and systems, spanning the security of hardware, firmware and software, system applications, online transactions and networking services. It serves as an invaluable reference to state-of-the-art research that is of critical significance to the security of, and trust in, modern society's microelectronic-supported infrastructures.


Learning Video Representation from Self-supervision

Author: Brian Chen

Publisher:

Published: 2023

Total Pages: 0

ISBN-13:

This thesis investigates the problem of learning video representations for video understanding. Previous works have explored data-driven deep learning approaches, which have been shown to be effective in learning useful video representations. However, obtaining large amounts of labeled data can be costly and time-consuming. We investigate self-supervised approaches for multimodal video data to overcome this challenge. Video data typically contains multiple modalities, such as visual, audio, transcribed speech, and textual captions, which can serve as pseudo-labels for representation learning without needing manual labeling. By utilizing these modalities, we can train deep representations over large-scale video data consisting of millions of video clips collected from the internet. We demonstrate the scalability benefits of multimodal self-supervision by achieving new state-of-the-art performance in various domains, including video action recognition, text-to-video retrieval, and text-to-video grounding.
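Using co-occurring modalities as pseudo-labels is commonly formulated as a batch-wise contrastive objective: a clip and its own transcript or caption form a positive pair, and the other clips in the batch serve as negatives. The following is a minimal NumPy sketch of such an InfoNCE-style loss; the temperature value and toy batch are illustrative assumptions, not the models from the thesis:

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Contrastive loss over a batch: each video's own transcript/caption is
    its pseudo-label; matching pairs sit on the diagonal of the similarity
    matrix, and every off-diagonal entry acts as a negative."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) cosine similarities
    # Log-softmax over each row, then take the diagonal (true pairs).
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # maximize diagonal probability
```

With well-aligned embeddings the diagonal dominates each row and the loss is near zero; shuffling the text batch destroys the pairing and the loss grows, which is exactly the signal that trains the representation without manual labels.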


Remote Sensing Imagery

Author: Florence Tupin

Publisher: John Wiley & Sons

Published: 2014-02-19

Total Pages: 277

ISBN-13: 1118898923

Dedicated to remote sensing images, from their acquisition to their use in various applications, this book covers the global lifecycle of images, including sensors and acquisition systems, applications such as movement monitoring or data assimilation, and image and data processing. It is organized in three main parts. The first part presents technological information about remote sensing (choice of satellite orbit and sensors) and elements of physics related to sensing (optics and microwave propagation). The second part presents image processing algorithms and their specificities for radar or optical, multi- and hyper-spectral images. The final part is devoted to applications: change detection and analysis of time series, elevation measurement, displacement measurement and data assimilation. Offering a comprehensive survey of the domain of remote sensing imagery with a multi-disciplinary approach, this book is suitable for graduate students and engineers with backgrounds either in computer science and applied math (signal and image processing) or geophysics.

About the Authors

Florence Tupin is Professor at Telecom ParisTech, France. Her research interests include remote sensing imagery, image analysis and interpretation, three-dimensional reconstruction, and synthetic aperture radar, especially for urban remote sensing applications.

Jordi Inglada works at the Centre National d'Études Spatiales (French Space Agency), Toulouse, France, in the field of remote sensing image processing at the CESBIO laboratory. He is in charge of the development of image processing algorithms for the operational exploitation of Earth observation images, mainly in the field of multi-temporal image analysis for land use and cover change.

Jean-Marie Nicolas is Professor at Telecom ParisTech in the Signal and Imaging department. His research interests include the modeling and processing of synthetic aperture radar images.