Building and Using Comparable Corpora

Author: Serge Sharoff

Publisher: Springer Science & Business Media

Published: 2013-12-13

Total Pages: 333

ISBN-10: 3642201288

The 1990s saw a paradigm change in the use of corpus-driven methods in NLP. In the field of multilingual NLP (such as machine translation and terminology mining) this implied the use of parallel corpora. However, parallel resources are relatively scarce: many more texts are produced daily by native speakers of any given language than translated. This situation resulted in a natural drive towards the use of comparable corpora, i.e. non-parallel texts in the same domain or genre. Nevertheless, this research direction has not produced a single authoritative source suitable for researchers and students coming to the field. The proposed volume provides a reference source, identifying the state of the art in the field as well as future trends. The book is intended for specialists and students in natural language processing, machine translation and computer-assisted translation.


Building and Using Comparable Corpora for Multilingual Natural Language Processing

Author: Serge Sharoff

Publisher: Springer Nature

Published: 2023-08-23

Total Pages: 138

ISBN-10: 3031313844

This book provides a comprehensive overview of methods to build comparable corpora and of their applications, including machine translation, cross-lingual transfer, and various kinds of multilingual natural language processing. The authors begin with a brief history of the topic, followed by a comparison to parallel resources and an explanation of why comparable corpora have become more widely used. In particular, comparable corpora provide the basis for the multilingual capabilities of pre-trained models, such as BERT or GPT. The book then focuses on building comparable corpora, aligning their sentences to create a database of suitable translations, and using these sentence translations to produce dictionaries and term banks. Finally, it explains how comparable corpora can be used to build machine translation engines and to develop a wide variety of multilingual applications.


Using Comparable Corpora for Under-Resourced Areas of Machine Translation

Author: Inguna Skadiņa

Publisher: Springer

Published: 2019-02-06

Total Pages: 326

ISBN-10: 3319990047

This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data that can be used for the machine translation task. It is divided into several sections, each covering a specific task such as building, processing, and using comparable corpora, focusing particularly on under-resourced language pairs and domains. The book is intended for anyone interested in data-driven machine translation for under-resourced languages and domains, especially for developers of machine translation systems, computational linguists and language workers. It offers a valuable resource for specialists and students in natural language processing, machine translation, corpus linguistics and computer-assisted translation, and promotes the broader use of comparable corpora in natural language processing and computational linguistics.


Corpus Analysis for Language Studies at the University Level

Author: Giedrė Valūnaitė Oleškevičienė

Publisher: Cambridge Scholars Publishing

Published: 2021-02-08

Total Pages: 176

ISBN-10: 1527565947

This book highlights corpora use in teaching foreign languages in university education. It will appeal to both academics and practitioners interested in the process of teaching foreign languages at more advanced levels while applying corpus analysis and building tools for corpus annotation. It provides a detailed case study of analyzing the terminology of constitutional law in both English and Lithuanian as an example to illustrate the possibility of integrating corpus analysis tools into the process of teaching foreign languages in university education. The book reveals that initial linguistic knowledge is essential when teaching and learning foreign languages at more advanced levels while applying corpus annotation. In addition, it shows that, even though the use of new corpus software is perceived as a positive, there are still certain issues to be solved in this regard, such as the constant renewal of public computers in universities and the technical and methodological support for teachers while using corpora tools.


Investigating Wikipedia

Author: Céline Poudat

Publisher: John Benjamins Publishing Company

Published: 2024-11-15

Total Pages: 272

ISBN-10: 9027246467

The present volume is intended as a reference book on Wikipedia corpus studies, from corpus construction to exploration and analysis. Wikipedia is a complex object, difficult to manipulate for linguists and corpus researchers. In addition to the encyclopedic articles consulted by millions of users, it contains vast spaces of written discussions, aka talk pages, where Wikipedia authors negotiate the collaborative editing of articles, make evaluations, or discuss related topics. The proposed volume covers Wikipedia articles, their revision histories, and discussions, with a focus on discussions, which have not been studied extensively so far and have also been neglected in previous corpus building efforts. Wikipedia discussions are instances of computer-mediated communication (CMC), thus constituting a completely different, interaction-oriented linguistic genre. Sophisticated tools and methods of linguistic annotation and corpus exploration are needed to exploit the huge and valuable corpus resources that can be constructed from the Wikipedia discussions. The present volume aims at encouraging and facilitating Wikipedia corpus studies, providing standards, recommendations, and innovative methods to build and explore Wikipedia corpora, and presenting corpus studies that make the most of the peculiarities of Wikipedia.


Advances in Natural Language Processing

Author: Hitoshi Isahara

Publisher: Springer

Published: 2012-10-22

Total Pages: 343

ISBN-10: 3642339832

This book constitutes the refereed proceedings of the 8th International Conference on Advances in Natural Language Processing, JapTAL 2012, Kanazawa, Japan, in October 2012. The 27 revised full papers and 5 revised short papers presented were carefully reviewed and selected from 42 submissions. The papers are organized in topical sections on machine translation, multilingual issues, resources, semantic analysis, sentiment analysis, as well as speech and generation.


Document Analysis and Recognition – ICDAR 2023 Workshops

Author: Mickael Coustaty

Publisher: Springer Nature

Published: 2023-08-14

Total Pages: 344

ISBN-10: 3031414985

This two-volume set LNCS 14193-14194 constitutes the proceedings of International Workshops co-located with the 17th International Conference on Document Analysis and Recognition, ICDAR 2023, held in San José, CA, USA, during August 21–26, 2023. A total of 43 regular papers presented in this book were carefully selected from 60 submissions. Part I contains 22 regular papers that stem from the following workshops: ICDAR 2023 Workshop on Computational Paleography (IWCP); ICDAR 2023 Workshop on Camera-Based Document Analysis and Recognition (CBDAR); ICDAR 2023 International Workshop on Graphics Recognition (GREC); ICDAR 2023 Workshop on Automatically Domain-Adapted and Personalized Document Analysis (ADAPDA). Part II contains 21 regular papers that stem from the following workshops: ICDAR 2023 Workshop on Machine Vision and NLP for Document Analysis (VINALDO); ICDAR 2023 International Workshop on Machine Learning (WML).


Healthcare Data Analytics

Author: Chandan K. Reddy

Publisher: CRC Press

Published: 2015-06-23

Total Pages: 756

ISBN-10: 148223212X

At the intersection of computer science and healthcare, data analytics has emerged as a promising tool for solving problems across many healthcare-related disciplines. Supplying a comprehensive overview of recent healthcare analytics research, Healthcare Data Analytics provides a clear understanding of the analytical techniques currently available


Neural Machine Translation

Author: Philipp Koehn

Publisher: Cambridge University Press

Published: 2020-06-18

Total Pages: 409

ISBN-10: 1108497322

Learn how to build machine translation systems with deep learning from the ground up, from basic concepts to cutting-edge research.


Comparable Corpora and Computer-assisted Translation

Author: Estelle Maryline Delpech

Publisher: John Wiley & Sons

Published: 2014-07-22

Total Pages: 221

ISBN-10: 1119002702

Computer-assisted translation (CAT) has always used translation memories, which require the translator to have a corpus of previous translations that the CAT software can use to generate bilingual lexicons. This can be problematic when the translator does not have such a corpus, for instance, when the text belongs to an emerging field. To solve this issue, CAT research has looked into the leveraging of comparable corpora, i.e. a set of texts, in two or more languages, which deal with the same topic but are not translations of one another. This work has two primary objectives. The first is to assess the input of lexicons extracted from comparable corpora in the context of a specialized human translation task. The second is to identify the bilingual-lexicon-extraction methods that best match translators' needs, determining the current limits of these techniques and suggesting improvements. The author focuses, in particular, on the identification of fertile translations, the management of multiple morphological structures, and the ranking of candidate translations. The experiments are carried out on two language pairs (English–French and English–German) and on specialized texts dealing with breast cancer. This research puts significant emphasis on applicability: methodological choices are guided by the needs of the final users. The book is organized in two parts: the first part presents the applicative and scientific context of the research, and the second part is given over to efforts to improve compositional translation. The research work presented in this book received the PhD Thesis award 2014 from the French association for natural language processing (ATALA).