Multi-CAST: Multilingual corpus of annotated spoken texts
Contributor(s):
Publisher Information:
Otto-Friedrich-Universität Bamberg
Year of publication:
2023
Language:
Multilingual/Other
has parts:
Abstract:
Multi-CAST, the Multilingual Corpus of Annotated Spoken Texts (Haig & Schnell 2015), is a collection of annotated spoken-language corpora from a typologically diverse set of languages. Most of the data stem from documentation projects undertaken on lesser-researched and endangered languages. The texts are overwhelmingly unscripted, non-elicited, monologic narratives.
Each corpus in the collection is an individually citable resource that was contributed by experts on the respective languages in cooperation with the collection editors. The Multi-CAST collection as a whole was designed and compiled by Geoffrey Haig and Stefan Schnell with the assistance of Nils Schiborr, and is to date the only freely available, multilingual, spoken-language corpus that combines morphological and morphosyntactic glossing with annotation of discourse referents. Each Multi-CAST corpus includes audio recordings (as WAV and MP3 files; archived separately, see below), annotation files in a number of file formats (including EAF files for use with the free linguistic annotation software ELAN, as well as TSV and XML files), metadata on the speakers and texts, and documentation on the language, speech communities, recording situations, and analytical decisions pertinent to the annotations.
The annotation files use a multi-tier structure built on a time-aligned segmentation of the text into utterance units, from which derive a transcription and idiomatic English translation. Utterance units are segmented further into grammatical words with morphological glossing (following the Leipzig Glossing Rules) and annotations with the GRAID (Grammatical Relations and Animacy in Discourse, Haig & Schnell 2014) and RefIND (Referent Indexing in Natural Language Discourse, Schiborr et al. 2018) annotation schemes. Further information on the contents of the collection and the structure of the annotations can be found in the Multi-CAST collection overview (Schiborr 2023), which is included in this archive. The multicastR package (Schiborr 2018) provides a simple interface for directly accessing the Multi-CAST annotation data through the statistical computing language R.
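As a rough illustration of the multi-tier structure described above, the following Python sketch reads a tab-separated annotation table and groups word-level rows under their utterance unit. It is only a sketch under stated assumptions: the column names (text, uid, gword, gloss, graid, refind) and the sample rows are invented for illustration and may not match the actual TSV files in the archive, so the collection overview should be consulted for the real layout.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows for illustration only; not real corpus data.
# Column names (text, uid, gword, gloss, graid, refind) are assumptions.
SAMPLE_TSV = (
    "text\tuid\tgword\tgloss\tgraid\trefind\n"
    "demo\t0001\twoman\twoman\tnp.h:s\t001\n"
    "demo\t0001\tcame\tcome.PST\tv:pred\t\n"
    "demo\t0002\t\t\t0.h:s\t001\n"
    "demo\t0002\tsat\tsit.PST\tv:pred\t\n"
)


def group_by_utterance(tsv_text):
    """Group word-level rows under their (text, utterance id) key."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    units = defaultdict(list)
    for row in reader:
        units[(row["text"], row["uid"])].append(row)
    return dict(units)


units = group_by_utterance(SAMPLE_TSV)
for (text, uid), rows in units.items():
    # Show the GRAID-style annotation carried by each word in the unit.
    print(text, uid, [row["graid"] for row in rows])
```

In this sketch, the second utterance unit starts with a row that has an empty word form but still carries a grammatical annotation and a referent index, which is one plausible way a zero-expressed argument could appear in a word-per-row table.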
This archive contains version 2311 of the Multi-CAST collection (originally published in November 2023) and comprises data from 18 languages, encompassing around 19 hours of recordings, 29000 clause units, and 140000 words across 136 individual texts. The audio files accompanying these data sets have been archived separately; they can be found via the links in the list below.
Citation for the entire Multi-CAST collection:
Haig, Geoffrey & Schnell, Stefan (eds.). 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (DOI: 10.48564/unibafd-q1h0x-3kf71)
Type:
Collection
Keywords:
spoken language corpus
Format:
audio/mpeg
audio/wav
Permalink
https://fis.uni-bamberg.de/handle/uniba/97633