Comparing Standard Reference Corpora and Google Books Ngrams : Strengths, Limitations and Synergies in the Contrastive Study of Variable h- in British and American English
Faculty/Professorship: | English and Historical Linguistics |
Author(s): | Sönning, Lukas |
Publisher Information: | Bamberg : Otto-Friedrich-Universität |
Year of publication: | 2023 |
Pages: | 17-45 |
Source/Other editions: | Data and Methods in Corpus Linguistics : Comparative Approaches / Schützler, Ole, Schlüter, Julia (eds.). - Cambridge ; New York : Cambridge University Press, 2022, pp. 17-45. - ISBN: 9781108499644 |
is version of: | 10.1017/9781108589314.002 |
Year of first publication: | 2022 |
Language(s): | English |
Licence: | German Act on Copyright |
URN: | urn:nbn:de:bvb:473-irb-595434 |
Abstract: | This chapter is based on two standard reference corpora, the British National Corpus and the Corpus of Contemporary American English, as opposed to the multi-billion-word database of Google Books Ngrams, which has, despite its allure, not been used in many systematic linguistic studies so far. Focusing on indefinite article allomorphy (a vs an) as an orthographic cue to the phonological strength of ‹h›-onsets in British and American English, the size advantage of the Ngrams database expectedly plays out in larger type and token counts, more stable estimates and fewer distortions due to data sparsity. However, as metadata are extremely limited (to year and variety), a fully accountable analysis is not feasible. The case study illustrates how richly annotated corpora can shed light on potential disturbances arising from two sources: genre differences and between-author variability. A sensitivity analysis offers some degree of reassurance when extending the analysis to the Ngrams database. In this way, the authors demonstrate that the strengths and limitations of corpora and big data resources can, with due caution, be counterbalanced to answer questions of linguistic interest. |
GND Keywords: | Englisch; Amerikanisches Englisch; Korpus <Linguistik>; Artikel <Linguistik>; Kontrastive Phonologie |
Keywords: | Google Books Ngrams, big data, metadata, type frequency, token frequency, hierarchical data structure, corpus comparability, data quality |
DDC Classification: | 420 English |
RVK Classification: | HF 450 |
Type: | Contribution to an Article Collection |
URI: | https://fis.uni-bamberg.de/handle/uniba/59543 |
Release Date: | 25 May 2023 |
File | Size | Format |
---|---|---|
fisba59543.pdf | 1.77 MB | PDF |

originated at the
University of Bamberg