Difference between revisions of "Category:Text corpora"
Revision as of 18:35, 27 March 2023
A text corpus is a large collection of texts in a given language, often as diverse as possible. Text corpora are very useful for checking how often a given expression is used in a language, and which expressions are rare or never occur, so they can be a great help to learners who want to check whether a certain construction sounds natural. They are also extremely useful for research into a language's grammar and how it changes over time. Many corpora also contain additional information about individual words, such as their part of speech and, where applicable, inflectional information like grammatical case, gender, tense, or mood.
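As a rough illustration of the frequency-checking idea, here is a minimal Python sketch that counts how often candidate expressions occur in a small collection of texts. The function name `count_phrases` and the two sample sentences are invented for this example; none of the corpora listed on this page is queried this way, and real corpora are searched through their own interfaces.

```python
import re
from collections import Counter


def count_phrases(texts, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase across texts."""
    counts = Counter()
    for phrase in phrases:
        # (?<!\w) and (?!\w) stop the phrase from matching inside a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE
        )
        counts[phrase] = sum(len(pattern.findall(text)) for text in texts)
    return counts


# Toy "corpus": two sentences made up for illustration only.
texts = [
    "Tá an leabhar ar an mbord.",
    "Tá an cat ar an mbord agus tá an madra faoin mbord.",
]

counts = count_phrases(texts, ["ar an mbord", "ar an bord"])
for phrase, n in counts.items():
    print(f"{phrase!r}: {n}")
```

A count of zero in such a tiny sample says very little; the comparison only becomes informative with a large and varied corpus.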
Below is a list of useful publicly available corpora for Celtic languages.
Brittonic
Breton
Cornish
Welsh
Goidelic
Irish
- Nua-Chorpas na hÉireann (New Corpus for Ireland), also known as the Foclóir.ie corpus
  - requires registration (free of charge)
  - all kinds of modern texts (fiction, poetry, official documents, newspaper articles, etc. from the 20th and 21st centuries) written by both native and non-native speakers
  - allows filtering by text type (native/non-native, specific dialect)
  - part-of-speech tagged
  - uses Sketch Engine
  - alternative new interface at focloir.sketchengine.eu
  - See Irish/Using Nua-Chorpas na hÉireann for some additional tips
- Historical Irish Corpus, also known as the RIA corpus
  - publicly available
  - literary texts composed in Irish between 1600 and 1926
  - part-of-speech tagged but with limited search functionality
  - allows searching for words in their original spelling and by their modern standardized forms
Manx
- publicly available
- over 400 texts dating from 1610 to the present, accompanied by English translations
- focus on pre-1908 native Manx literature, with the aim of storing everything written in Manx before 1908
- open source: the search interface software is hosted at https://github.com/david-allison/manx-corpus-search, the corpus data at https://github.com/david-allison/manx-search-data
- if you need assistance, have a feature request, etc., you may contact the corpus's maintainer on GitHub or on the Celtic Languages Discord (Discord handle DavidA#0813)
Scottish Gaelic
- publicly available
- contains literary texts in Scottish Gaelic from the 12th to the 21st century (mostly modern, 18th–21st c. texts; might be extended in the future with transcriptions of spoken language)
- not part-of-speech tagged
- based on Corpus Workbench (CWB) with a modified CQPweb interface
- allows filtering by time period, geographical origin, particular text
- allows complex queries using CQP syntax (but these are limited by the lack of POS tagging)
- See Gaelic/Using Corpas na Gàidhlig for some additional tips
Classical Gaelic
- corpus of Classical Gaelic bardic poetry
- part-of-speech tagged (although the tagging is very imperfect, since it is based on a tagging method for Modern Irish)
- uses non-normalized spelling (so finding a given form can sometimes be difficult)
- based on the Bardic Poetry Database
- uses Sketch Engine
- Historical Irish Corpus (RIA corpus) – mainly a Modern Irish corpus, but also useful for Early Modern texts and sometimes bardic poetry, since it contains texts from 1600 onwards; see under Irish