Category:Text corpora

A text corpus is a large collection of texts in a given language, often as diverse as possible. Corpora are very useful for checking how often a given expression is used in a language, and which expressions are rare or never occur, so they can be a great help to learners who want to check whether a certain construction sounds natural. They are also extremely useful for research on a language's grammar and how it changes over time. Many corpora also contain additional information about individual words, such as their part of speech and, where applicable, inflectional information like grammatical case, gender, tense, or mood.
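As a rough illustration of what "checking how often an expression is used" means, here is a minimal Python sketch that counts a phrase in a tiny in-memory sample and normalizes the count per million tokens. The sample sentences, the tokenizer, and the function names (`tokenize`, `phrase_count`, `per_million`) are all made up for this example; the corpora listed below do this at scale through their own search interfaces, with far better tokenization.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    A deliberately naive tokenizer: runs of letters (accented ones included),
    optionally joined by an apostrophe or a hyphen.
    """
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())

def phrase_count(tokens, phrase):
    """Count how many times the tokenized phrase occurs in the token list."""
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

def per_million(count, total_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / total_tokens if total_tokens else 0.0

# Hypothetical toy sample, invented for this example (not taken from any corpus).
sample = "Tá an lá go maith. Tá an oíche fuar. Bhí an lá fada."
tokens = tokenize(sample)

hits = phrase_count(tokens, "an lá")
print(hits, len(tokens), per_million(hits, len(tokens)))
```

Normalizing per million tokens is what makes counts comparable between corpora of different sizes. A phrase that never occurs in a large, diverse corpus is a hint (not proof) that it may not be natural-sounding.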

Below is a list of useful, publicly available corpora for Celtic languages.

Goidelic

Irish

Nua-Chorpas na hÉireann (New Corpus for Ireland), also known as the Foclóir.ie corpus

requires registration (free of charge)
all kinds of modern texts (fiction, poetry, official documents, newspaper articles, etc., from the 20th and 21st centuries) written by both native and non-native speakers
allows filtering by text type (native/non-native, specific dialect)
part-of-speech tagged
uses Sketch Engine
alternative new interface at focloir.sketchengine.eu
See Irish/Using corpas.focloir.ie for some additional tips

Historical Irish Corpus, also known as the RIA corpus

publicly available
literary texts composed in Irish between 1600 and 1926
part-of-speech tagged but with limited search functionality
allows searching for words in original spelling and by their modern standardized forms

Scottish Gaelic

Corpas na Gàidhlig DASG

corpus of literary texts in Scottish Gaelic, from the 12th to the 21st century (but mostly modern texts, 18th–21st c.)
apparently not part-of-speech tagged (unconfirmed)
based on Corpus Workbench (CWB) and CQPweb
allows filtering by time period, geographical origin, particular text
allows for complex queries using the CQP language (or at least a limited subset thereof)
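
To give an idea of what a CQP query looks like, here is a small Python sketch that builds one for a sequence of words. In CQP, each token is written as a bracketed expression such as `[word="..."]`, the value is treated as a regular expression, and the `%c` flag requests case-insensitive matching. The helper names (`cqp_token`, `cqp_phrase`) and the example phrase are made up for illustration, and whether this particular interface accepts the `%c` flag and the `word` attribute is an assumption, since it may only support a subset of CQP.

```python
import re

# Characters with special meaning in CQP's regular expressions, plus the
# double quote that delimits the value. Each gets a backslash in front.
_SPECIAL = re.compile(r'([\\.^$*+?()\[\]{}|"])')

def cqp_token(word, ignore_case=True):
    """Build a CQP token expression that matches one literal word form."""
    escaped = _SPECIAL.sub(r"\\\1", word)
    flag = "%c" if ignore_case else ""  # %c: case-insensitive match (assumed supported)
    return f'[word="{escaped}"{flag}]'

def cqp_phrase(words, ignore_case=True):
    """Build a CQP query matching the given words as consecutive tokens."""
    return " ".join(cqp_token(w, ignore_case) for w in words)

# Illustrative phrase only: "taigh mòr" ("big house").
print(cqp_phrase(["taigh", "mòr"]))
```

The resulting string would be pasted into the corpus's CQP query box. Escaping matters because an unescaped `.` or `?` in a search term would otherwise be read as a regular-expression operator rather than a literal character.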

Classical Gaelic

Irish Syllabic Poetry corpus

corpus of Classical Gaelic bardic poetry
part-of-speech tagged (although the tagging is very imperfect, since it is based on a tagging method for modern Irish)
uses non-normalized spelling (so finding a particular form can be difficult)
based on Bardic Poetry Database
uses Sketch Engine

Historical Irish Corpus (RIA corpus): mainly a Modern Irish corpus, but because it contains texts from 1600 onwards it is also useful for Early Modern texts and, occasionally, bardic poetry; see under Irish

Pages in category "Text corpora"

The following 2 pages are in this category, out of 2 total.

G

Gaelic/Using Corpas na Gàidhlig

I

Irish/Using Nua-Chorpas na hÉireann
