Difference between revisions of "Category:Text corpora"

Latest revision as of 06:55, 22 August 2025

A text corpus is a large collection of texts in a given language, often chosen to be as diverse as possible. Text corpora are very useful for checking how often a given expression is used in a language (and which expressions are rare or never occur), so they can be a great help to learners who want to check whether a certain construction sounds natural. They are also extremely useful for research into a language's grammar and how it changes over time. Many corpora also contain additional information about individual words, such as their part of speech and, where applicable, inflectional information like grammatical case, gender, tense, or mood.
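
As a rough illustration of the frequency check described above, the following Python sketch counts how often two competing expressions occur in a handful of sentences. The sample sentences and the `count_phrases` helper are invented for illustration and do not come from any of the corpora listed on this page; a real corpus interface runs the same kind of count over a far larger body of text.

```python
import re
from collections import Counter


def count_phrases(texts, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase in the texts."""
    counts = Counter()
    for phrase in phrases:
        # Lookarounds stop the phrase from matching inside a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE
        )
        counts[phrase] = sum(len(pattern.findall(text)) for text in texts)
    return counts


# Toy stand-in for a corpus (invented sentences, for illustration only).
sample_texts = [
    "Tá mé ag dul abhaile. Tá mé tuirseach.",
    "Táim ag dul abhaile anois.",
]

counts = count_phrases(sample_texts, ["tá mé", "táim"])
for phrase, n in counts.most_common():
    print(f"{phrase}: {n}")
```

With a sample this small the numbers mean nothing; the point is only that comparing raw counts of two variants is the basic operation a corpus search performs.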

Below is a list of useful publicly-available corpora for Celtic languages.

Brittonic

Breton

Cornish

Akademi Kernewek's Corpus search engine
A list of Cornish language publications by year, both for traditional texts and the Revival

Welsh

Goidelic

Irish

Nua-Chorpas na hÉireann (New Corpus for Ireland), also known as the Foclóir.ie corpus

requires registration (free-of-charge)
all kinds of modern texts (fiction, poetry, official documents, newspaper articles, etc. from the 20th and 21st centuries) written by both native and non-native speakers
allows filtering by text type (native/non-native, specific dialect)
part-of-speech tagged
uses Sketch Engine
alternative new interface at focloir.sketchengine.eu
See Irish/Using Nua-Chorpas na hÉireann for some additional tips

Corpas Stairiúil na Gaeilge (Historical Irish Corpus), also known as the RIA corpus

publicly available
literary texts composed in Irish between 1600 and 1926
part-of-speech tagged but with limited search functionality
allows searching for words in original spelling and by their modern standardized forms

Corpas Náisiúnta na Gaeilge (National Corpus of Irish), also known as the Gaois corpus

publicly available
the texts compiled for the corpus date from 2000 to 2024
part-of-speech tagged
uses Sketch Engine

Manx

Manx Corpus Search

publicly available
over 600 texts from 1610 to the present, accompanied by English translations
focus on pre-1908 native Manx literature, with the aim of storing everything written in Manx before 1908
open source: the search interface software is hosted at https://github.com/david-allison/manx-corpus-search, and the corpus data at https://github.com/david-allison/manx-search-data
if you need assistance or have a feature request, you may contact the corpus's maintainer on GitHub or on the Celtic Languages Discord (Discord handle davida0813)

Scottish Gaelic

Corpas na Gàidhlig DASG

publicly available
contains literary texts in Scottish Gaelic, from the 12th to the 21st century (but mostly modern, 18th–21st c. texts; might be extended in the future with transcriptions of spoken language)
not part-of-speech tagged
based on Corpus Workbench (CWB) with a modified CQPweb interface
allows filtering by time period, geographical origin, particular text
allows complex queries using CQP syntax (but limited by the lack of POS tagging)
See Gaelic/Using Corpas na Gàidhlig for some additional tips
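
To give an idea of what a CQP query looks like, here is a minimal Python sketch that assembles one as a string. The helper names `token` and `gap` are hypothetical, and it is an assumption that the modified DASG interface accepts every construct shown. Only the `word` attribute is used, since the corpus has no POS tags to query.

```python
def token(word_regex, ignore_case=True, ignore_diacritics=False):
    """Build one CQP token expression matching the `word` attribute.

    %c (ignore case) and %d (ignore diacritics) are standard CQP flags.
    """
    flags = ("c" if ignore_case else "") + ("d" if ignore_diacritics else "")
    suffix = f"%{flags}" if flags else ""
    return f'[word="{word_regex}"{suffix}]'


def gap(min_tokens, max_tokens):
    """Build a CQP expression matching between min and max arbitrary tokens."""
    return f"[]{{{min_tokens},{max_tokens}}}"


# Plain or lenited "taigh", up to two tokens of any kind, then plain or lenited "mòr".
query = " ".join([token("taigh|thaigh"), gap(0, 2), token("mòr|mhòr")])
print(query)
```

Because there are no POS tags, mutations and inflected forms have to be listed explicitly as alternatives in the `word` pattern, as the lenited variants are here.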

Classical Gaelic

Irish Syllabic Poetry corpus

corpus of Classical Gaelic bardic poetry
part-of-speech tagged (although the tagging is very imperfect, since it is based on a tagging method for Modern Irish)
uses non-normalized spelling (so finding a form can sometimes be difficult)
based on the Bardic Poetry Database
uses Sketch Engine

Historical Irish Corpus (corpus RIA) – mainly a Modern Irish corpus, but also useful for Early Modern texts and sometimes bardic poetry since it contains texts from 1600 and later, see under Irish

Old Irish

Corpus Palaeo-Hibernicum – created by the ChronHib project

78 Old Irish texts: OIr. glosses, Annals of Ulster, poems of Blathmac, and some tales
fully translated and with full annotations of morphological forms
it is structured around a spreadsheet-like table interface, but searching for specific forms requires displaying the whole table (so choose the biggest “Results per Page” value at the bottom, and beware that this may make the page quite slow)
the corpus text is in the Sentences table; the Lemmata table is essentially a glossary of all the words in the corpus, with translations and often etymological information; the Morphology table contains the whole corpus text broken down into morphological units, with POS tags and comments
it offers expected normalized spellings of corpus forms
the data can be exported to CSV files
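
Once a table has been exported, it can be searched offline without loading the whole table in the browser. The sketch below is only an illustration under stated assumptions: the column names (`ID`, `Morph`, `Lemma`, `Part_Of_Speech`) and the sample rows are invented, so check the header row of the actual export and adjust them.

```python
import csv
import io


def find_forms(csv_file, column, wanted):
    """Return all rows whose `column` equals `wanted`, ignoring case and outer spaces."""
    reader = csv.DictReader(csv_file)
    target = wanted.strip().lower()
    return [
        row for row in reader
        if (row.get(column) or "").strip().lower() == target
    ]


# Invented stand-in for an exported table; the real column names may differ.
sample_export = io.StringIO(
    "ID,Morph,Lemma,Part_Of_Speech\n"
    "1,fer,fer,noun\n"
    "2,fir,fer,noun\n"
    "3,ben,ben,noun\n"
    "4,Fir,fer,noun\n"
)

matches = find_forms(sample_export, "Morph", "fir")
for row in matches:
    print(row["ID"], row["Morph"], row["Lemma"])
```

For a real export, replace `sample_export` with an opened file, for example `open("export.csv", newline="", encoding="utf-8")`, where the file name is again a placeholder.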

Pages in category "Text corpora"

The following 2 pages are in this category, out of 2 total.

G

Gaelic/Using Corpas na Gàidhlig

I

Irish/Using Nua-Chorpas na hÉireann

@@ Line 2: / Line 2: @@
 Below is a list of useful publicly-available corpora for Celtic languages.
+== Brittonic ==
+=== Breton ===
+=== Cornish ===
+* [https://www.akademikernewek.org.uk/corpus/?locale=kw#en_ Akademi Kernewek's Corpus search engine]
+* [https://skrifakernewek.miraheze.org/wiki/Cornish_literature_by_year A list of Cornish language publications by year, both for traditional texts and the Revival]
+=== Welsh ===
 == Goidelic ==
 === Irish ===
-* [https://corpas.focloir.ie ''Nua-Chorpas na hÉireann''] (''New Corpus for Ireland''), also known as the ''Foclóir.ie corpus''
+* [http://corpas.focloir.ie ''Nua-Chorpas na hÉireann''] (''New Corpus for Ireland''), also known as the ''Foclóir.ie corpus''
 :* '''requires registration''' (free-of-charge)
 :* all kinds of modern texts (fiction, poetry, official documents, newspaper articles, etc. from 20th and 21st century) written by both native and non-native speakers
@@ Line 13: / Line 23: @@
 :* uses [https://en.wikipedia.org/wiki/Sketch_Engine Sketch Engine]
 :* alternative new interface at [https://focloir.sketchengine.eu focloir.sketchengine.eu]
-:* See [[Irish/Using corpas.focloir.ie]] for some additional tips
+:* See [[Irish/Using Nua-Chorpas na hÉireann]] for some additional tips
-* [http://corpas.ria.ie ''Historical Irish Corpus''], also known as the ''RIA corpus''
+* [http://corpas.ria.ie ''Corpas Stairiúil na Gaeilge''] (''Historical Irish Corpus''), also known as the ''RIA corpus''
 :* publicly available
 :* literary texts composed in Irish between 1600 and 1926
 :* part-of-speech tagged but with limited search functionality
 :* allows searching for words in original spelling and by their modern standardized forms
+* [https://www.corpas.ie ''Corpas Náisiúnta na Gaeilge''] (''National Corpus of Irish''), also known as the ''Gaois corpus''
+:* publicly available
+:* all texts compiled for the corpus span the 2000-2024 period
+:* part-of-speech tagged
+:* uses [https://en.wikipedia.org/wiki/Sketch_Engine Sketch Engine]
+=== Manx ===
+* [https://corpus.gaelg.im/ Manx Corpus Search]
+:* publicly available
+:* over 600 texts between 1610 and present, accompanied by English translations
+:* focus on pre-1908 native Manx literature with the aim to store '''everything''' written in Manx before 1908
+:* open source, the search interface software hosted at [https://github.com/david-allison/manx-corpus-search https://github.com/david-allison/manx-corpus-search], corpus data at [https://github.com/david-allison/manx-search-data https://github.com/david-allison/manx-search-data]
+:* if you need assistance, have a feature request, etc. you may contact the corpus’ maintainer on Github or on Celtic Languages Discord (Discord handle ''davida0813'')
 === Scottish Gaelic ===
 * [https://dasg.ac.uk/corpus/ ''Corpas na Gàidhlig'' DASG]
-:* corpus of literary texts in Scottish Gaelic, from 12th century to 21st century (but mostly modern, 18th–21st c. texts)
+:* publicly available
-:* '''not''' (?) part-of-speech tagged
+:* contains literary texts in Scottish Gaelic, from 12th century to 21st century (but mostly modern, 18th–21st c. texts; might be extended in the future with transcriptions of spoken language)
-:* based on [https://cwb.sourceforge.io/ Corpus Workbench (CWB)] and [https://cwb.sourceforge.io/cqpweb.php CQPWeb]
+:* '''not''' part-of-speech tagged
+:* based on [https://cwb.sourceforge.io/ Corpus Workbench (CWB)] with modified [https://cwb.sourceforge.io/cqpweb.php CQPWeb] interface
 :* allows filtering by time period, geographical origin, particular text
-:* allows for complex queries using the CQP language (or at least a limited subset thereof)
+:* allows complex queries using the CQP-syntax (but limited due to the lack of POS-tagging)
+:* See [[Gaelic/Using Corpas na Gàidhlig]] for some additional tips
 === Classical Gaelic ===
@@ Line 36: / Line 61: @@
 :* uses [https://en.wikipedia.org/wiki/Sketch_Engine Sketch Engine]
 * [http://corpas.ria.ie ''Historical Irish Corpus''] (corpus RIA) – mainly a Modern Irish corpus, but also useful for Early Modern texts and sometimes bardic poetry since it contains texts from 1600 and later, see under [[#Irish|Irish]]
+=== Old Irish ===
+* [https://chronhib.maynoothuniversity.ie/chronhibWebsite/tables ''Corpus Palaeo-Hibernicum''] – created by the ChronHib project
+:* 78 Old Irish texts: OIr. glosses, Annals of Ulster, poems of Blathmac, and some tales
+:* fully translated and with full annotations of morphological forms
+:* it’s structured around a spreadsheet-like table interface – searching for specific forms requires displaying the '''whole''' table though (so choose the biggest “Results per Page” value at the bottom, beware that it may make it pretty laggy)
+:* the corpus text is in the ''Sentences'' table, the ''Lemmata'' table is basically a glossary of all the words in the corpus, with translation and often etymological info, the ''Morphology'' table contains the whole corpus text broken down into morphological units with POS-tags and comments
+:* it offers expected normalized spellings of corpus forms
+:* the data can be exported to CSV files
