Difference between revisions of "Irish/Using Nua-Chorpas na hÉireann"
Line 1: | Line 1: | ||
The [ | The [http://corpas.focloir.ie ''New Corpus for Ireland'' or ''Nua-Chorpas na hÉireann''] (or ''the Foclóir corpus'') is a very useful tool for checking how some things are phrased in Irish and which expressions are used by native speakers and which ones are not. Unfortunately the corpus’s help page is not accessible and the UI isn’t very user-friendly. One can find some documentation for the software used there, but it’s not corpas-specific and thus not very helpful when working with this particular corpus of Irish. | ||
This page isn’t meant to be a comprehensive documentation of the corpus, but at least a list of hints that would make your work with the corpus a bit more efficient. For more comprehensive documentation, see [[#External documentation|external links]] below. | This page isn’t meant to be a comprehensive documentation of the corpus, but at least a list of hints that would make your work with the corpus a bit more efficient. For more comprehensive documentation, see [[#External documentation|external links]] below. | ||
Line 36: | Line 36: | ||
* <code>*</code> standing for any number of characters, | * <code>*</code> standing for any number of characters, | ||
* <code>?</code> for any single character in your queries, | * <code>?</code> for any single character in your queries, | ||
* <code>|</code> meaning ''or'', allowing you to list | * <code>|</code> meaning ''or'', allowing you to list multiple words or phrases, | ||
* [https://www.sketchengine.eu/guide/regular-expressions/#toggle-id-1 and some more]. | * [https://www.sketchengine.eu/guide/regular-expressions/#toggle-id-1 and some more]. | ||
Line 118: | Line 118: | ||
=== POS-tags === | === POS-tags === | ||
'''TODO''' | As was mentioned before, every single token (word or punctuation) in the corpus is tagged with part-of-speech information. The tags are stored as text values under the attribute <code>tag</code> of the tokens and they all follow the same general pattern. This section gives an overview of the format used in tags of the Irish corpus but it’s not extensive. For the extensive list of all information contained in the tagset, see [[#External documentation|the links below]]. | ||
<div class="warningbox">'''Beware!''' There are instances of tokens being tagged incorrectly. So don’t rely on the tags too much, and when you fail to find a specific form that interests you, try searching for it using the <code>word</code> or <code>lc</code> attributes. Also note that some wrongly tagged words will have their <code>lemma</code> wrong too. You can see what tag and lemma were assigned to specific tokens in your result list in the '''new''' interface (you can enable displaying of lemmata and pos-tags under the eye icon above the result set). | |||
For example: | |||
* the word '''Gaedhealaibh''' ‘Gaels’ has the tag <code>Npmcs</code> and lemma <code>Gaedhealaibh</code>, so it’s “a proper masculine noun ''Gaedhealaibh'' in its nominative singular form” even though in reality it is an old spelling of the dative plural form of the word ''Gael'', | |||
* the lenited form '''Ghaedhealaibh''' is tagged similarly, except that the lemma is given as <code>Ghaedhealaibh</code>, | |||
* the word '''Polanach''' ‘a Pole’ in its noun sense is tagged as <code>Ncmcs</code> (ie. a common noun) with lemma <code>polannach</code> even though it is a proper noun and should be lemmatized as <code>Polannach</code>, | |||
* the words '''snámh''', '''léamh''', etc. in the construction ''tá snámh agam'' ‘I can swim’ are tagged as <code>Ncmsc</code> (common nouns) even though they are '''verbal''' nouns in this construction, | |||
* the words '''déanta''', '''tagtha''', etc. in the construction ''tá déanta agam'' ‘I have done…’ are tagged as <code>Nv--g</code> (verbal nouns in genitive) when in reality they are '''passive participles''' and thus should match <code>Av.*</code> (ie. verbal adjectives), | |||
* the word '''ráite''' in the same construction is marked <code>Ncmcp</code> – ie. as a ''common'' noun in ''plural'' (sic!) | |||
* on the other hand, forms like '''déanta''', '''pósta''', '''bpósta''' in phrases like ''chun a dhéanta'' ‘in order to do it’ are tagged with <code>Aqc</code> (a comparative adjective) or <code>Ncmsg</code> (a common noun in gen.) when in fact they’re verbal nouns in the genitive!</div> | |||
The first letter of the tag is uppercase and denotes the part of speech, the meaning of the following characters depends on the first one, but generally the next one is a subtype of the part of speech (eg. <code>N</code> means a noun, <code>Nv</code> means a ''verbal'' noun, etc.). Since you need to use [[#CQL|CQL]] and regex to use POS-tags in your searches, the following lists will provide '''regexes''' matching the forms in question. The list of basic parts of speech is as follows: | |||
* <code>N.*</code> – '''nouns''', | |||
* <code>V.*</code> – (finite) '''verbs''' – excluding verbal nouns and participles (at least in the places where they’re correctly tagged), | |||
* <code>A.*</code> – '''adjectives''' (including passive participles), | |||
* <code>P.*</code> – '''pronouns''', including personal pronouns (but not possessives), “conjugated prepositions” aka “prepositional pronouns”, and sometimes demonstratives (like ''seo, sin'', etc.), | |||
* <code>D.*</code> – '''determiners''' like ''seo'', ''sin'', ''úd'', ''eile'', ''uile'', ''aon'', ''gach'', and also possessive pronouns like ''mo'', ''do'', ''a'', and interrogatives like ''cé'', ''cá'', | |||
* <code>T.*</code> – '''articles''', really just <code>Td.*</code>, forms of the definite article, | |||
* <code>R.*</code> – '''adverbs''' of place, means, time, etc., | |||
* <code>S.*</code> – '''adpositions''', basically always <code>Sp.*</code> – '''prepositions''' (''le'', ''leis'' before the article, ''ar'', ''i'', etc., also compound prepositions like ''tar éis'' and similar), | |||
* <code>C.*</code> – '''conjunctions''', | |||
* <code>M.*</code> – '''numerals''', | |||
* <code>I.*</code> – '''interjections''', | |||
* <code>U.*</code> – “unique membership class” which means various types of '''particles''', elements common in surnames, etc., | |||
* <code>X.*</code> – “residuals”: foreign words, abbreviations, dates, numbers, and things the tagger considered “unknown”, | |||
* <code>F.*</code> – '''punctuation''' marks, | |||
* <code>Y.*</code> – common '''abbreviations''', | |||
* <code>W.*</code> – '''copula''' (''is'', ''ba'', ''gur''…), | |||
* <code>Q.*</code> – verbal '''particles''' like ''ní'', ''an'', etc. | |||
As for some more important additional specific information in the tags: | |||
* '''nouns''' | |||
:* have subtypes: | |||
::* <code>Nc.*</code> – common nouns, | |||
::* <code>Nv.*</code> – verbal nouns, | |||
::* <code>Np.*</code> – proper nouns, | |||
::* <code>Ns.*</code> – other substantivized words, | |||
:: then they have gender and number, eg.: | |||
::* <code>N.ms.*</code> – singular masculine nouns, | |||
::* <code>N.fp.*</code> – plural feminine nouns, etc. | |||
:* case: | |||
::* <code>N...c.*</code> – the “common” case, ie. nominative or dative (note that for separate dative forms there is a different value), also as direct objects of verbs, | |||
::* <code>N...g.*</code> – genitive, | |||
::* <code>N...v.*</code> – vocative, | |||
::* <code>N...d.*</code> – dative, when a separate dative form is used (but note that it’s very imperfectly tagged, dative plurals are not included, etc.); | |||
* '''verbs''' | |||
:* always have the subtype “main” <code>Vm.*</code>, | |||
:* mark mood: | |||
::* <code>Vmi.*</code> – indicative, | |||
::* <code>Vms.*</code> – subjunctive, | |||
::* <code>Vmm.*</code> – imperative, | |||
::* <code>Vmc.*</code> – conditional, | |||
:* tense: | |||
::* <code>Vm.p.*</code> – present, | |||
::* <code>Vm.s.*</code> – past, | |||
::* <code>Vm.h.*</code> – past habitual, | |||
::* <code>Vm.f.*</code> – future, | |||
::* <code>Vm.g.*</code> – present habitual, | |||
:* person: | |||
::* <code>Vm..1.*</code> – first person, | |||
::* <code>Vm..2.*</code> – second person, etc. | |||
::* <code>Vm..0.*</code> – autonomous forms, | |||
:* number: | |||
::* <code>Vm...s.*</code> – singular, | |||
::* <code>Vm...p.*</code> – plural, | |||
* '''TODO: some other parts-of-speech''' | |||
Sometimes a word has multiple tags – that happens when the tagger could not assign one tag unambiguously. In such instances the tags are separated with the character <code>|</code>. Thus, to find '''all''' instances that the tagger considered ''possible'' plural verbs one has to write <code>.*Vm...p.*</code> (or <code>(.*|)?Vm...p.*</code>) to account for the other tags before the verbal tag. | |||
== External documentation == | == External documentation == |
Latest revision as of 13:18, 6 August 2023
The New Corpus for Ireland or Nua-Chorpas na hÉireann (or the Foclóir corpus) is a very useful tool for checking how some things are phrased in Irish and which expressions are used by native speakers and which ones are not. Unfortunately the corpus’s help page is not accessible and the UI isn’t very user-friendly. One can find some documentation for the software used there, but it’s not corpas-specific and thus not very helpful when working with this particular corpus of Irish.
This page isn’t meant to be a comprehensive documentation of the corpus, but at least a list of hints that would make your work with the corpus a bit more efficient. For more comprehensive documentation, see external links below.
First steps
To use the corpus, you first have to create an account using the registration form. Registration is free, but you will have to wait until your account is accepted before you’ll be able to log in and use the corpus.
Old and new interface
When you log in, you’ll see the old Sketch Engine web interface. You can use it, but it is also possible to access the new interface by logging into focloir.sketchengine.eu instead. The new interface is generally much more user-friendly (and compatible with the official Sketch Engine documentation) but beware: some features don’t work with it (for example word sketches work in the old interface, but not in the new one).
You can follow this guide in the old interface, unless it refers explicitly to the new one.
Simple querying
When you log into the corpus, you’ll see the ⟨Home⟩ (⟨Leathanach Tosaigh⟩) screen with an input form to perform a simple search. As the prompt says, you can type words or phrases in there. If you type a lemma form of a word (ie. the base form that you’d find in a dictionary), it will search for any occurrence of that word in any form in the corpus. And it will treat every word in a phrase this way.
This means that if you type bí madra ag (‘to be, dog, at’), you’ll see results such as:
- bhí madra agamsa ‘I had a dog’,
- tá madraí aige siúd ‘that one has dogs’, etc.
You’ll also see the number of all results at the top (Hits: 31 or Amas: 31).
If a word you type in is not a lemma form, only sentences that match this form exactly will be found. So if you type bí madraí ag (‘to be, dogs, at’), you’ll get results like:
- Beidh madraí ag Waterloo ‘there will be dogs at Waterloo’,
- tá madraí aige siúd ‘that one has dogs’,
but no instances of singular madra (and you’ll see that the number of results dropped to 8).
The default search is case-insensitive: you can type either madraí or MADRAÍ and you’ll get the same set of results.
Wildcards (new interface only)
If you use the new interface, you can perform simple searches in the ⟨Concordance⟩ tab with ⟨Simple⟩ query type chosen. You can use wildcard characters:
- * standing for any number of characters,
- ? for any single character in your queries,
- | meaning or, allowing you to list multiple words or phrases,
- and some more.
Thus you can for example type bí * ag and get all occurrences of the verb bí and its forms (tá, raibh, beidh, etc.) followed by any word, followed by any form of the preposition ag. The result list will contain:
- tá feidhm ag na fóralach cosúlacha…,
- … nach raibh feidhm aige…,
- Ceangaltais a bheidh déanta ag an gComhphobal, etc.
You can also type just a part of the word, eg. feoil* will find occurrences of every word that starts with feoil or whose lemma starts with feoil, so the list will include: feoil, feola, feoilséantóir, mhuicfheoil (it’s lemmatized as feoil), feoilmhian, etc. If you type Ga?l you’ll get results for both Gael and Gall (and also gaol and gail).
You can use the | to list multiple words or phrases that are supposed to match in your query. For example, if you type snámh|léamh you’ll find all instances of the verbal nouns snámh ‘swimming’ and léamh ‘reading’; if you type bí snámh ag|bí léamh ag you’ll find all instances of the tá ⟨verbal noun⟩ agam ‘I can ⟨verb⟩’ construction with the verbal nouns for ‘swim’ and ‘read’, regardless of tense or grammatical person.
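The wildcard behaviour described above can be imitated offline with a small Python sketch. It is only an approximation under stated assumptions: the helper names wildcard_to_regex and matches are made up for this page, the pattern is compared against a single word form only (the corpus also looks at lemmata, which is not modelled here), and matching is case-insensitive like the default search.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple-query pattern using *, ? and | into a regex.

    Hypothetical helper: * becomes "any number of characters",
    ? becomes "any single character", | separates alternatives.
    """
    alternatives = []
    for alternative in pattern.split("|"):
        parts = []
        for ch in alternative:
            if ch == "*":
                parts.append(".*")
            elif ch == "?":
                parts.append(".")
            else:
                parts.append(re.escape(ch))  # everything else is literal
        alternatives.append("".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")", re.IGNORECASE)

def matches(pattern: str, word: str) -> bool:
    """True if the whole word form is covered by the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

for word in ["Gael", "Gall", "gaol", "gail", "Gaeil"]:
    print(word, matches("Ga?l", word))
```

Note that matches("feoil*", "mhuicfheoil") is False in this sketch, because only the word form is checked; the corpus finds mhuicfheoil through its lemma.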
Filtering the results
If you want to filter the results using criteria like texts written only by native speakers or only Munster Irish, you need to enter the Concordance screen. To do that you need to click ⟨>> More⟩ (⟨>> Tuilleadh⟩) under the results list, then in the menu on the left click ⟨Filter⟩ (⟨Scagaire⟩), and that will bring you to a screen where you can select your filtering criteria and confirm them by clicking ⟨Filter Concordance⟩ (⟨Déan an Comhchordacht a Scagadh⟩). This will take you to the concordance results screen with the results filtered.
CQL
If you require more power – you want to find utterances matching complex patterns – you can always reach for CQL by entering the ⟨Concordance⟩ (⟨Comhchordacht⟩) tab and switching the query type to ⟨CQL⟩.
The Corpus Query Language (CQL) allows you to make complex regex-like queries, including things like looking for phrases containing specific parts of speech or inflectional forms – that’s possible because every word in the corpus is tagged with information about its part-of-speech and inflectional form. Using CQL is more complex than simple searching for words, but it enables you to be much more flexible in your searches.
Every single token (word or punctuation mark) in the Foclóir corpus has multiple attributes with text values associated with it. CQL allows you to query values of those attributes. The available attributes are:
- word – the word itself, verbatim as it appears in the text,
- lc (word (lowercase)) – the word but with all uppercase characters changed to lowercase,
- lemma – the lemma, ie. the base dictionary form of the word,
- lemma_lc (lemma (lowercase)) – the lemma with all characters changed to lowercase,
- tag – the part-of-speech tag with inflectional information for the word,
- lempos – lemma and a single character representing the part-of-speech, separated with a dash.
For example the sentence-initial verb Tá ‘is’ will have its word value equal to Tá, lc=tá, lemma=bí, lemma_lc=bí, tag=Vmip, lempos=bí-v. Some details of the tag format will be explained later. Most words will have their lemma and lemma_lc values equal, but some words that are typically written with initial capital will have those two different (eg. Éireannach ‘an Irishman’ has lemma=Éireannach and lemma_lc=éireannach).
The most basic element of CQL queries is a pair of square brackets matching a token: [] – this means ‘any word or punctuation mark’. You can add a condition of the form attribute = "value-regex" inside the brackets to limit what the query will match.
For example the query [lemma="bí"] will match every occurrence of any form of the verb bí (the results will be the same as when typing bí in the simple search). The query [word="bí"] will find all instances of all-lowercase bí (and only bí, not any other form) in the corpus – this is impossible to get using the simple query.
When you type attribute = "value" the value is treated as a regex, so you can for example write [word = "dra((ío)|(oidhea))cht"] to find all occurrences of both the old (draoidheacht) and the new (draíocht) spelling of the word draíocht ‘magic, witchcraft, druidism’. You can also type [word = ".*nnach[td]"] to find all the words ending in -nnacht or -nnachd.
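To get a feel for how such value regexes behave before running them on the corpus, here is a minimal Python sketch. It assumes that the regex has to cover the whole attribute value (which is what the .*nnach[td] example implies) and models that with re.fullmatch. The helper value_matches and the sample words are chosen only for illustration; they are not part of the corpus software, and Python’s regex dialect may differ from the corpus’s in details.

```python
import re

def value_matches(value_regex: str, value: str) -> bool:
    """Assumption: the regex must match the entire attribute value."""
    return re.fullmatch(value_regex, value) is not None

SPELLINGS = "dra((ío)|(oidhea))cht"   # regex from [word = "dra((ío)|(oidhea))cht"]
ENDING = ".*nnach[td]"                # regex from [word = ".*nnach[td]"]

print(value_matches(SPELLINGS, "draíocht"))      # True: new spelling
print(value_matches(SPELLINGS, "draoidheacht"))  # True: old spelling
print(value_matches(SPELLINGS, "draíochta"))     # False: extra ending is not covered
print(value_matches(ENDING, "beannacht"))        # True: ends in -nnacht
print(value_matches(ENDING, "beannachtaí"))      # False: does not end in -nnacht/-nnachd
```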
Thus, to find all instances of the tá ⟨verbal noun⟩ agam ‘I can ⟨verb⟩’ construction with any form of the verb tá, for any grammatical person, with the verbal nouns snámh ‘swim’ and léamh ‘read’ (including old spelling léigheamh), you can write:
[lemma = "bí"] [lc = "snámh|lé(ighe)?amh"] [lemma = "ag"]
(which at the time of writing doesn’t find any instance with the old spelling, only snámh and léamh).
You can omit the brackets and attribute name for the default attribute, so for example if you choose ⟨Default attribute⟩ (⟨Aitreabúid réamhshocraithe⟩) in your search to be ⟨word (lowercase)⟩, you can shorten the above to:
[lemma = "bí"] "snámh|lé(ighe)?amh" [lemma = "ag"]
If you want to search for a specific string without treating it as a regex, you can use the double equals sign ==, for example [word == "."] will find all the instances of the period punctuation mark.
You can combine multiple conditions and perform Boolean logic on them using the following operators: & (and), | (or), ! (not), and group them with parentheses ( ). So, given the information that adjectives in the corpus have tags beginning with the letter A, you can find all correctly tagged sequences of two consecutive adjectives with lenition marked in writing with:
[word = "[cptgbdms]h.*" & tag = "A.*"] [word = "[cptgbdms]h.*" & tag = "A.*"]
(or using the lempos attribute as [lempos = "[cptgbdms]h.*-j"] [lempos = "[cptgbdms]h.*-j"], don’t ask me why adjectives have -j there).
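The logic of the two-adjective query can be mimicked on a toy token list in Python. This is only a sketch: the tokens and their tags are made up for illustration (real corpus tags are longer than the bare Aq used here), the function names are hypothetical, and whole-value matching is again an assumption modelled with re.fullmatch.

```python
import re

LENITED_WORD = re.compile(r"[cptgbdms]h.*")  # word = "[cptgbdms]h.*"
ADJECTIVE_TAG = re.compile(r"A.*")           # tag = "A.*"

def is_lenited_adjective(token: dict) -> bool:
    """Both conditions must hold, like the & operator inside one [ ]."""
    return bool(LENITED_WORD.fullmatch(token["word"])
                and ADJECTIVE_TAG.fullmatch(token["tag"]))

def find_pairs(tokens: list) -> list:
    """Return every pair of consecutive tokens that both satisfy the condition."""
    return [(a["word"], b["word"])
            for a, b in zip(tokens, tokens[1:])
            if is_lenited_adjective(a) and is_lenited_adjective(b)]

# Made-up sample: "an bhean mhaith chliste" with illustrative tags.
tokens = [
    {"word": "an", "tag": "Td"},
    {"word": "bhean", "tag": "Ncfsc"},   # lenited, but a noun, so it is rejected
    {"word": "mhaith", "tag": "Aq"},
    {"word": "chliste", "tag": "Aq"},
]

print(find_pairs(tokens))
```

The noun bhean is lenited but fails the tag condition, so only the pair mhaith chliste is reported.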
You can also add a question mark ? after a token to mark it as optional, and braces with two numeric values: {x, y} to say that you want the given token query to match between x and y times.
For example to find most instances of the construction is maith/fearr liom ⟨verbal noun phrase⟩ ‘I like/prefer to ⟨infinitive phrase⟩’, regardless of person (liom, leat, leis an bhfear…) or whether the verbal noun has an object (arán a ithe ‘to eat bread’) or not (snámh ‘to swim’), you could write something like:
[tag = "W.*"] [lemma = "maith"] [lemma = "le"] ([tag = "[TD].*"]? [tag = "N.*"] [tag = "A.*"]{0,5} [tag = "D.*"]?)? ([tag = "Dp.*"] | ([tag = "[TD].*"]? [tag = "N.*"] [tag = "A.*"]{0,5} [tag = "D.*"]? [word = "a|do"]) | ([word = "mé|t(h)?ú|é|í|sinn|sibh|iad"] [word = "a|do"]))? [tag = "Nv.*"]
Let’s break this monstrosity down into parts:
- [tag = "W.*"] – the copula, it has tags beginning in W,
- [lemma = "maith"] – the adjective maith or any of its forms (lenited mhaith, comparative f(h)earr, etc.),
- [lemma = "le"] – the preposition le or any of its forms (leis, liom, etc.),
- ( … )? – optionally a group representing a noun phrase subject, containing:
  - [tag = "[TD].*"]? – either an article (an, na), tags beginning with T, or a determiner (like possessive mo, a, etc.), tags beginning in D,
  - [tag = "N.*"] – a noun,
  - [tag = "A.*"]{0, 5} – zero up to five adjectives,
  - [tag = "D.*"]? – an optional following determiner (like seo or sin),
- ( … )? – optionally a group representing the direct object, containing:
  - … | … | … – either:
    - [tag = "Dp.*"] – a possessive pronoun (mo ‘my’, do ‘your’, a ‘his/her/their’, etc.),
    - or ( … ) – another noun phrase group:
      - [tag = "[TD].*"]? [tag = "N.*"] [tag = "A.*"]{0,5} [tag = "D.*"]? – article, noun, adjectives, determiner, like before,
      - [word = "a|do"] – the particle a (or its older form do) verbatim after the direct object,
    - or ([word = "mé|t(h)?ú|é|í|sinn|sibh|iad"] [word = "a|do"]) – a pronoun object followed by a or do,
- [tag = "Nv.*"] – finally, the verbal noun.
This query still doesn’t catch everything (it doesn’t allow any genitive attributes in the noun phrases, for example) – but it shows the flexibility you get with CQL.
To learn more visit the official guide to Corpus Query Language linked below.
POS-tags
As was mentioned before, every single token (word or punctuation) in the corpus is tagged with part-of-speech information. The tags are stored as text values under the attribute tag of the tokens and they all follow the same general pattern. This section gives an overview of the format used in tags of the Irish corpus but it’s not exhaustive. For the full list of all information contained in the tagset, see the links below.
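Beware! There are instances of tokens being tagged incorrectly. So don’t rely on the tags too much, and when you fail to find a specific form that interests you, try searching for it using the word or lc attributes. Also note that some wrongly tagged words will have their lemma wrong too. You can see what tag and lemma were assigned to specific tokens in your result list in the new interface (you can enable displaying of lemmata and pos-tags under the eye icon above the result set).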
For example:
- the word Gaedhealaibh ‘Gaels’ has the tag Npmcs and lemma Gaedhealaibh, so it’s “a proper masculine noun Gaedhealaibh in its nominative singular form” even though in reality it is an old spelling of the dative plural form of the word Gael,
- the lenited form Ghaedhealaibh is tagged similarly, except that the lemma is given as Ghaedhealaibh,
- the word Polanach ‘a Pole’ in its noun sense is tagged as Ncmcs (ie. a common noun) with lemma polannach even though it is a proper noun and should be lemmatized as Polannach,
- the words snámh, léamh, etc. in the construction tá snámh agam ‘I can swim’ are tagged as Ncmsc (common nouns) even though they are verbal nouns in this construction,
- the words déanta, tagtha, etc. in the construction tá déanta agam ‘I have done…’ are tagged as Nv--g (verbal nouns in genitive) when in reality they are passive participles and thus should match Av.* (ie. verbal adjectives),
- the word ráite in the same construction is marked Ncmcp – ie. as a common noun in plural (sic!),
- on the other hand, forms like déanta, pósta, bpósta in phrases like chun a dhéanta ‘in order to do it’ are tagged with Aqc (a comparative adjective) or Ncmsg (a common noun in gen.) when in fact they’re verbal nouns in the genitive!
The first letter of the tag is uppercase and denotes the part of speech, the meaning of the following characters depends on the first one, but generally the next one is a subtype of the part of speech (eg. N means a noun, Nv means a verbal noun, etc.). Since you need to use CQL and regex to use POS-tags in your searches, the following lists will provide regexes matching the forms in question. The list of basic parts of speech is as follows:
- N.* – nouns,
- V.* – (finite) verbs – excluding verbal nouns and participles (at least in the places where they’re correctly tagged),
- A.* – adjectives (including passive participles),
- P.* – pronouns, including personal pronouns (but not possessives), “conjugated prepositions” aka “prepositional pronouns”, and sometimes demonstratives (like seo, sin, etc.),
- D.* – determiners like seo, sin, úd, eile, uile, aon, gach, and also possessive pronouns like mo, do, a, and interrogatives like cé, cá,
- T.* – articles, really just Td.*, forms of the definite article,
- R.* – adverbs of place, means, time, etc.,
- S.* – adpositions, basically always Sp.* – prepositions (le, leis before the article, ar, i, etc., also compound prepositions like tar éis and similar),
- C.* – conjunctions,
- M.* – numerals,
- I.* – interjections,
- U.* – “unique membership class” which means various types of particles, elements common in surnames, etc.,
- X.* – “residuals”: foreign words, abbreviations, dates, numbers, and things the tagger considered “unknown”,
- F.* – punctuation marks,
- Y.* – common abbreviations,
- W.* – copula (is, ba, gur…),
- Q.* – verbal particles like ní, an, etc.
As for some more important additional specific information in the tags:
- nouns
  - have subtypes:
    - Nc.* – common nouns,
    - Nv.* – verbal nouns,
    - Np.* – proper nouns,
    - Ns.* – other substantivized words,
  - then they have gender and number, eg.:
    - N.ms.* – singular masculine nouns,
    - N.fp.* – plural feminine nouns, etc.
  - case:
    - N...c.* – the “common” case, ie. nominative or dative (note that for separate dative forms there is a different value), also as direct objects of verbs,
    - N...g.* – genitive,
    - N...v.* – vocative,
    - N...d.* – dative, when a separate dative form is used (but note that it’s very imperfectly tagged, dative plurals are not included, etc.);
- verbs
  - always have the subtype “main” Vm.*,
  - mark mood:
    - Vmi.* – indicative,
    - Vms.* – subjunctive,
    - Vmm.* – imperative,
    - Vmc.* – conditional,
  - tense:
    - Vm.p.* – present,
    - Vm.s.* – past,
    - Vm.h.* – past habitual,
    - Vm.f.* – future,
    - Vm.g.* – present habitual,
  - person:
    - Vm..1.* – first person,
    - Vm..2.* – second person, etc.
    - Vm..0.* – autonomous forms,
  - number:
    - Vm...s.* – singular,
    - Vm...p.* – plural,
- TODO: some other parts-of-speech
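The positional layout of noun and verb tags listed above can be summed up in a small Python decoder. It is a sketch built only from the values mentioned on this page, so it is incomplete by design: the function decode_tag and the table names are made up, codes that are not listed here (for example a third-person digit) are passed through unchanged, and other parts of speech are not decoded at all.

```python
# Position-by-position meaning of noun and verb tags, as listed on this page.
NOUN_FIELDS = [
    ("subtype", {"c": "common", "v": "verbal", "p": "proper", "s": "substantivized"}),
    ("gender", {"m": "masculine", "f": "feminine"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"c": "common", "g": "genitive", "v": "vocative", "d": "dative"}),
]
VERB_FIELDS = [
    ("subtype", {"m": "main"}),
    ("mood", {"i": "indicative", "s": "subjunctive", "m": "imperative", "c": "conditional"}),
    ("tense", {"p": "present", "s": "past", "h": "past habitual",
               "f": "future", "g": "present habitual"}),
    ("person", {"1": "first", "2": "second", "0": "autonomous"}),
    ("number", {"s": "singular", "p": "plural"}),
]
FIELDS_BY_POS = {"N": ("noun", NOUN_FIELDS), "V": ("verb", VERB_FIELDS)}

def decode_tag(tag: str) -> dict:
    """Decode a single noun or verb tag; unknown codes are kept as they are."""
    if tag[:1] not in FIELDS_BY_POS:
        return {"pos": tag[:1]}  # not covered by this sketch
    pos, fields = FIELDS_BY_POS[tag[0]]
    decoded = {"pos": pos}
    # zip stops at the shorter sequence, so short tags like Vmip are fine.
    for (name, values), code in zip(fields, tag[1:]):
        decoded[name] = values.get(code, code)
    return decoded

print(decode_tag("Ncmsg"))
print(decode_tag("Vmip"))
print(decode_tag("Nv--g"))
```

For Nv--g the gender and number come back as "-", which is how the tag itself marks them as unspecified.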
Sometimes a word has multiple tags – that happens when the tagger could not assign one tag unambiguously. In such instances the tags are separated with the character |. Thus, to find all instances that the tagger considered possible plural verbs one has to write .*Vm...p.* (or (.*|)?Vm...p.*) to account for the other tags before the verbal tag.
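Why the leading .* is needed can be seen in a short Python sketch. The ambiguous tag value below is made up for illustration, the function names are hypothetical, and the sketch again assumes that a regex has to match the whole tag value (modelled with re.fullmatch).

```python
import re

PLURAL_VERB = "Vm...p.*"

def whole_value_matches(regex: str, tag_value: str) -> bool:
    """Assumption: the regex is matched against the entire tag value."""
    return re.fullmatch(regex, tag_value) is not None

def any_alternative_matches(regex: str, tag_value: str) -> bool:
    """Split an ambiguous value on '|' and test each candidate tag on its own."""
    return any(re.fullmatch(regex, tag) for tag in tag_value.split("|"))

ambiguous = "Ncmsc|Vmip1p"  # made-up value: noun reading first, plural verb second

print(whole_value_matches(PLURAL_VERB, ambiguous))         # False: value starts with N
print(whole_value_matches(".*" + PLURAL_VERB, ambiguous))  # True: .* absorbs "Ncmsc|"
print(any_alternative_matches(PLURAL_VERB, ambiguous))     # True: second alternative fits
```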
External documentation
- Irish tagset – extensive list of part-of-speech tags available in the corpus
- Sketch Engine User Guide – a guide to a newer version of the software the Corpas is using. The graphical interface presented in the guide is completely different to what you’ll find on corpas.focloir.ie, but the principles described there will generally be valid for the Corpas too. Among things you’ll find there are: