Gaelic/Using Corpas na Gàidhlig

The corpus now has a new interface and requires registering an account, so this guide is slightly outdated – but the basics of writing queries should stay exactly the same.

Corpas na Gàidhlig (the DASG corpus) is a publicly available corpus of Scottish Gaelic literary texts (from 1200 to the 21st century, though most texts are modern, ie. 19th century and later) – books, newspapers, poems, advertisements. It may later be expanded with non-literary texts too (eg. transcriptions of recorded folk tales).

The interface of the corpus uses the open source Corpus Workbench (CWB) software and a custom modification of the CQPweb web interface.

The texts included in the corpus are annotated with information about the time period they’re from, the literary type of the work, its author, etc. Unfortunately, the words are not annotated with part-of-speech tags and there’s no meta-information about the structure of the sentences, which somewhat limits the queries that are possible. Still, the interface allows users to use wildcards in queries and to write complex queries in the CQP query syntax.

This article is not meant to be comprehensive documentation of the corpus’s interface; its aim is to provide some tips on how to use it effectively.

Simple query

When you go to the Corpus home page you’ll see a ⟨Standard Query⟩ form with a text area for inputting a simple query. This is the simplest way to search texts in the corpus. If you insert a word or a sentence, it’ll try to find occurrences of the string of words as provided, in a case-insensitive manner by default. That means that if you type tha, you’ll get results with both sentence-initial Tha ‘is’ and tha in the middle of a sentence, but you won’t find other words beginning with tha. If you type cha robh leughadh aige you’ll find a sentence beginning with Cha robh leughadh aige ‘He could not read’.

All punctuation marks are treated as their own tokens, so if you input any commas, dots, apostrophes, etc. they’ll get separated from the words and they also have to match the texts. Thus the queries b ' fheàrr leam and b fheàrr leam will return different result sets. You don’t need to type a space between a word and the apostrophe yourself – it will be added automatically by CQPweb (so if you type b' fheàrr leam the query will automatically change to b ' fheàrr leam).
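The effect of this tokenisation can be imitated with a few lines of Python. This is only a rough sketch: the function name tokenize and the splitting rule are assumptions made for illustration, and the corpus’s real tokeniser may well treat hyphens, digits and other cases differently.

```python
import re
import unicodedata


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation-mark tokens.

    Hypothetical helper: it only approximates how the corpus separates
    punctuation from words.
    """
    # Normalise so that accented letters such as "à" are single code points.
    text = unicodedata.normalize("NFC", text)
    # \w+ takes a run of letters or digits; [^\w\s] takes one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


# With or without a space before the apostrophe, the tokens are the same ...
print(tokenize("b' fheàrr leam"))
print(tokenize("b ' fheàrr leam"))
# ... but leaving the apostrophe out gives a different token sequence.
print(tokenize("b fheàrr leam"))
```

Under these assumptions the first two inputs yield the same four tokens, while the third yields only three, which is why the two queries above return different result sets.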

But simple queries are much more powerful than that. They allow you to use some wildcards and also to group words using parentheses. You can use these special characters to define words in your queries:

  • ? to substitute any single character, eg. leu?hadh will find all occurrences of two different spelling variants of the verbal noun ‘reading’: leughadh and leubhadh;
  • * to substitute any string of characters (ie. zero or more of anything), so you can for example type leugh* to find all the forms of the verb leugh ‘read’ (leugh, leughadh, leughaidh, leughainn, etc. – anything starting with leugh, including leugh itself), and * in itself means ‘any or no token’ (tha * agam will find instances of tha fhios agam ‘I know’, and tha dòchas agam ‘I hope’, and tha e agam ‘I have it’, etc. and it will find tha agam without any noun in between);
  • + to substitute at least one character (ie. one or more of anything), for example leugh+ will find leughaidh ‘reads, will read’ and leughadh ‘reading, would read’, but it won’t find leugh itself;
  • brackets [x,y] to list alternative strings of characters separated with commas, eg. [t,b]ha finds two forms of the verb bi ‘to be’: present independent tha and past independent bha; [leugh,sguab]adh will find both leughadh ‘reading’ and sguabadh ‘sweeping’.

To search for one of those characters verbatim in a text, you have to escape it with a backslash. So to find questions of the form A bheil cat agad? ‘Do you have a cat?’, ending in a question mark, with any single noun and in any grammatical tense, you could write:

A[m,n,] [bheil,robh,bi,biodh] + agad \?

Note the \? matching exactly question marks in texts. This will find phrases like Am bheil airgiod agad? ‘Do you have money?’, … an robh fhios agad? ‘… did you know?’, A bheil salainn agad? ‘Do you have salt?’, etc.
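One way to get a feel for the wildcards is to translate each query word into an ordinary regular expression. The Python sketch below does that and slides the query over a list of tokens. The function names word_to_regex and find_matches are made up for this illustration, and the translation is an approximation rather than the interface’s actual implementation. It treats every query word as exactly one token, so the special meaning of a lone * (‘any or no token’) is not modelled.

```python
import re


def word_to_regex(pattern: str) -> str:
    """Translate one simple-query word into a Python regex (approximation)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the next character literal.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "?":
            out.append(".")       # exactly one character
        elif ch == "*":
            out.append(".*")      # zero or more characters
        elif ch == "+":
            out.append(".+")      # one or more characters
        elif ch == "[":
            # [x,y] lists alternatives; an empty alternative is allowed.
            end = pattern.index("]", i)
            alternatives = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = end
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def find_matches(query: str, tokens: list[str]) -> list[list[str]]:
    """Return every run of tokens matching the query, ignoring case."""
    regexes = [re.compile(word_to_regex(w), re.IGNORECASE) for w in query.split()]
    n = len(regexes)
    return [
        tokens[start:start + n]
        for start in range(len(tokens) - n + 1)
        if all(r.fullmatch(t) for r, t in zip(regexes, tokens[start:start + n]))
    ]


QUERY = r"A[m,n,] [bheil,robh,bi,biodh] + agad \?"
print(find_matches(QUERY, ["Am", "bheil", "airgiod", "agad", "?"]))
print(find_matches(QUERY, ["A", "'", "bheil", "ceaird", "agad", "?"]))
```

The first call finds the whole question. The second finds nothing, because the apostrophe sits as its own token between A and bheil.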

It won’t find any instances starting with A’ though – and because the apostrophe is treated as a separate token, adding it to the list of the characters after A (like eg. A[m,n,',]) won’t help. But you can use the parentheses ( and ) to group tokens and the syntax provides several additional tools to operate on those groups:

  • | (the pipe character) inside a group allows you to list alternative strings of tokens, eg. (cat mòr | cù beag) will find either cat mòr or cù beag, but not cat beag or cù mòr, etc.;
  • and after a group:
      • ? to make the group optional, eg. ri (a)? chèile will find both spellings for ‘to/with each other’: ri chèile and ri a chèile;
      • * to allow the group to be repeated any number of times, including zero;
      • + to require the group to match at least once but allow it to repeat;
      • {min, max} to require the group to be repeated between min and max times (where min and max must be numbers), eg. (b*){3,5} will match any string of at least 3 and at most 5 words starting with the letter b.

So you can do:

(A ' | A[m,n,]) [bheil,robh,bi,biodh] + agad \?

or

A[m,n,] (')? [bheil,robh,bi,biodh] + agad \?

and they both will find A’ bheil ceaird agad? written in the corpus as A ' bheil ceaird agad ? too. That’s because the first query allows anything that starts with either A ' or any of Am, An, or A, and the second query allows anything starting with A[m,n,] optionally followed by ' – they both allow A ' at the beginning.
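The two queries can be compared by translating them by hand into regular expressions over tokens joined with single spaces. The Python sketch below does this; the translation is my own approximation (\S+ stands for one arbitrary token), and the third test string is constructed for the comparison rather than taken from the corpus.

```python
import re

# Hand translation of the two grouped queries. Tokens are assumed to be
# joined by single spaces, and \S+ stands for exactly one token.
QUERY_1 = re.compile(
    r"(?:A '|A(?:m|n|)) (?:bheil|robh|bi|biodh) \S+ agad \?", re.IGNORECASE
)
QUERY_2 = re.compile(
    r"A(?:m|n|)(?: ')? (?:bheil|robh|bi|biodh) \S+ agad \?", re.IGNORECASE
)

samples = [
    "A ' bheil ceaird agad ?",
    "Am bheil airgiod agad ?",
    "An ' robh fhios agad ?",  # constructed string, not a corpus example
]

for sample in samples:
    in_first = QUERY_1.fullmatch(sample) is not None
    in_second = QUERY_2.fullmatch(sample) is not None
    print(f"{sample!r}: query 1 = {in_first}, query 2 = {in_second}")
```

Both patterns accept the first two strings. Only the second accepts the constructed third one, since its optional apostrophe may follow Am and An as well as A. In that respect the second query is slightly broader than the first.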

Restricted query

The simple query syntax can also be used in restricted queries (switch to the ⟨Restricted Query⟩ tab above the query form). They allow you to restrict the set of texts to be searched. This way you can for example limit the search to the time period you’re interested in, or to the geographical origin of the text (for example to see if a given expression exists in a specific dialect), etc.

For example, if you’re interested in how the use of demonstratives like sin ‘that’ and seo, so ‘this’ with the compound preposition airson ‘for’ (which can also be spelt air son) changed through time – specifically how the ratio of forms like air a shon sin to forms like airson sin changed over the centuries – you could write queries like these:

  1. the form with a possessive pronoun:
    air a shon (sin | s[e,]o | sud)
  2. and without the possessive:
    (airson | air son) (sin | s[e,]o | sud)

The first one will catch all instances with the possessive, air a shon sin ‘for that’, air a shon so, air a shon seo ‘for this’, etc. The second one will catch the ones without the possessive: airson sin, air son so, airson seo, airson sud, etc.

Now you run those queries in the ⟨Restricted Query⟩ tab, first choosing the time period using the ⟨Date of Language⟩ button. The time periods are divided into three parts per century, eg. Early 19th c., Mid 19th c., Late 19th c., etc. You can choose multiple time periods (eg. to find all instances in a given century) or do a more fine-grained search.

Running the above queries over all periods from the Early 18th c. one by one gives these results:

  • Early 18th c.
      • possessive (air a shon sin, etc.): 1 match in 1 text,
      • no possessive (airson sin, etc.): 0 matches,
  • Mid 18th c.
      • possessive: 17 matches in 1 text,
      • no possessive: 1 match in 1 text,
  • Late 18th c.
      • possessive: 29 matches in 8 texts,
      • no possessive: 38 matches in 9 texts,
  • Early 19th c.
      • possessive: 33 matches in 7 texts,
      • no possessive: 24 matches in 8 texts,
  • Mid 19th c.
      • possessive: 42 matches in 16 texts,
      • no possessive: 62 matches in 15 texts,
  • Late 19th c.
      • possessive: 289 matches in 26 texts,
      • no possessive: 212 matches in 28 texts,
  • Early 20th c.
      • possessive: 81 matches in 31 texts,
      • no possessive: 108 matches in 29 texts,
  • Mid 20th c.
      • possessive: 33 matches in 13 texts,
      • no possessive: 72 matches in 22 texts,
  • Late 20th c.
      • possessive: 77 matches in 56 texts,
      • no possessive: 200 matches in 111 texts,
  • Early 21st c.
      • possessive: 3 matches in 3 texts,
      • no possessive: 42 matches in 12 texts.

We can see that we get far fewer results in the 18th and early 19th centuries – the corpus has more data in later periods. We also see that in the earliest periods the forms with the possessive are at least as popular as the ones without it – the numbers of occurrences of both constructions are similar throughout the 19th century. The form with the possessive seems to be more common in the early and mid-18th century, but that might just be an artifact of the data (only one text in each period), and the late 18th c. shows a slight advantage for the possessive-less form.

At any rate, it’s clear from the data that forms like air a shon sin were still in common use throughout the 20th century, coexisting with airson sin, but they had almost disappeared by the early 21st century (although one can still find a couple of examples).
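Since the number of matches differs so much between periods, the trend is easier to read as proportions than as raw counts. The short Python sketch below computes, for each period, the share of matches that use the possessive. The counts are the ones listed above; the names counts and possessive_share are only for this illustration.

```python
# (matches with the possessive, matches without it) per period,
# copied from the query results listed above.
counts = {
    "Early 18th c.": (1, 0),
    "Mid 18th c.": (17, 1),
    "Late 18th c.": (29, 38),
    "Early 19th c.": (33, 24),
    "Mid 19th c.": (42, 62),
    "Late 19th c.": (289, 212),
    "Early 20th c.": (81, 108),
    "Mid 20th c.": (33, 72),
    "Late 20th c.": (77, 200),
    "Early 21st c.": (3, 42),
}


def possessive_share(with_poss: int, without_poss: int) -> float:
    """Fraction of all matches in a period that use the possessive form."""
    return with_poss / (with_poss + without_poss)


shares = {period: possessive_share(*pair) for period, pair in counts.items()}

for period, share in shares.items():
    total = sum(counts[period])
    print(f"{period:<14} {share:6.1%}  (of {total} matches)")
```

By this measure the possessive form accounts for roughly 58% of matches in the late 19th century, about 28% in the late 20th century and under 7% in the early 21st century. The shares for the early and mid-18th century rest on a single text each, so they should not be given much weight.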

CQP syntax

If you need more flexibility with your queries, you can use the CQP query syntax. This is only available in the ⟨Standard Query⟩ form; to use it you need to change the ⟨Query mode⟩ to ⟨CQP syntax⟩. CQP queries are not available in restricted queries.

This is basically the same syntax as CQL in the Irish Foclóir corpus, as the software used in the Irish corpus originally based its query system on the CWB query system. But since the DASG corpus does not have any POS-tagging, CQP syntax doesn’t give you as much power as CQL gives in the Irish corpus.

Here also every single token – word or punctuation – is treated as a separate entity with textual attributes, but they have only one attribute available – the word attribute with the form as it appears verbatim in the text. To match any single token you write [] – this query will match every single word and punctuation mark in the texts. You can add conditions on the attributes inside the brackets to restrict it. For example you can list the attribute name, the = character, and a regex to find all words matching the regex, eg. [word = "[pP]h?iuthar"] to find all instances of the word piuthar, lenited or not lenited, beginning with a lowercase or uppercase P.

word is also the default attribute. That means that to query it with plain regexes you can omit the brackets completely: "[pP]h?iuthar" will find exactly the same tokens as [word = "[pP]h?iuthar"]. You can also use the %c flag after a regex to make it case-insensitive: [word = "ph?iuthar"%c] and "ph?iuthar"%c will find “piuthar”, “PHIUTHAR”, “Piuthar”, etc.
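The behaviour of these regexes can be tried out offline with Python’s re module. This is only a sketch: it assumes that a CQP regex has to match the whole token (hence re.fullmatch) and that %c behaves like a case-insensitive flag. The helper name cqp_word_match is made up, and the corpus’s regex engine is not Python’s, although simple patterns like this one should behave the same.

```python
import re


def cqp_word_match(regex: str, token: str, ignore_case: bool = False) -> bool:
    """Imitate [word = "regex"] on a single token (hypothetical helper).

    ignore_case=True stands in for the %c flag.
    """
    flags = re.IGNORECASE if ignore_case else 0
    # fullmatch: the regex must cover the whole token, not just a part of it.
    return re.fullmatch(regex, token, flags) is not None


for token in ["piuthar", "Phiuthar", "PHIUTHAR", "peathraichean"]:
    plain = cqp_word_match("[pP]h?iuthar", token)
    with_c = cqp_word_match("ph?iuthar", token, ignore_case=True)
    print(f"{token:<14} [pP]h?iuthar: {plain!s:<6} ph?iuthar %c: {with_c}")
```

Under these assumptions only the version with the case-insensitive flag accepts the all-capitals PHIUTHAR, and neither accepts the plural peathraichean.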

If you want to match a token verbatim (and not as a regex), you can add %l after it, eg. "."%l will match literal dots in the texts.

To search for longer phrases you can separate tokens with a space between them: "tha" %c [] "agam" %c finds all instances of “tha X agam” where “X” stands for any word (or punctuation mark). You can also group tokens and perform operations on them: ("tha"%c | "a"%c "[^a-zA-Z]"? "bheil"%c) [] "agam" %c will find both statements like “tha X agam” and questions like “A bheil X agam?” and “A' bheil X agam?” – the example uses "[^a-zA-Z]" (ie. not a letter) to match the apostrophe (the ' character) as I can’t find another way to match it in the CQP syntax in the DASG interface.
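It is worth keeping in mind what else the "[^a-zA-Z]" workaround can match. The Python sketch below tests the same character class against a few single tokens. It assumes whole-token matching and that an accented letter counts as one character; I have not verified how the DASG interface handles this, so treat the output as an indication only.

```python
import re

# The workaround regex: one character that is not an unaccented Latin letter.
NOT_ASCII_LETTER = re.compile(r"[^a-zA-Z]")

candidates = [
    "'",        # plain apostrophe (the intended match)
    "\u2019",   # typographic apostrophe
    ",",
    "?",
    "7",
    "\u00e0",   # the one-letter word "à", written as a single code point
    "a",
    "an",
]

# fullmatch: the whole token has to consist of that one character.
matched = [token for token in candidates if NOT_ASCII_LETTER.fullmatch(token)]
print(matched)
```

Here every single-character token except plain a is accepted, including other punctuation marks, digits and the accented à, while the two-letter an is rejected. Few of these are likely to occur between a and bheil, but it is a reason to skim the results of such a query for stray matches.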

You can also add {x, y} after a group to require it to be repeated between x and y times, eg. ("aig" [] []){2,3} will match phrases like “aig Padruig agus aig a nighin”.

External documentation