CorpAfroAs, a Corpus for spoken Afroasiatic languages

Access the Corpus

  • On first access, the CorpAfroAs corpus page displays on the left a list of the languages involved in the project, grouped by family. When a language is checked, the files corresponding to this language are listed on the right.
  • The files of the corpus are identified by their titles and by their filenames (identifiers). Each item in the corpus is linked to an audio file and its corresponding ELAN annotation file, except for the PDF files, which are grammatical sketches of the language.
  • An information button will display the OLAC metadata for each pair of related files (or PDF file).
  • At the top right of the screen, there is a Register button that lets you request a login and password giving access to the corpus. To register, you must accept the Copyright and citation rules and the Ethical rules.
  • To the right of the register button, there is a Connect button allowing you to connect to the query engine to search the corpus.
  • To access the query engine, check the files you are interested in (without a connection, only a few sample files can be searched, for experimentation purposes), then click the OK button at the top of the list.
  • Once connected (login and password accepted), the annotated files you have the right to display can be shown by clicking on their identifier (e.g. ARY_AB_NARR_02).
  • The files you have the right to search are preceded by a checkbox.
  • The files whose checkboxes are checked form the domain in which the search will be carried out.
    To select or unselect a group of files belonging to the same language, check or uncheck the language in the list on the left. To select or unselect all the files belonging to a language family, click on the name of the family in the list on the left.
  • Other buttons appear depending on your specific rights.
  • The ELAN icon at the end of each filename (depending on your access rights) downloads the corresponding ELAN file. You must accept the downloading rules before you can download the file. As ELAN files are text files, the content of the file may be displayed in your browser; use File, Save as to save it on your computer.
  • The WAV icon (depending on your access rights) downloads the corresponding audio file. The file will open in your sound player; use File, Save as to save it on your computer.
  • To search the sub-corpus defined by the checked files, click the OK button at the top of the list.
  • The lists and concordances form

    The query engine used for the CorpAfroAs corpus is based on the mfSearch package from the Max Planck Institute for Psycholinguistics, Nijmegen. The Search interface presents three areas:
  • The Search domain previously defined (here 17 files). When the mouse is moved over this area, the list of files is displayed in a popup.

  • A Concordance and Lists area.
  • The List button creates the list of the different words (mot), morphemes (mb), glosses (ge) or categories (rx) of the sub-corpus defined by the search domain, with their number of occurrences. Depending on the order chosen, this list is either alphabetical or arranged by decreasing number of occurrences.

    From the list, one can access the occurrences of an item by clicking on its value (here RECP).

    From that page, it's possible to

    • display an annotated prosodic unit containing the item by clicking on its identifier. The unit can then be played.
    • select different annotation units containing the item to display them in a new page by clicking the Show selected items button.
  • The Concordance button creates a list of the words matching the regular expression given in the word concordances box, with their left and right contexts. The words matching the regular expression are centered on the page.
    The prosodic units in which these words appear can be displayed (and then listened to) one by one by clicking on their identifier. (Here, the concordance of the word 'o:mhi:n', then the display of the prosodic unit BEJ_MV_NARR_02_farmer_313)

    The unit display can then be extended (in terms of prosodic units) on both sides by giving the number of units desired on the left and on the right, then clicking the Extend display button.
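
    As a rough illustration of what the List and Concordance buttons compute, here is a minimal Python sketch. It is not the query engine's code: the sample units, the function names frequency_list and concordance, and the whitespace tokenization are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample: one string per prosodic unit, words separated by spaces.
units = [
    "the farmer saw the field",
    "the field was dry",
    "a farmer came",
]

def frequency_list(units, by_count=True):
    """Return (word, count) pairs, by decreasing count or alphabetically."""
    counts = Counter(word for unit in units for word in unit.split())
    if by_count:
        # Ties are broken alphabetically so the order is deterministic.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(counts.items())

def concordance(units, pattern):
    """Return (left context, matched word, right context) for each matching word."""
    regex = re.compile(pattern)
    hits = []
    for unit in units:
        words = unit.split()
        for i, word in enumerate(words):
            if regex.search(word):
                hits.append((" ".join(words[:i]), word, " ".join(words[i + 1:])))
    return hits

for left, word, right in concordance(units, r"^f"):
    # Right-align the left context so the matched words line up in one column.
    print(f"{left:>20}  {word}  {right}")
```

    Calling frequency_list(units, by_count=False) gives the alphabetical ordering instead of the ordering by decreasing number of occurrences.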

  • The Search form
    • case sensitive : uppercase and lowercase are not equivalent
    • regular expression : how the search targets and contexts must be interpreted (cf. bottom of the page)
    • minimal duration : search only in units of this minimal duration (0 = any duration)
    • maximal duration : search only in units of this maximal duration (0 = any duration)
  • Command-line searching : in this box, one can write a query in the specific CorpA query language. In the screenshot below, the search is: look for the label OBL in the gloss tier type (ge), fully aligned with anything (.) in the morpheme tier type (mb).
  • The graphical search interface : it is the same graphical interface as ELAN's multiple-file, multiple-layer search.
  • Target : the searched sequence. Don't forget to specify, to the right of the layer, the tier (or tier type) in which you want to search for this sequence (morpheme, word, gloss or category tiers...). (The screenshot shows how to express, in this graphical interface, the same request as in the command-line search.)

    Multiple layer search
    You can refine your initial search by adding vertical constraints in the layer below. In this case, you will have to choose the type of constraint you want to impose on the targets.

    • Fully aligned : both annotation cells must have the same temporal duration
    • Inside : the upper cell must be a child of the bottom one (like 'ge' child of 'mb')
    • Within : the upper cell must be a parent of the bottom one (like 'mot' parent of 'mb')
    • Overlap : there must be a temporal overlap between the upper cell and the bottom cell
    For example, after having found 1566 morphemes tagged as 'demonstrative' in the corpus ('DEM' in 'ge' type tiers), one would like to know how many are of 'proximal' type ('PROX' in 'rx' type tiers).
      Target
      \bDEM\b     Tier Type: rx
      fully aligned
      \bPROX\b    Tier Type: ge
    '\b' means a word boundary (cf. regular expressions below)
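
    The four vertical constraints can be pictured as relations between time intervals. The Python sketch below is only an approximation: the Cell class, the millisecond time values and the choice to model parent/child links purely by time containment are assumptions for illustration, not the engine's actual data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    """An annotation cell: a label plus start and end times (assumed to be in ms)."""
    label: str
    start: int
    end: int

def fully_aligned(upper, lower):
    # Both cells cover exactly the same time span.
    return upper.start == lower.start and upper.end == lower.end

def inside(upper, lower):
    # The upper cell lies within the span of the lower one.
    return lower.start <= upper.start and upper.end <= lower.end

def within(upper, lower):
    # The lower cell lies within the span of the upper one.
    return inside(lower, upper)

def overlap(upper, lower):
    # The two spans share some stretch of time (touching ends do not count).
    return upper.start < lower.end and lower.start < upper.end

# Hypothetical cells: one word ('mot') made of two morphemes ('mb'),
# the first morpheme carrying a gloss ('ge').
mot = Cell("word", 0, 900)
mb1 = Cell("morpheme-1", 0, 600)
mb2 = Cell("morpheme-2", 600, 900)
ge1 = Cell("gloss-1", 0, 600)

print(fully_aligned(ge1, mb1))  # gloss and morpheme share the same span
print(inside(mb1, mot))         # morpheme contained in its word
print(within(mot, mb2))         # word contains the morpheme
print(overlap(mb1, mb2))        # adjacent morphemes only touch
```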

    (In the second layer of the previous screenshot, the regular expression '.' (meaning any sequence of characters), searched for in the 'mb' type tiers, is not actually a constraint but a means of capturing the various values (=ha, -ti, =u...) of the morphemes tagged as 'OBL', as illustrated by the hits screenshot.)

    Context : It is possible to add constraints regarding the context of the target (horizontal constraints), i.e. on the left and right environment of the searched target, at a fixed distance (=x) or a limited one (<x) in number of annotations.
    For example, to find the nouns ('N' in 'rx' type tiers) directly followed by a determiner ('DET' in 'rx' type tiers), one will search:

      Target    Distance    Right Context
      \bN\b     = 1         \bDET\b          Tier Type: rx
    = 1 between Target and Right Context means 'at the distance of one annotation to the right'.
    = 0 would mean 'in the same annotation'
    < 2 between Left Context and Target would mean 'the annotation containing the left context sequence is zero or one annotation away from the target, which stands to its right'.
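
    The distance operators can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function find_with_context, its parameters and the sample tier are hypothetical, and a tier is simplified to a flat list of annotation values.

```python
import re

def find_with_context(annotations, target, context, op, x, side="right"):
    """Return the indices of annotations matching `target` that have an
    annotation matching `context` at distance x (op "=") or at any
    distance below x (op "<"), counted in annotations on the given side."""
    target_re, context_re = re.compile(target), re.compile(context)
    distances = [x] if op == "=" else range(x)
    step = 1 if side == "right" else -1
    hits = []
    for i, value in enumerate(annotations):
        if not target_re.search(value):
            continue
        for d in distances:
            j = i + step * d
            if 0 <= j < len(annotations) and context_re.search(annotations[j]):
                hits.append(i)
                break
    return hits

# Hypothetical 'rx' tier, one category label per annotation.
rx = ["DET", "N", "V", "N", "DET", "ADJ", "N"]

# Nouns directly followed by a determiner (= 1 to the right).
print(find_with_context(rx, r"\bN\b", r"\bDET\b", "=", 1))
# Nouns with a determiner at distance 0 or 1 to the left (< 2).
print(find_with_context(rx, r"\bN\b", r"\bDET\b", "<", 2, side="left"))
```

    With op "=" and x set to 0, the target and the context are looked for in the same annotation.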

    -> The Clear button clears the form content
    -> The Find button launches the search

    Regular expressions
    Regular expressions provide flexible means to match strings of text such as 'beginning with', 'ending with', 'any from a list'... By default, the sequences in the target or context boxes are searched for anywhere inside the annotations. Thus, for example, searching for the label 'PFV' will also retrieve the annotations labelled 'IPFV', 'IPFV.3SG.F'...
    \b : marks a word boundary (beginning, end, punctuation). e.g.: \bIPF\b = only 'IPF' as a whole word inside an annotation (complex or not)
    ^ : beginning of the text. e.g.: ^N = all annotations beginning with N; ^- = all annotations that are suffixes (in this corpus, prefixes have a hyphen to the right, suffixes a hyphen to the left)
    $ : end of the text. e.g.: -$ = all annotations that are prefixes
    . : any single character (if nothing follows it, it is interpreted as 'any sequence of characters')
    \. or [.] : the character '.'
    ? : the previous character or no character. e.g.: 'gr?ave' will match 'gave' and 'grave'
    [?] : the character '?'
    + : the previous character at least once. e.g.: 'me+t' will match 'met', 'meet'...
    [aeiou] : one of these vowels. e.g.: 'p[aeiou]pe' will match the words 'pape', 'pipe', 'pepe'...
    [^ptk] : any character but 'p', 't' or 'k'
    [a-h] : any letter between 'a' and 'h'
    NOT() : annotation not containing the text between the parentheses. e.g.: NOT(\.) in rx or ge = the plain annotations (not complex, i.e. without '.')
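
    Most of these patterns can be tried out with Python's re module, whose syntax for the constructs above is assumed here to behave like the engine's. Two points are engine-specific and only approximated in this sketch: NOT() is not regular expression syntax, so it is emulated by negating the search, and the reading of a lone '.' as 'any sequence of characters' is not reproduced. The sample labels and the helper names are hypothetical.

```python
import re

# Hypothetical annotation values from gloss, category and morpheme tiers.
labels = ["PFV", "IPFV", "IPFV.3SG.F", "N", "N.PL", "-ti", "ha-", "DEM.PROX"]

def matching(pattern, items):
    """Items in which the pattern is found anywhere (the default search mode)."""
    return [item for item in items if re.search(pattern, item)]

def not_matching(pattern, items):
    """Emulation of NOT(pattern): items in which the pattern is not found."""
    return [item for item in items if not re.search(pattern, item)]

print(matching(r"PFV", labels))       # also retrieves the 'IPFV' labels
print(matching(r"\bPFV\b", labels))   # 'PFV' as a whole word only
print(matching(r"^N", labels))        # annotations beginning with N
print(matching(r"^-", labels))        # suffixes: hyphen to the left
print(matching(r"-$", labels))        # prefixes: hyphen to the right
print(not_matching(r"\.", labels))    # plain annotations, without '.'
```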