





Working Group







Developing Corpora as an Authentic Language Resource

Kim Wallmach
Ben Akoh

Project Summary

To develop corpora of authentic written and spoken text in a number of key African languages, for use as a resource for localisation applications and for language-specific research.

Problem Definition

Assumptions about language need to be verified against authentic texts. Language research cannot be carried out or improved on the basis of assumptions alone; hence the need for a baseline database of text, whether spoken or written, that is available on computer in a format that allows analysis with computer tools.

Developing corpora is expensive, time-consuming and large in scale. Authentic text, whether written or spoken, is needed not only to preserve languages but also to enable ICT-based localisation and research.

At the moment, our assumptions about African languages are based on very small samples or on simple intuitions. To draw general conclusions about the character of a language, one needs massive resources. For instance, in the UK there is the British National Corpus, which consists of about 100 million words and gives researchers access to an authoritative, verifiable resource of English.


  • To design the corpus architecture in such a way that the product is replicable across all participating language corpora, so that comparison and analysis are possible, and so that small, short-term projects (including parallel corpora) can be incorporated into a larger project in the longer term.
  • It should be open source and searchable online.
  • The database needs to be maintained technically.

Transcription from natural spoken scenarios – funerals, meetings, oral literature, etc. – as well as the collation of a number of specified genres of written text. Intellectual Property Rights [IPR] must be taken into account so that both academics and commercial companies have access.
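As an illustration only, a shared record layout of the kind the architecture goal implies might look like the sketch below. The class name, field names and the counting helper are hypothetical and not part of the proposal; the sample strings are placeholders, not real language data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusText:
    """One text in the corpus; the same fields are used for every language."""
    language: str  # e.g. "Zulu"
    mode: str      # "spoken" or "written"
    genre: str     # e.g. "funeral", "meeting", "oral literature"
    text: str      # transcribed or digitized content
    rights: str    # IPR status (hypothetical free-text field)

    def __post_init__(self):
        # Reject records that do not fit the shared layout.
        if self.mode not in ("spoken", "written"):
            raise ValueError(f"unknown mode: {self.mode!r}")


def words_per_language(texts):
    """Count whitespace-separated tokens per language.

    Caution: such counts are not directly comparable between languages
    with conjunctive and disjunctive orthographies.
    """
    counts = Counter()
    for t in texts:
        counts[t.language] += len(t.text.split())
    return counts


sample = [
    CorpusText("Zulu", "spoken", "meeting", "w1 w2 w3", "cleared"),
    CorpusText("Northern Sotho", "written", "news", "w1 w2 w3 w4 w5", "cleared"),
]
print(words_per_language(sample))
```

Because every language uses the same fields, a small single-language project could later be merged into a larger collection without restructuring its records.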

Languages and regions

  • Southern Africa
    • Zulu (Bantu language with conjunctive orthography (concatenated words))
    • Xhosa (Bantu language with conjunctive orthography (concatenated words))
    • Northern Sotho (Bantu language with disjunctive orthography)
    • Afrikaans
  • West Africa
  • East Africa
    • Swahili (Bantu language with conjunctive orthography)
    • Amharic
    • Gikuyu [NB: this was not in the original proposal, but it has a large and growing amount of written resources to draw from]
    • Luganda [NB: this was not in the original proposal, but it is a focus language of a localisation group in Uganda]
    • Kinyarwanda [NB: this was not in the original proposal, but there is an active localisation effort for this language]
  • North Africa


  • Linguists and Researchers
    • Available online searchable resource
    • Comparative analysis and research across and within languages
    • Existence of a baseline resource for further research
  • Software developers – spell checkers, speech recognition, dictionary makers
    • Basic resource for further work on improving development
  • Actual locals
    • Language preservation
    • Increased status
    • Enhanced knowledge base
  • Translators and language professionals and terminologists
    • Better understanding of language and translation strategies and term creation for technical domains

Risks and Assumptions

  • Risks
    • It is a long-term project that requires a great deal of cooperation if the final corpus design is to be coherent
    • It requires a lot of human capacity at project management level, as well as technical resources
  • Assumptions
    • There must be sufficient software and hardware
    • There exist trained transcribers (for spoken text) and collators and proofreaders (for written text) for each language with the necessary linguistic knowledge
    • There should be someone skilled enough to be in charge of the entire corpus architecture, as well as regional coordination of corpora
    • A lot of technical expertise is required for the actual corpus design

Time Frame
Stage 1 (3 months): Recording oral events, choosing and collating written texts; liaising with CELHTO
Stage 2 (1 year): Transcribing/digitizing and proofreading
Stage 3 (3 months): Collation into coherent corpus
Stage 4 (6 months to 1 year): Tagging where necessary
Stage 5 (Ongoing): Verification and Improvement
On average, it would take approximately 2 years to create an initial workable corpus that can be further improved.
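The 2-year estimate can be checked by adding up the stage durations. The short sketch below does that arithmetic; it assumes the stages run one after another, leaves out the ongoing Stage 5, and uses shortened stage labels chosen here for illustration.

```python
# Stage durations in months, as (label, minimum, maximum).
# Stage 4 is given as a range; Stage 5 is ongoing and is left out.
STAGES = [
    ("Recording and collating", 3, 3),
    ("Transcribing/digitizing and proofreading", 12, 12),
    ("Collation into coherent corpus", 3, 3),
    ("Tagging where necessary", 6, 12),
]


def total_months(stages):
    """Return the (minimum, maximum) total duration in months."""
    low = sum(lo for _, lo, _ in stages)
    high = sum(hi for _, _, hi in stages)
    return low, high


low, high = total_months(STAGES)
print(f"{low / 12:.1f} to {high / 12:.1f} years")  # prints "2.0 to 2.5 years"
```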

The overall project would take longer depending on the corpora that are included in the project.

Training and capacity building would be ongoing as development happens. This could affect the project timelines.

Short Term
Single-language corpora – 2 to 3 years to develop (anything smaller would not be authoritative enough)

Medium to Long Term
If the design is coherent – 10 years to develop



Page last modified on 2019-07-21 05:44