A corpus (plural: corpora) is a body or collection of materials in a language that provides data on that language for various uses. In localization it is helpful for developing dictionaries and spell checkers, and essential for more advanced applications such as machine translation.

Some definitions

[NB: this section needs significant work]


"A large body of natural language text used for accumulating statistics on natural language text. Corpora often include extra information such as a tag for each word indicating its part-of-speech and perhaps the parse tree for each sentence." (MultiLingual)
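As a rough illustration of what "accumulating statistics" can mean in practice, the sketch below counts word and part-of-speech frequencies in a tiny tagged corpus. The sample sentences, the tag labels (DET, NOUN, VERB), and the names TAGGED_CORPUS, word_frequencies and tag_frequencies are all assumptions made up for this example; they are not taken from any real corpus or tagset.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, part-of-speech) pairs.
# The tag labels are illustrative only, not a specific published tagset.
TAGGED_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def word_frequencies(corpus):
    """Count how often each word form occurs across all sentences."""
    return Counter(word for sentence in corpus for word, _tag in sentence)


def tag_frequencies(corpus):
    """Count how often each part-of-speech tag occurs across all sentences."""
    return Counter(tag for sentence in corpus for _word, tag in sentence)


if __name__ == "__main__":
    # Print the raw counts; a real corpus would hold far more text.
    print(word_frequencies(TAGGED_CORPUS))
    print(tag_frequencies(TAGGED_CORPUS))
```

Counts of this kind are one possible starting point for the dictionary and spell-checker work mentioned above, since they show which word forms actually occur and how often.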

Language corpus

"The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (written, spoken, or a mixture of the two), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified." (TEI)

Text corpus

"In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). They are used to do statistical analysis, checking occurrences or validating linguistic rules on a specific universe.

"A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora." (Wikipedia; emphasis in original)
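To make the idea of an aligned parallel corpus more concrete, here is a minimal sketch that stores sentence pairs side by side and retrieves the target-language sentences aligned with any source sentence containing a given word. The English and French sentences and the names AlignedPair, PARALLEL_CORPUS and targets_for are invented for illustration; real parallel corpora use richer formats and are far larger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence-level alignment: a source sentence and its translation."""
    source: str
    target: str


# Hypothetical English-French sentence pairs, made up for this example.
PARALLEL_CORPUS = [
    AlignedPair("The house is small.", "La maison est petite."),
    AlignedPair("The house is old.", "La maison est vieille."),
    AlignedPair("The book is old.", "Le livre est vieux."),
]


def targets_for(word, corpus):
    """Return target sentences whose aligned source sentence contains `word`."""
    word = word.lower()
    return [
        pair.target
        for pair in corpus
        # Naive tokenization: lowercase, drop the final period, split on spaces.
        if word in pair.source.lower().rstrip(".").split()
    ]


if __name__ == "__main__":
    for sentence in targets_for("house", PARALLEL_CORPUS):
        print(sentence)
```

Side-by-side lookups like this are a simplified version of how aligned corpora can supply candidate translations of a word in context.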

Tools, projects, proposals

References & links