ICAME Collection Of English Language Corpora
I. BRIEF OVERVIEW OF STRUCTURE AND CONTENT:
These corpora are distributed through the International Computer Archive of Modern English (ICAME), an organization of linguists and information scientists working with English machine-readable texts. ICAME has three aims: to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on that material; to compile an archive of English text corpora in machine-readable form; and to make the material available to research institutions.
Each corpus was produced by a different research team, as explained below.
- The Brown Corpus
The Brown Corpus was compiled in the early 1960s at Brown University, USA, under the direction of W. Nelson Francis and Henry Kucera. It contains 500 text samples of some 2,000 words each, representing 15 categories of American English texts printed in 1961.
- The LOB Corpus
The Lancaster-Oslo/Bergen (LOB) Corpus was compiled in the 1970s under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo. It is a British English counterpart of the Brown Corpus and contains 500 text samples selected from texts printed in Great Britain in 1961. A list of the tags used in the corpus is included on the disc, which contains both tagged and untagged versions of the LOB Corpus.
- The Kolhapur Corpus
The Kolhapur Corpus is an Indian English counterpart of the Brown and LOB corpora, compiled under the direction of S. V. Shastri, Shivaji University, Kolhapur. It contains 500 text samples selected from English texts printed in India in 1978. The Kolhapur Corpus contains the same text categories as the British and American counterparts, but the weighting and the internal structure of some of the text categories are somewhat different, due to inherent differences in the Indian situation. A list of the texts in the Kolhapur Corpus is included on the disc.
- The London-Lund Corpus
The London-Lund Corpus contains 100 spoken English texts of some 5,000 words each, collected and transcribed at the Survey of English Usage, University College London, under the direction of Randolph Quirk.
The texts in the corpus are transcribed orthographically, with detailed prosodic marking. They represent a range of text categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc. For a full description of the corpus, see Svartvik (1990, 1992).
Note that the London-Lund Corpus contained 87 texts when it was first made available. The 13 texts added since then are included in all versions of the corpus found on this disc.
- The Helsinki Corpus of English Texts: Diachronic Part
This corpus was compiled at the University of Helsinki, under the direction of Matti Rissanen.
The corpus consists of a selection of texts covering the Old, Middle, and Early Modern English periods, totalling 1.5 million words.
II. USING ICAME CORPORA:
The CD-ROM contains ASCII files, with different versions for PC, Macintosh, and UNIX environments. No search or indexing programs are included. To use the data, you must import the ASCII files into a program (such as a word processing program) that allows you to search or otherwise manipulate the files.
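Because the disc ships no search tools, a short script is one alternative to a word processor. The following is a minimal sketch in Python, not a tool supplied with the disc. It assumes a corpus file can be read as plain ASCII text and makes no attempt to interpret the tags, prosodic marks, or reference codes that some of the corpora contain. The function names kwic and search_file and the sample sentence are illustrative only.

```python
import re
from pathlib import Path


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for each whole-word match of keyword."""
    # Collapse line breaks and runs of spaces so context windows read cleanly.
    flat = " ".join(text.split())
    # \b limits matches to whole words; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


def search_file(path, keyword, width=30):
    """Read one ASCII corpus file and return its concordance lines."""
    # errors="replace" keeps the search going if a stray non-ASCII byte appears.
    text = Path(path).read_text(encoding="ascii", errors="replace")
    return kwic(text, keyword, width)


# Small in-memory demonstration; the sentence is invented for illustration.
sample = "The aim is to collect English language material.\nLanguage corpora help."
for line in kwic(sample, "language", width=20):
    print(line)
```

To search a file copied from the disc, call search_file with the path to that file and the word of interest. Any corpus-specific markup will appear in the context windows exactly as it is stored in the file.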
If you wish to use the corpora, please ask an Etext Center staff member for further assistance.