English corpora
|
| This page offers short descriptions of the most widely known English language corpora. To find out more about any of these, click on the corpus title. This will take you to the homepage (or manual) of the corpus.
|
|
BROWN Corpus
| Developer: |
Nelson Francis and Henry Kucera at Brown University, Providence, Rhode Island |
| Collection date: |
1960s |
| Size: |
1 million words |
| Contents: |
written language; 500 text samples of approx. 2,000 words; 15 text categories |
| Annotation: |
untagged and tagged versions; POS tagging |
| Availability: |
ICAME CD |
CSPAE - Corpus of Spoken Professional American English
| Developer: |
Michael Barlow at Athelstan and Rice University, Houston/TX |
| Collection date: |
1994-1998 |
| Size: |
2 million words (2 sub-corpora, approx. 1 million words each) |
| Contents: |
academic discourse and White House briefings; short interchanges by 400 speakers |
| Annotation: |
untagged and tagged versions; POS tagging based on the CLAWS tagset developed at Lancaster University |
| Availability: |
from Athelstan |
FROWN - Freiburg BROWN Corpus of American English
| Developer: |
Christian Mair at the University of Freiburg |
| Collection date: |
1990s |
| Size: |
1 million words |
| Contents: |
matches the original Brown corpus |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
MICASE - Michigan Corpus of Academic Spoken English
| Developer: |
R. C. Simpson, S. L. Briggs, J. Ovens, and J. M. Swales at the English Language Institute, University of Michigan |
| Collection date: |
since 1997, ongoing |
| Size: |
1.7 million words |
| Contents: |
transcripts and audio files of academic speech |
| Annotation: |
discourse annotation |
| Availability: |
freely available on the Web (here) |
SUSANNE Corpus
| Developer: |
Geoffrey Sampson at the University of Essex |
| Collection date: |
1960s |
| Size: |
130,000 words |
| Contents: |
subset of BROWN Corpus |
| Annotation: |
POS tagged and syntactically parsed subset of BROWN |
| Availability: |
freely available on the Web (here) |
|
|
ACE - Australian Corpus of English
| Developer: |
Pam Peters, Peter Collins and David Blair at Macquarie University, Sydney |
| Collection date: |
1986 |
| Size: |
1 million words; 500 text samples of approx. 2,000 words |
| Contents: |
written and spoken language; modelled on LOB and BROWN |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
|
|
Bank of English (Collins Cobuild)
| Developer: |
John Sinclair and his team at the University of Birmingham and HarperCollins |
| Collection date: |
1980s |
| Size: |
more than 450 million words in 2005, growing |
| Contents: |
one of the largest English corpora; a 'monitor' corpus (i.e. continually growing); originally collected as a basis for the COBUILD dictionary and continually expanded since then; originally 75% written and 25% spoken language; 70% British, 20% American, 5% other varieties; contains entire texts rather than samples; covers a wide cross-section of contemporary English (more info) |
| Annotation: |
POS tagged |
| Availability: |
free search in Collins WordBanks online, a 56 million word subset of the BoE |
BNC - British National Corpus
COLT - Bergen Corpus of London Teenage Language
| Developer: |
University of Bergen, Norway |
| Collection date: |
1993 |
| Size: |
500,000 words |
| Contents: |
transcripts of spoken language of London teenagers (COLT is part of the BNC) |
| Annotation: |
POS tagging |
| Availability: |
ICAME CD |
CHRISTINE Corpus
| Developer: |
Geoffrey Sampson at the University of Essex |
| Collection date: |
1990s |
| Size: |
100,000 words |
| Contents: |
informal spoken language (taken from BNC) |
| Annotation: |
POS tagged and syntactically parsed subset of spoken part of BNC |
| Availability: |
freely available on the Web (here) |
FLOB - Freiburg-LOB Corpus of British English
| Developer: |
Christian Mair at the University of Freiburg |
| Collection date: |
1990s |
| Size: |
1 million words |
| Contents: |
matches the original LOB corpus |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
ICE-GB - International Corpus of English, British Component
| Developer: |
co-ordinated by Gerald Nelson at University College London |
| Collection date: |
1990-93 |
| Size: |
1 million words |
| Contents: |
written and spoken language covering a variety of genres (more info); the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide |
| Annotation: |
textual markup, discourse annotation, POS tagging, syntactic parsing (more info) |
| Availability: |
on CD |
Lancaster Parsed Corpus
| Developer: |
Roger Garside, Geoffrey Leech and Tamas Varadi at the University of Lancaster |
| Collection date: |
1978 |
| Size: |
140,000 words |
| Contents: |
parsed subcorpus of the LOB |
| Annotation: |
POS tagging, syntactic parsing |
| Availability: |
ICAME CD |
LLC - London-Lund Corpus of Spoken English
| Developer: |
Randolph Quirk and Sidney Greenbaum at University College London; Jan Svartvik at Lund University |
| Collection date: |
1960s, 1975-81, 1985-88 |
| Size: |
500,000 words |
| Contents: |
spoken language (more info); based on the Survey of English Usage (SEU, 1959, University College London) and on the Survey of Spoken English (SSE, 1975, Lund University) |
| Annotation: |
prosodic and discourse annotation |
| Availability: |
ICAME CD |
LOB - Lancaster/Oslo-Bergen Corpus
| Developer: |
compiled under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen |
| Collection date: |
1970-1978 |
| Size: |
1 million words |
| Contents: |
written language; 500 text samples of approx. 2,000 words; 15 text categories; British counterpart of Brown corpus |
| Annotation: |
untagged and tagged versions; POS tagging (CLAWS tagset) |
| Availability: |
ICAME CD |
POW - Polytechnic of Wales Corpus
| Developer: |
The Computational Linguistics Unit at University of Wales College of Cardiff |
| Collection date: |
1980s |
| Size: |
65,000 words |
| Contents: |
transcripts of spoken language of children |
| Annotation: |
POS tagging, syntactic parsing |
| Availability: |
ICAME CD |
SEC - Lancaster/IBM Spoken English Corpus
| Developer: |
University of Lancaster and IBM Scientific Centre |
| Collection date: |
1984-87 |
| Size: |
52,000 words |
| Contents: |
spoken language; transcripts from radio-broadcasts, recordings made at University of Lancaster, Open University tapes |
| Annotation: |
prosodic markup, POS tagged with CLAWS |
| Availability: |
ICAME CD |
|
|
ICE-EA - International Corpus of English, East African Component
|
|
ICE - International Corpus of English, Indian Component
Kolhapur Corpus
| Developer: |
S. K. Verma at University of Lancaster and Shivaji University, Kolhapur |
| Collection date: |
1978 |
| Size: |
1 million words, 500 text samples of approx. 2,000 words |
| Contents: |
written language; modelled on BROWN and LOB |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
|
|
ICE - International Corpus of English, New Zealand Component
Wellington Corpus
| Developer: |
Laurie Bauer at Victoria University, Wellington |
| Collection date: |
1986-90 |
| Size: |
1 million words; 500 text samples of approx. 2,000 words |
| Contents: |
written language; modelled on BROWN and LOB |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
Wellington Corpus of Spoken New Zealand English
| Developer: |
Janet Holmes, Bernadette Vine and Gary Johnson at Victoria University, Wellington |
| Collection date: |
1988-94 |
| Size: |
1 million words; 500 text samples of approx. 2,000 words |
| Contents: |
spoken language; formal, semi-formal and informal speech |
| Annotation: |
discourse markup |
| Availability: |
ICAME CD |
|
|
ICE - International Corpus of English, Philippine Component
|
|
ICE - International Corpus of English, Indian Component
|
English as a Lingua Franca
|
|
VOICE - Vienna-Oxford International Corpus of English
| Developer: |
Barbara Seidlhofer at the University of Vienna |
| Collection date: |
since 2001 (ongoing) |
| Size: |
250,000 words to date, to be extended |
| Contents: |
spoken English; interactions in English as a lingua franca; unscripted, largely face-to-face communication among competent non-native speakers, including private and public dialogues, private and public group discussions, casual conversations and one-to-one interviews |
| Annotation: |
conversational markup |
ELFA - English as a Lingua Franca in Academic Settings
| Developer: |
Anna Mauranen at Tampere University |
| Collection date: |
ongoing |
| Size: |
0.5 million words |
| Contents: |
spoken academic English involving non-native speakers; includes various speech events (e.g. lectures, workshops, seminars, presentations) |
| Annotation: |
|
|
|
ARCHER Corpus - A Representative Corpus of Historical English Registers
| Developer: |
Northern Arizona University in co-operation with the Universities of Uppsala, Helsinki and Freiburg |
| Sampling period: |
1650-1990 |
| Size: |
1.7 million words |
| Contents: |
1,037 texts; 10 registers (e.g., drama, letters, science prose), including British and American; sampled from 7 historical periods covering Early Modern English; speech-based, popular, and specialist/academic written registers |
| Annotation: |
POS tagged |
| Availability: |
|
CEECS - Corpus of Early English Correspondence Sampler
| Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki |
| Sampling period: |
1418-1680 |
| Size: |
450,000 words |
| Contents: |
personal letters |
| Annotation: |
|
| Availability: |
ICAME CD |
Helsinki Corpus of English Texts: Diachronic Part
| Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki |
| Sampling period: |
ca. 750 to 1700 |
| Size: |
1.6 million words |
| Contents: |
Old, Middle and Early Modern English texts |
| Annotation: |
|
| Availability: |
ICAME CD |
Helsinki Corpus of Older Scots
| Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki |
| Sampling period: |
1450-1700 |
| Size: |
830,000 words |
| Contents: |
Old, Middle and Early Modern English texts |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
Lampeter Corpus of Early Modern English Tracts
| Developer: |
Josef Schmied, Claudia Claridge and Rainer Siemund at TU Chemnitz |
| Sampling period: |
1640-1740 |
| Size: |
1.1 million words |
| Contents: |
non-literary prose texts |
| Annotation: |
textual markup |
| Availability: |
ICAME CD |
Newdigate Newsletters Corpus
| Developer: |
Philip Hines, Jr., Norfolk, Virginia |
| Sampling period: |
1692 |
| Size: |
750,000 words |
| Contents: |
a series of more than 2,000 newsletters in the Newdigate series (most of which are addressed to Sir Richard Newdigate, Warwickshire) |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
|
Books and other websites containing descriptions of corpora
|
|
- Kennedy, Graeme (1998): An Introduction to Corpus Linguistics. London: Longman.
- Meyer, Charles (2002): English Corpus Linguistics: An Introduction. Cambridge: Cambridge University Press.
|
|
|