English corpora

This page offers short descriptions of the most widely known English language corpora. To find out more about any of these, click on the corpus title. This will take you to the homepage (or manual) of the corpus.

American English
Australian English
British English
East African English
Indian English
New Zealand English
Philippine English
Singapore English
English as a Lingua Franca
Historical English
Other descriptions and websites

American English

BROWN Corpus

Developer:	Nelson Francis and Henry Kucera at Brown University, Providence, Rhode Island
Collection date:	1960s
Size:	1 million words
Contents:	written language; 500 text samples of approx. 2,000 words; 15 text categories
Annotation:	untagged and POS-tagged versions
Availability:	ICAME CD
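
The sampling design listed above implies the corpus size directly: 500 samples of roughly 2,000 words each give about 1 million words. A minimal Python sketch of that arithmetic follows; the function name `nominal_size` is hypothetical and used only for illustration, and the result is approximate because actual sample lengths vary.

```python
# Nominal size implied by a BROWN-style sampling design.
# The two figures come from the entry above; the helper name is an assumption.
SAMPLES = 500              # number of text samples
WORDS_PER_SAMPLE = 2_000   # approximate words per sample


def nominal_size(samples: int, words_per_sample: int) -> int:
    """Return the approximate corpus size in words."""
    return samples * words_per_sample


brown_size = nominal_size(SAMPLES, WORDS_PER_SAMPLE)
print(f"{brown_size:,} words")  # prints "1,000,000 words"
```

The same calculation applies to the corpora below that are modelled on BROWN and list the same sample design.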

CSPAE - Corpus of Spoken Professional American English

Developer:	Michael Barlow at Athelstan and Rice University, Houston/TX
Collection date:	1994-1998
Size:	2 million words (2 sub-corpora, approx. 1 million words each)
Contents:	academic discourse and White House briefings; short interchanges by 400 speakers
Annotation:	untagged and POS-tagged versions; tagging based on the CLAWS tagset developed at Lancaster University
Availability:	from Athelstan

FROWN - Freiburg BROWN Corpus of American English

Developer:	Christian Mair at the University of Freiburg
Collection date:	1990s
Size:	1 million words
Contents:	matches the original Brown corpus
Annotation:	untagged
Availability:	ICAME CD

MICASE - Michigan Corpus of Academic Spoken English

Developer:	R. C. Simpson, S. L. Briggs, J. Ovens, and J. M. Swales at the English Language Institute, University of Michigan
Collection date:	since 1997, ongoing
Size:	1.7 million words
Contents:	transcripts and audio files of academic speech
Annotation:	discourse annotation
Availability:	freely available on the Web (here)

SUSANNE Corpus

Developer:	Geoffrey Sampson at the University of Essex
Collection date:	1960s
Size:	130,000 words
Contents:	subset of BROWN Corpus
Annotation:	POS tagged and syntactically parsed subset of BROWN
Availability:	freely available on the Web (here)

Australian English

ACE - Australian Corpus of English

Developer:	Pam Peters, Peter Collins and David Blair at Macquarie University, Sydney
Collection date:	1986
Size:	1 million words; 500 text samples of approx. 2,000 words
Contents:	written and spoken language; modelled on LOB and BROWN
Annotation:	untagged
Availability:	ICAME CD

British English

Bank of English (Collins Cobuild)

Developer:	John Sinclair and his team at the University of Birmingham and HarperCollins
Collection date:	1980s
Size:	more than 450 million words in 2005, growing
Contents:	one of the largest English corpora; a 'monitor' corpus (i.e. continually growing); originally collected as a basis for the COBUILD dictionary and continually expanded since; originally 75% written and 25% spoken language; 70% British, 20% American, 5% other varieties; contains entire texts rather than samples; covers a wide cross-section of contemporary English (more info)
Annotation:	POS tagged
Availability:	free search in Collins WordBanks online, a 56 million word subset of the BoE

BNC - British National Corpus

Developer:	OUP, Longman, British Library
Collection date:	1991-1995
Size:	100 million words
Contents:	a balanced corpus designed to cover a wide cross-section of contemporary British English; 90% written and 10% spoken language; more than 4,000 texts (samples) (more info)
Annotation:	textual markup, discourse annotation, POS tagging on the basis of the CLAWS tagset developed at Lancaster University
Availability:	on CD-ROM; also free search online (new web interface by M. Davies) and free sample search online

COLT - Bergen Corpus of London Teenage Language

Developer:	University of Bergen, Norway
Collection date:	1993
Size:	500,000 words
Contents:	transcripts of spoken language of London teenagers (COLT is part of the BNC)
Annotation:	POS tagging
Availability:	ICAME CD

CHRISTINE Corpus

Developer:	Geoffrey Sampson at the University of Essex
Collection date:	1990s
Size:	100,000 words
Contents:	informal spoken language (taken from BNC)
Annotation:	POS tagged and syntactically parsed subset of spoken part of BNC
Availability:	freely available on the Web (here)

FLOB - Freiburg-LOB Corpus of British English

Developer:	Christian Mair at the University of Freiburg
Collection date:	1990s
Size:	1 million words
Contents:	matches the original LOB corpus
Annotation:	untagged
Availability:	ICAME CD

ICE-GB - International Corpus of English, British Component

Developer:	co-ordinated by Gerald Nelson at University College London
Collection date:	1990-93
Size:	1 million words
Contents:	written and spoken language covering a variety of genres (more info); the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide
Annotation:	textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability:	on CD

Lancaster Parsed Corpus

Developer:	Roger Garside, Geoffrey Leech and Tamas Varadi at the University of Lancaster
Collection date:	1978
Size:	140,000 words
Contents:	parsed subcorpus of the LOB
Annotation:	POS tagging, syntactic parsing
Availability:	ICAME CD

LLC - London-Lund Corpus of Spoken English

Developer:	Randolph Quirk and Sidney Greenbaum at University College London; Jan Svartvik at Lund University
Collection date:	1960s, 1975-81, 1985-88
Size:	500,000 words
Contents:	spoken language (more info); based on the Survey of English Usage (SEU, 1959, University College London) and on the Survey of Spoken English (SSE, 1975, Lund University)
Annotation:	prosodic and discourse annotation
Availability:	ICAME CD

LOB - Lancaster/Oslo-Bergen Corpus

Developer:	compiled under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen
Collection date:	1970-1978
Size:	1 million words
Contents:	written language; 500 text samples of approx. 2,000 words; 15 text categories; British counterpart of Brown corpus
Annotation:	untagged and POS-tagged versions (CLAWS tagset)
Availability:	ICAME CD

POW - Polytechnic of Wales Corpus

Developer:	The Computational Linguistics Unit at University of Wales College of Cardiff
Collection date:	1980s
Size:	65,000 words
Contents:	transcripts of spoken language of children
Annotation:	POS tagging, syntactic parsing
Availability:	ICAME CD

SEC - Lancaster/IBM Spoken English Corpus

Developer:	University of Lancaster and IBM Scientific Centre
Collection date:	1984-87
Size:	52,000 words
Contents:	spoken language; transcripts from radio-broadcasts, recordings made at University of Lancaster, Open University tapes
Annotation:	prosodic markup, POS tagged with CLAWS
Availability:	ICAME CD

East African English

ICE-EA - International Corpus of English, East African Component

Developer:	Diana Hudson-Ettle, Barbara Krohne and Josef Schmied at Chemnitz University
Collection date:	1989-99
Size:	1 million words
Contents:	written and spoken Kenyan and Tanzanian English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation:	textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability:	ICAME CD or see here

Indian English

ICE - International Corpus of English, Indian Component

Developer:	co-ordinated by S.V. Shastri at Shivaji University, Kolhapur, and Gerhard Leitner, Freie Universität Berlin
Collection date:	1990-93
Size:	1 million words
Contents:	written and spoken Indian English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation:	textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability:	see here

Kolhapur Corpus

Developer:	S. K. Verma at University of Lancaster and Shivaji University, Kolhapur
Collection date:	1978
Size:	1 million words, 500 text samples of approx. 2,000 words
Contents:	written language; modelled on BROWN and LOB
Annotation:	untagged
Availability:	ICAME CD

New Zealand English

ICE - International Corpus of English, New Zealand Component

Developer:	co-ordinated by S.V. Shastri at Shivaji University, Kolhapur, and Gerhard Leitner, Freie Universitat Berlin
Collection date:	1990-93
Size:	1 million words
Contents:	written and spoken New Zealand English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation:	textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability:	see here

Wellington Corpus

Developer:	Laurie Bauer at Victoria University, Wellington
Collection date:	1986-90
Size:	1 million words; 500 text samples of approx. 2,000 words
Contents:	written language; modelled on BROWN and LOB
Annotation:	untagged
Availability:	ICAME CD

Wellington Corpus of Spoken New Zealand English

Developer:	Janet Holmes, Bernadette Vine and Gary Johnson at Victoria University, Wellington
Collection date:	1988-94
Size:	1 million words; 500 text samples of approx. 2,000 words
Contents:	spoken language; formal, semi-formal and informal speech
Annotation:	discourse markup
Availability:	ICAME CD

Philippine English

ICE - International Corpus of English, Philippine Component

Developer:	co-ordinated by Maria Lourdes S. Bautista, De La Salle University, Manila
Collection date:	1990-93
Size:	1 million words
Contents:	written and spoken Philippine English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation:	textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability:	see here

Singapore English

ICE - International Corpus of English, Singapore Component

Developer:	co-ordinated by Anne Pakir at the National University of Singapore
Collection date:	1990-93
Size:	1 million words
Contents:	written and spoken Singapore English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation:	textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability:	see here

English as a Lingua Franca

VOICE - Vienna-Oxford International Corpus of English

Developer:	Barbara Seidlhofer at the University of Vienna
Collection date:	since 2001 (ongoing)
Size:	250,000 words to date, to be extended
Contents:	spoken English; interactions in English as a lingua franca; unscripted, largely face-to-face communication among competent non-native speakers, including private and public dialogues, private and public group discussions, casual conversations, and one-to-one interviews
Annotation:	conversational markup

ELFA - English as a Lingua Franca in Academic Settings

Developer:	Anna Mauranen at Tampere University
Collection date:	ongoing
Size:	0.5 million words
Contents:	spoken academic English involving non-native speakers; includes various speech events (e.g. lectures, workshops, seminars, presentations)
Annotation:

Historical English

ARCHER Corpus - A Representative Corpus of Historical English Registers

Developer:	Northern Arizona University in co-operation with the Universities of Uppsala, Helsinki and Freiburg
Sampling period:	1650-1990
Size:	1.7 million words
Contents:	1,037 texts; 10 registers (e.g., drama, letters, science prose), including British and American; sampled from 7 historical periods covering Early Modern English; speech-based, popular, and specialist/academic written registers
Annotation:	POS tagged
Availability:

CEECS - Corpus of Early English Correspondence Sampler

Developer:	M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period:	1418-1680
Size:	450,000 words
Contents:	personal letters
Annotation:
Availability:	ICAME CD

Helsinki Corpus of English Texts: Diachronic Part

Developer:	M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period:	ca. 750 to 1700
Size:	1.6 million words
Contents:	Old, Middle and Early Modern English texts
Annotation:
Availability:	ICAME CD

Helsinki Corpus of Older Scots

Developer:	M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period:	1450-1700
Size:	830,000 words
Contents:	Old, Middle and Early Modern English texts
Annotation:	untagged
Availability:	ICAME CD

Lampeter Corpus of Early Modern English Tracts

Developer:	Josef Schmied, Claudia Claridge and Rainer Siemund at TU Chemnitz
Sampling period:	1640-1740
Size:	1.1 million words
Contents:	non-literary prose texts
Annotation:	textual markup
Availability:	ICAME CD

Newdigate Newsletters Corpus

Developer:	Philip Hines, Jr., Norfolk, Virginia
Sampling period:	1692
Size:	750,000 words
Contents:	a series of more than 2,000 newsletters in the Newdigate series (most of which are addressed to Sir Richard Newdigate, Warwickshire)
Annotation:	untagged
Availability:	ICAME CD

Books and other websites containing descriptions of corpora

Kennedy, Graeme (1998): Introduction to Corpus Linguistics. London: Longman.
Meyer, Charles (2002): English corpus linguistics: an introduction. Cambridge: CUP.

UCREL at the University of Lancaster (English corpora)
SFB 441 at the University of Tuebingen (English and other languages)
SDU at Odense University (English and other languages)
Michael Barlow's corpus page (English and other languages)
Emily Bender's corpus page (English and other languages)
Yvonne Breyer's corpus page (English and other corpora)