
English corpora

This page offers short descriptions of the most widely known English language corpora. To find out more about any of them, click on the corpus title, which links to the homepage (or manual) of the corpus.

American English 

  
  BROWN Corpus

Developer:  Nelson Francis and Henry Kucera at Brown University, Providence, Rhode Island
Collection date: 1960s

Size: 1 million words
Contents: written language; 500 text samples of approx. 2,000 words; 15 text categories
Annotation: untagged and tagged version
POS tagging
Availability: ICAME CD
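
The sampling design in the BROWN entry above (500 samples of roughly 2,000 words each) accounts for the corpus's nominal size of 1 million words. A minimal Python sketch of that arithmetic follows; the function and variable names are hypothetical and not part of any corpus tooling.

```python
def nominal_corpus_size(num_samples: int, words_per_sample: int) -> int:
    """Return the nominal word count implied by a fixed-size sampling design."""
    return num_samples * words_per_sample


# Figures taken from the BROWN entry above. Sample lengths are approximate,
# so the actual word count differs slightly from this nominal total.
brown_size = nominal_corpus_size(num_samples=500, words_per_sample=2000)
print(f"{brown_size:,} words")  # 1,000,000 words
```

The same calculation applies to the other corpora on this page that are described as 500 text samples of approx. 2,000 words.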

  
  CSPAE - Corpus of Spoken Professional American English

Developer:  Michael Barlow at Athelstan and Rice University, Houston/TX
Collection date: 1994-1998

Size: 2 million words (2 sub-corpora, approx. 1 million words each)
Contents: academic discourse and White House briefings; short interchanges by 400 speakers
Annotation: untagged and tagged version
POS tagging on the basis of the CLAWS tagset developed at Lancaster University
Availability: from Athelstan

  
  FROWN - Freiburg BROWN Corpus of American English

Developer:  Christian Mair at the University of Freiburg
Collection date: 1990s

Size: 1 million words
Contents: matches the original Brown corpus
Annotation: untagged
Availability: ICAME CD

   
 
MICASE - Michigan Corpus of Academic Spoken English

Developer:  R. C. Simpson, S. L. Briggs, J. Ovens, and J. M. Swales at the English Language Institute, University of Michigan
Collection date: since 1997, ongoing

Size: 1.7 million words
Contents: transcripts and audio files of academic speech
Annotation: discourse annotation 

Availability: freely available on the Web (here)

  
  SUSANNE Corpus

Developer:  Geoffrey Sampson at the University of Essex
Collection date: 1960s

Size: 130,000 words
Contents: subset of BROWN Corpus
Annotation: POS tagged and syntactically parsed subset of BROWN

Availability: freely available on the Web (here)

   

Australian English 

  
  ACE - Australian Corpus of English

Developer:  Pam Peters, Peter Collins and David Blair at Macquarie University, Sydney
Collection date: 1986

Size: 1 million words; 500 text samples of approx. 2,000 words
Contents: written language; modelled on LOB and BROWN
Annotation: untagged
Availability: ICAME CD

  

British English 

  
  Bank of English (Collins Cobuild)

Developer:  John Sinclair and his team at the University of Birmingham and Harper-Collins
Collection date: 1980s

Size: more than 450 million words in 2005, growing
Contents: one of the largest English corpora, a 'monitor' corpus (i.e. continually growing); originally collected as a basis for the creation of the COBUILD dictionary but since then continually expanded; originally containing 75% written and 25% spoken language, 70% British, 20% American, 5% other varieties; containing entire texts rather than samples; covering a wide cross-section of contemporary English (more info)

Annotation: POS tagged
Availability: free search in Collins WordBanks online, a 56 million word subset of the BoE

  
  BNC - British National Corpus

Developer:  OUP, Longman, British Library
Collection date: 1991-1995

Size: 100 million words
Contents: a balanced corpus designed to cover a wide cross-section of contemporary British English; 90% written and 10% spoken language; more than 4,000 texts (samples) (more info)

Annotation: textual markup, discourse annotation, POS tagging on the basis of the CLAWS tagset developed at Lancaster University
Availability: on CD-ROM; also free search online (new web interface by M. Davies) and free sample search online

  
  COLT - Bergen Corpus of London Teenage Language

Developer:  University of Bergen, Norway
Collection date: 1993

Size: 500,000 words
Contents: transcripts of spoken language of London teenagers
(COLT is part of the BNC)
Annotation: POS tagging
Availability: ICAME CD 

  
  CHRISTINE Corpus

Developer:  Geoffrey Sampson at the University of Essex
Collection date: 1990s

Size: 100,000 words
Contents: informal spoken language (taken from BNC)
Annotation: POS tagged and syntactically parsed subset of spoken part of BNC

Availability: freely available on the Web (here)

  
  FLOB - Freiburg-LOB Corpus of British English

Developer:  Christian Mair at the University of Freiburg
Collection date: 1990s

Size: 1 million words
Contents: matches the original LOB corpus
Annotation: untagged
Availability: ICAME CD

  
  ICE-GB - International Corpus of English, British Component

Developer:  co-ordinated by Gerald Nelson at University College London
Collection date: 1990-93

Size: 1 million words
Contents: written and spoken language covering a variety of genres (more info)
the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide
Annotation: textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability: on CD

  
  Lancaster Parsed Corpus

Developer:  Roger Garside, Geoffrey Leech and Tamas Varadi at the University of Lancaster
Collection date: 1978

Size: 140,000 words
Contents: parsed subcorpus of the LOB
Annotation: POS tagging, syntactic parsing
Availability: ICAME CD

  
  LLC - London-Lund Corpus of Spoken English

Developer:  Randolph Quirk and Sidney Greenbaum at University College London 
Jan Svartvik at Lund University
Collection date: 1960s, 1975-81, 1985-88

Size: 500,000 words
Contents: spoken language (more info)
based on the Survey of English Usage (SEU, 1959, University College London) and on the Survey of Spoken English (SSE, 1975, Lund University)
Annotation: prosodic and discourse annotation
Availability: ICAME CD

  
  LOB - Lancaster/Oslo-Bergen Corpus

Developer:  compiled under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen
Collection date: 1970-1978

Size: 1 million words
Contents: written language; 500 text samples of approx. 2,000 words; 15 text categories; British counterpart of Brown corpus
Annotation: untagged and tagged version
POS tagging (CLAWS tagset)
Availability: ICAME CD

  
  POW - Polytechnic of Wales Corpus

Developer:  The Computational Linguistics Unit at University of Wales College of Cardiff
Collection date: 1980s

Size: 65,000 words
Contents: transcripts of spoken language of children
Annotation: POS tagging, syntactic parsing
Availability: ICAME CD

  
  SEC - Lancaster/IBM Spoken English Corpus

Developer:  University of Lancaster and IBM Scientific Centre
Collection date: 1984-87

Size: 52,000 words
Contents: spoken language; transcripts from radio-broadcasts, recordings made at University of Lancaster, Open University tapes
Annotation: prosodic markup, POS tagged with CLAWS
Availability: ICAME CD

   

East African English 

  
  ICE-EA - International Corpus of English, East African Component

Developer:  Diana Hudson-Ettle, Barbara Krohne and Josef Schmied at Chemnitz University
Collection date: 1989-99

Size: 1 million words
Contents: written and spoken Kenyan and Tanzanian English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation: textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability: ICAME CD or see here

    

Indian English 

  
  ICE - International Corpus of English, Indian Component

Developer:  co-ordinated by S.V. Shastri at Shivaji University, Kolhapur, and Gerhard Leitner, Freie Universität Berlin
Collection date: 1990-93

Size: 1 million words
Contents: written and spoken Indian English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation: textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability: see here

  
  Kolhapur Corpus

Developer:  S. K. Verma at University of Lancaster and Shivaji University, Kolhapur
Collection date: 1978

Size: 1 million words; 500 text samples of approx. 2,000 words
Contents: written language; modelled on BROWN and LOB

Annotation: untagged
Availability: ICAME CD

   

New Zealand English 

  
  ICE - International Corpus of English, New Zealand Component

Developer:  co-ordinated at Victoria University, Wellington
Collection date: 1990-93

Size: 1 million words
Contents: written and spoken New Zealand English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation: textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability: see here

    
  Wellington Corpus

Developer:  Laurie Bauer at Victoria University, Wellington
Collection date: 1986-90

Size: 1 million words; 500 text samples of approx. 2,000 words
Contents: written language; modelled on BROWN and LOB

Annotation: untagged
Availability: ICAME CD

  
  Wellington Corpus of Spoken New Zealand English

Developer:  Janet Holmes, Bernadette Vine and Gary Johnson at Victoria University, Wellington
Collection date: 1988-94

Size: 1 million words; 500 text samples of approx. 2,000 words
Contents: spoken language; formal, semi-formal and informal speech

Annotation: discourse markup
Availability: ICAME CD

   

Philippine English

  
  ICE - International Corpus of English, Philippine Component

Developer:  co-ordinated by Maria Lourdes S. Bautista, De La Salle University, Manila
Collection date: 1990-93

Size: 1 million words
Contents: written and spoken Philippine English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation: textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability: see here

   

Singapore English

  
  ICE - International Corpus of English, Singapore Component

Developer:  co-ordinated by Anne Pakir at the National University of Singapore
Collection date: 1990-93

Size: 1 million words
Contents: written and spoken Singapore English; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide (more info)
Annotation: textual markup, discourse annotation, POS tagging, syntactic parsing (more info)
Availability: see here

  

 

English as a Lingua Franca 

  
  VOICE - Vienna Oxford International Corpus of English

Developer:  Barbara Seidlhofer at the University of Vienna
Collection date: since 2001 (ongoing)

Size: 250,000 words to date, to be extended
Contents: spoken English; interactions in English as a lingua franca; unscripted, largely face-to-face communication among competent non-native speakers including private and public dialogues, private and public group discussions and casual conversations, and one-to-one interviews.

Annotation: conversational markup

  
  ELFA - English as a Lingua Franca in Academic Settings

Developer:  Anna Mauranen at Tampere University
Collection date: ongoing

Size: 0.5 million words
Contents: spoken academic English involving non-native speakers; includes various speech events (e.g. lectures, workshops, seminars, presentations)

Annotation:

   

Historical English 


  ARCHER Corpus - A Representative Corpus of Historical English Registers

Developer:  Northern Arizona University in co-operation with the Universities of Uppsala, Helsinki and Freiburg
Sampling period: 1650-1990

Size: 1.7 million words
Contents: 1,037 texts; 10 registers (e.g., drama, letters, science prose), including British and American; sampled from 7 historical periods covering Early Modern English; speech-based, popular, and specialist/academic written registers
Annotation: POS tagged
Availability:


  CEECS - Corpus of Early English Correspondence Sampler

Developer:  M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period: 1418-1680

Size: 450,000 words
Contents: personal letters
Annotation:
Availability: ICAME CD


  Helsinki Corpus of English Texts: Diachronic Part

Developer:  M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period: ca. 750 to 1700

Size: 1.6 million words
Contents: Old, Middle and Early Modern English texts
Annotation:
Availability: ICAME CD


  Helsinki Corpus of Older Scots

Developer:  M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period: 1450-1700

Size: 830,000 words
Contents: Older Scots texts
Annotation: untagged
Availability: ICAME CD


  Lampeter Corpus of Early Modern English Tracts

Developer:  Josef Schmied, Claudia Claridge and Rainer Siemund at TU Chemnitz
Sampling period: 1640-1740

Size: 1.1 million words
Contents: non-literary prose texts
Annotation: textual markup
Availability: ICAME CD


  Newdigate Newsletters Corpus

Developer:  Philip Hines, Jr., Norfolk, Virginia
Sampling period: 1692

Size: 750,000 words
Contents: a series of more than 2,000 newsletters in the Newdigate series (most of which are addressed to Sir Richard Newdigate, Warwickshire)
Annotation: untagged
Availability: ICAME CD

   

Books and other websites containing descriptions of corpora

  • Kennedy, Graeme (1998): Introduction to Corpus Linguistics. London: Longman.
      
  • Meyer, Charles (2002): English corpus linguistics: an introduction. Cambridge: CUP.

S.Braun (at) surrey.ac.uk

updated 26/04/05