|
BROWN Corpus
| Developer: |
Nelson Francis and Henry Kucera at Brown
University, Providence, Rhode Island |
| Collection date: |
1960s |
|
Size:
|
1 million words |
| Contents: |
written language; 500 text samples
of approx. 2,000 words; 15 text categories |
| Annotation: |
untagged and tagged version
POS tagging |
| Availability: |
ICAME CD |
CPSAE
- Corpus of Spoken Professional American English
| Developer: |
Michael Barlow at Athelstan and Rice
University, Houston/TX |
| Collection date: |
1994-1998 |
|
Size:
|
2 million words (2 sub-corpora, approx. 1 million words each) |
| Contents: |
academic discourse and White House briefings; short interchanges by 400 speakers |
| Annotation: |
untagged and tagged version
POS tagging on the basis of the CLAWS
tagset developed at Lancaster University |
| Availability: |
from Athelstan |
FROWN
- Freiburg BROWN Corpus of American English
| Developer: |
Christian Mair at the University of Freiburg |
| Collection date: |
1990s |
|
Size:
|
1 million words |
| Contents: |
matches the original Brown corpus |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
MICASE
- Michigan Corpus of Academic Spoken English
| Developer: |
R. C. Simpson, S. L. Briggs, J. Ovens, and J. M. Swales
at the English Language Institute, University of Michigan |
| Collection date: |
since 1997, ongoing |
|
Size:
|
1.7 million words |
| Contents: |
transcripts and audio files of academic speech |
| Annotation: |
discourse annotation |
|
Availability:
|
freely available on the Web |
SUSANNE
Corpus
| Developer: |
Geoffrey Sampson at the University of Essex |
| Collection date: |
1960s |
|
Size:
|
130,000 words |
| Contents: |
subset of BROWN Corpus |
| Annotation: |
POS tagged and syntactically parsed subset of BROWN |
|
Availability:
|
freely available on the Web |
|
|
Bank of
English (Collins Cobuild)
| Developer: |
John Sinclair and his team at the University of
Birmingham and Harper-Collins |
| Collection date: |
1980s |
|
Size:
|
more than 450 million words in 2005, growing |
| Contents: |
one of the largest English corpora, a 'monitor' corpus
(i.e. continually growing); originally collected as a basis for the
creation of the COBUILD dictionary but since then continually
expanded; originally containing 75 % written and 25 % spoken language,
70 % British, 20 % American, 5 % other varieties; containing entire
texts rather than samples; covering a wide cross-section of
contemporary English |
|
Annotation:
|
POS tagged |
|
Availability:
|
free search in Collins
WordBanks online, a 56 million word subset of the BoE |
BNC -
British National Corpus
COLT
- Bergen Corpus of London Teenage Language
| Developer: |
University of Bergen, Norway |
| Collection date: |
1993 |
|
Size:
|
500,000 words |
| Contents: |
transcripts of spoken language of London teenagers
(COLT is part of the BNC) |
| Annotation: |
POS tagging |
| Availability: |
ICAME CD |
CHRISTINE
Corpus
| Developer: |
Geoffrey Sampson at the University of Essex |
| Collection date: |
1990s |
|
Size:
|
100,000 words |
| Contents: |
informal spoken language (taken from BNC) |
| Annotation: |
POS tagged and syntactically parsed subset of spoken
part of BNC |
|
Availability:
|
freely available on the Web |
FLOB
- Freiburg-LOB Corpus of British English
| Developer: |
Christian Mair at the University of Freiburg |
| Collection date: |
1990s |
|
Size:
|
1 million words
|
| Contents: |
matches the original LOB corpus |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
ICE-GB
- International Corpus of English, British Component
| Developer: |
co-ordinated by Gerald Nelson at University College
London |
| Collection date: |
1990-93 |
|
Size:
|
1 million words |
| Contents: |
written and spoken language covering a variety of
genres
the aim of the International Corpus of English (ICE) project was to build
comparable corpora of 15 regional varieties of English for comparative
studies of English worldwide |
| Annotation: |
textual markup, discourse annotation, POS tagging,
syntactic parsing |
| Availability: |
on CD |
Lancaster
Parsed Corpus
| Developer: |
Roger Garside, Geoffrey Leech and Tamas Varadi at
the University of Lancaster |
| Collection date: |
1978 |
|
Size:
|
140,000 words |
| Contents: |
parsed subcorpus of the LOB |
| Annotation: |
POS tagging, syntactic parsing |
| Availability: |
ICAME CD |
LLC
- London-Lund Corpus of Spoken English
| Developer: |
Randolph Quirk and Sidney Greenbaum at University
College London
Jan Svartvik at Lund University |
| Collection date: |
1960s, 1975-81, 1985-88 |
|
Size:
|
500,000 words |
| Contents: |
spoken language
based on the Survey of English Usage (SEU, 1959, University College
London) and on the Survey of Spoken English (SSE, 1975, Lund
University) |
| Annotation: |
prosodic and discourse annotation |
| Availability: |
ICAME CD |
LOB
- Lancaster/Oslo-Bergen Corpus
| Developer: |
compiled under the direction of Geoffrey Leech,
University of Lancaster, and Stig Johansson, University of Oslo, in
collaboration with Knut Hofland, Norwegian Computing Centre for the
Humanities, Bergen |
| Collection date: |
1970-1978 |
|
Size:
|
1 million words |
| Contents: |
written language; 500 text samples of approx.
2,000 words; 15 text categories; British counterpart of Brown corpus |
| Annotation: |
untagged and tagged version
POS tagging (CLAWS tagset) |
| Availability: |
ICAME CD |
POW
- Polytechnic of Wales Corpus
| Developer: |
The Computational Linguistics Unit at University of
Wales College of Cardiff |
| Collection date: |
1980s |
|
Size:
|
65,000 words |
| Contents: |
transcripts of spoken language of children |
| Annotation: |
POS tagging, syntactic parsing |
| Availability: |
ICAME CD |
SEC
- Lancaster/IBM Spoken English Corpus
| Developer: |
University of Lancaster and IBM Scientific Centre |
| Collection date: |
1984-87 |
|
Size:
|
52,000 words |
| Contents: |
spoken language; transcripts from radio-broadcasts,
recordings made at University of Lancaster, Open University tapes |
| Annotation: |
prosodic markup, POS tagged with CLAWS |
| Availability: |
ICAME CD |
|
|
ICE-EA
- International Corpus of English, East African Component
|
|
ICE
- International Corpus of English, Indian Component
Kolhapur
Corpus
| Developer: |
S. K. Verma at University of Lancaster and Shivaji
University, Kolhapur |
| Collection date: |
1978 |
|
Size:
|
1 million words, 500 text samples of approx. 2,000 words |
| Contents: |
written language;
modelled on BROWN and LOB |
|
Annotation:
|
untagged |
|
Availability:
|
ICAME CD |
|
|
ICE
- International Corpus of English, New Zealand Component
Wellington
Corpus
| Developer: |
Laurie Bauer at Victoria University, Wellington |
| Collection date: |
1986-90 |
|
Size:
|
1 million words; 500 text samples of approx. 2,000 words |
| Contents: |
written language;
modelled on BROWN and LOB |
|
Annotation:
|
untagged |
|
Availability:
|
ICAME CD |
Wellington
Corpus of Spoken New Zealand English
| Developer: |
Janet Holmes, Bernadette Vine and Gary Johnson at
Victoria University, Wellington |
| Collection date: |
1988-94 |
|
Size:
|
1 million words; 500 text samples of approx. 2,000 words |
| Contents: |
spoken language; formal, semi-formal and informal
speech |
|
Annotation:
|
discourse markup |
|
Availability:
|
ICAME CD |
|
|
ICE
- International Corpus of English, Philippine Component
|
|
ICE
- International Corpus of English, Indian Component
|
English as a Lingua Franca
|
|
VOICE
- Vienna-Oxford International Corpus of English
| Developer: |
Barbara Seidlhofer at the University of Vienna |
| Collection date: |
since 2001 (ongoing) |
|
Size:
|
250,000 words to date, to be extended |
| Contents: |
spoken English; interactions in English as a lingua franca; unscripted,
largely face-to-face communication among competent non-native speakers
including private and public dialogues, private and public group
discussions and casual conversations, and one-to-one interviews. |
|
Annotation:
|
conversational markup |
ELFA
- English as a Lingua Franca in Academic Settings
| Developer: |
Anna Mauranen at Tampere University |
| Collection date: |
ongoing |
|
Size:
|
0.5 million words |
| Contents: |
spoken academic English involving non-native speakers;
includes various speech events (e.g. lectures, workshops, seminars,
presentations) |
|
Annotation:
|
|
|
|
ARCHER
Corpus - A Representative Corpus of Historical English Registers
| Developer: |
Northern Arizona University in co-operation
with the Universities of Uppsala, Helsinki and Freiburg |
| Sampling period: |
1650-1990 |
|
Size:
|
1.7 million words |
| Contents: |
1,037 texts; 10 registers (e.g., drama,
letters, science prose), including British and American; sampled from
7 historical periods covering Early Modern English; speech-based,
popular, and specialist/academic written registers |
| Annotation: |
POS tagged |
| Availability: |
|
CEECS
- Corpus of Early English Correspondence Sampler
| Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at
the Department of English, University of Helsinki |
| Sampling period: |
1418-1680 |
|
Size:
|
450,000 words |
| Contents: |
personal letters |
| Annotation: |
|
| Availability: |
ICAME CD |
Helsinki
Corpus of English Texts: Diachronic Part
| Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of
Helsinki |
| Sampling period: |
ca. 750 to 1700 |
|
Size:
|
1.6 million words |
| Contents: |
Old, Middle and Early Modern English texts |
| Annotation: |
|
| Availability: |
ICAME CD |
Helsinki
Corpus of Older Scots
| Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of
Helsinki |
| Sampling period: |
1450-1700 |
|
Size:
|
830,000 words |
| Contents: |
Older Scots texts |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
Lampeter
Corpus of Early Modern English Tracts
| Developer: |
Josef Schmied, Claudia Claridge and Rainer
Siemund at TU Chemnitz |
| Sampling period: |
1640-1740 |
|
Size:
|
1.1 million words |
| Contents: |
non-literary prose texts |
| Annotation: |
textual markup |
| Availability: |
ICAME CD |
Newdigate
Newsletters Corpus
| Developer: |
Philip Hines, Jr., Norfolk, Virginia |
| Sampling period: |
1692 |
|
Size:
|
750,000 words |
| Contents: |
a series of more than 2,000 newsletters in
the Newdigate series (most of which are addressed to Sir Richard
Newdigate, Warwickshire) |
| Annotation: |
untagged |
| Availability: |
ICAME CD |
|