
Tools & websites

This page offers information about some common corpus tools and links to resources on the web. 

Online search in corpora

This section links to corpora that can be freely searched online. Each comes with its own search interface and its own set of features. Some of the websites offer search in more than one corpus. 
NB: This section focusses on the features available online. The corpora themselves (e.g. Bank of English, British National Corpus, Brown Corpus) are briefly described in the English corpora section.
  • Bank of English sampler – search in a 56 million word subset of the Bank of English:
    - Search by word, phrase, wildcard, part of speech or a combination of these.
    - KWIC concordances of variable length (concordance output restricted to 40 lines).
    - Collocation sampler to retrieve a word's most significant collocates.
        
  • British National Corpus (BNC) - sample search in the BNC at the BNC website: 
    - Search by word, phrase, wildcard, part of speech or a combination of these.
    - Sentence concordances (output restricted to 50 samples).
       
    Also available for the BNC:
       
  • PIE (Phrases in English) – web interface based on BNC phrases, by W.H. Fletcher: 
    - Search for frequently co-occurring word sequences of 2 to 8 words in length (word clusters).
    - Search all clusters of a particular length or clusters containing a particular word, phrase or part of speech.
    - Cluster lists with frequency statistics, and KWIC concordances of the clusters.
       
  • VIEW (Variation in English Words and Phrases) – web interface for the BNC, by M. Davies: 
    - Search by word, phrase, wildcard, part of speech or a combination of these.
    - Search in the entire corpus as well as genre-specific searches.
    - Frequency statistics, collocates and KWIC concordances.
    - Compare quasi-synonyms or other related words and their collocates.
         
  • Business Letter Corpus – search in business letters and some other texts, by S. Yasumasa:
    - Search by word, phrase or wildcard.
    - KWIC concordances of variable length.
         
  • Compleat Lexical Tutor ('corpus-based concordance' section) - search in a range of corpora, in particular the Brown Corpus and a 2 million word subset of the BNC, as well as several smaller corpora:
    - Search by word, phrase or wildcard.
    - KWIC concordances of variable length, collocate frequencies.
    - Gapped KWIC concordances as a basis for exercises.
         
  • Corpuseye – search in different types of corpora, especially Wikipedia as a corpus:
    - Search by words or phrases.
    - KWIC concordances, collocate frequency.
    - Morphosyntactic analysis of concordance lines.
         
  • Edict Virtual Language Centre Web Concordancer – search in a range of corpora, especially the Brown Corpus and LOB, as well as literary and other texts (The Times, Hitchhiker's Guide to the Galaxy, King James Bible, Starr Report):
    - Search by word, phrase or wildcard
    - KWIC concordances of variable length, collocate frequencies, sentence concordances
    - Gapped KWIC concordances as a basis for exercises
    - Collocational frameworks
         
  • ELISA - English Language Interview Corpus as a Second-Language Application - a small audiovisual corpus of spoken English developed with pedagogical goals:
    - Easy access to full interview text and videos
    - Browse corpus by topic index
    - Online concordancer (KWIC and sentence format, search by word, phrase or wildcard)
    - Ready-made concordance of all words in the whole corpus and in each interview
    - Ready-made frequency lists for the whole corpus and each interview
         
  • MICASE - Michigan Corpus of Academic Spoken English - search according to a range of criteria:
    - Browse according to specified speaker and speech event attributes (file references)
    - Search by word or phrase in specified contexts (KWIC concordances)
       
  • WebCorp – search in the entire Web as the corpus (basis: Google)
    - Search by word, phrase or wildcard
    - KWIC concordances, word lists, some good advanced features
    - Disadvantage: not language-specific  
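
Most of the interfaces above return KWIC (keyword in context) concordances. The following is a minimal Python sketch of what such output looks like; it is an illustration only, not how any of the listed websites works. The function name kwic, the width parameter and the sample text are assumptions made for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit, with `width` characters of context on each side."""
    lines = []
    # Whole-word, case-insensitive match of the search term.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that all keywords line up in one column.
        lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return lines

# Hypothetical sample text, standing in for a real corpus.
sample = "The cat sat on the mat. A cat is a small animal. Cats like milk."
for line in kwic(sample, "cat", width=15):
    print(line)
```

Because the pattern matches whole words only, "Cats" is not returned here. A wildcard search of the kind the websites offer would need a looser pattern.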

Online full-text search in books

Text and media archives

The archives listed below offer a variety of texts and smaller corpora for download. To search them with corpus analysis methods, you will normally need an offline text/corpus analysis tool, i.e. a concordancer. Alternatively, you may be able to carry out some simple analyses with online text analysis tools.
  • American Rhetoric project – media archive
    More than 5000 full text, audio and (streaming) video versions of public speeches, sermons, legal proceedings, lectures, debates, interviews, other recorded media events.
       
  • Internet Archive - media archive
    A digital library of Internet sites and other cultural artifacts in digital form (text, audio, video).
       
  • Literary Web Concordances – literary texts
    Free online search (concordances and a range of interesting features).
       
  • Online Books Page (University of Pennsylvania) – literary texts 
    Free access to texts in different formats (meta search in a number of archives).
       
  • Oxford Text Archive – literary texts 
    Free download as well as online search (concordances), wide variety of languages.
       
  • Project Gutenberg – literary texts 
    Free download (e.g. complete works of Shakespeare).
       
  • State of the Union Archive - media archive
    All State of the Union addresses, provided by c-span.org (transcripts, and since 1989 video clips as well).
      
  • University of Virginia eBook Library – literary texts
    Approx. 2,000 literary texts in html format.

Online text/corpus analysis tools

This section lists a selection of simple text analysis tools that can be used online, i.e. without installation. These tools allow you to create, for example, concordances, word lists and text profiles from your own texts or from web pages of your choice. 
  • Compleat Lexical Tutor ('text-based concordances' section) - analyse your own text:
    - KWIC concordance for each word in the text.
    - See also 'phrase extractor' section to build concordance with word clusters.
         
  • Edict Virtual Language Centre ('Word Frequency Text Profiler' section) - analyse your own text:
    - Compares the text against well-known word lists (1000/2000 most frequent English words and others).
    - Highlights words of different frequency bands in different colours.
    - See also 'Unique Words Text Profiler' (finds all words which occur only once in a text).
        
  • Spaceless – analyse a text or web page of your choice:
    - Returns a variety of word lists.
        
  • TurboLingo - analyse a text or web page of your choice:
    - KWIC concordance for all words in the text/web page
    - Frequency lists and other features
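
The kinds of output these tools produce (frequency lists, words that occur only once, and a comparison against a list of common words) can be sketched in a few lines of Python. This is a rough illustration under simplifying assumptions, not the method used by any of the tools above: the tokeniser is deliberately crude, and the set called common is a tiny hypothetical stand-in for a real list of the most frequent English words.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()

def words_occurring_once(text):
    """Return the words that occur exactly once, in alphabetical order."""
    counts = Counter(tokenize(text))
    return sorted(word for word, n in counts.items() if n == 1)

def profile(text, reference_words):
    """Split the text's vocabulary into words on and off a reference list."""
    vocabulary = set(tokenize(text))
    return sorted(vocabulary & reference_words), sorted(vocabulary - reference_words)

# Hypothetical sample text and reference list.
sample = "The corpus is large. The corpus is tagged and the tools are free."
common = {"the", "is", "and", "are"}

print(frequency_list(sample)[:3])
print(words_occurring_once(sample))
on_list, off_list = profile(sample, common)
print(on_list, off_list)
```

A real profiler would load a published word list from a file and would probably also reduce inflected forms to their base forms before comparing.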

Offline text/corpus analysis tools

This section lists software packages that are commonly referred to as concordancers. They provide a more comprehensive range of functions than the online analysis tools listed above (usually creation of concordances, alphabetical and frequency word lists, comparison of word lists and other statistical functions). Most packages can be freely downloaded but require installation. 
  • AntConc - free; by L. Anthony
    - For Windows and Linux.
    - Reads text, html, and xml files.
    - Main functions: concordances, citation of search term in its co-text, collocates, word clusters, frequency lists, text profiling through keyword lists.
         
  • ConcApp - free; by C. Greaves 
    - For Windows.
    - Main functions: concordances, collocate search, frequency lists.
        
  • Concordance - by R.J.C. Watts 
    - For Windows.
    - Creates a complete concordance for each word in a corpus and supports 
      its publication as a web concordance.
    - Other functions: individual concordances, citation of search term in its co-text, 
      frequency lists, text profiling through keyword lists, and a range of other statistical functions.
         
  • KwicFinder - by W.H. Fletcher 
    - For Windows.
    - Different from the other packages in that it focusses on the analysis of web pages.
        
  • MonoConc Pro - by Michael Barlow/Athelstan.
    - For Windows.
    - Very comprehensive package.
        
  • Simple Concordance Program - free; by A. Reed 
    - For Windows and Mac.
    - Main functions: concordances, citation of search term in context, frequency lists.
        
  • TextSTAT - free; by M. Huening
    - For Windows, Linux and Mac.
    - Reads text, html, Word and Open Office files.
    - Web spider facility for corpus creation directly from Internet sources.
    - Main functions: concordances, citation of search term in context, frequency lists.
         
  • Wordsmith Tools - by Mike Scott
    - For Windows.
    - Very comprehensive package. 
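
Word clusters, which several of the packages above can extract, are recurring sequences of adjacent words. The following is a minimal Python sketch of the idea and is not taken from any of the listed programs. The function name word_clusters, its parameters n and min_freq, and the sample text are assumptions made for this example.

```python
import re
from collections import Counter

def word_clusters(text, n=2, min_freq=2):
    """Return (cluster, count) pairs for n-word sequences seen at least min_freq times."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Pair each token with its n-1 successors to form overlapping n-word windows.
    windows = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(window) for window in windows)
    return [(cluster, freq) for cluster, freq in counts.most_common() if freq >= min_freq]

# Hypothetical sample text, standing in for a real corpus.
sample = "as well as a range of texts, as well as a range of tools"
print(word_clusters(sample, n=3))
```

This sketch ignores punctuation, so clusters may run across sentence boundaries. Whether that is acceptable depends on the analysis.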

Further resources 

This section focusses on corpus-related resources for the learning and teaching context. 

Corpus linguistics websites 

The following websites include resources and link collections generally related to corpus linguistics. 

S.Braun (at) surrey.ac.uk

updated 03/06/06