Tools & websites
This page offers information about some common corpus tools and links to
resources on the web.
|
This section links to corpora that can be freely searched online. Each of them
comes with its own search engine/interface and with different features. Some
of the websites offer search in more than one corpus.
NB: This section focusses on the features
available online. The corpora themselves (e.g. Bank of English, British National
Corpus, Brown Corpus) are briefly described in the English
corpora section.
- Bank of English sampler - search in a 56 million word subset of the Bank of English:
  - Search by word, phrase, wildcard, part of speech or a combination of these.
  - KWIC concordances of variable length (concordance output restricted to 40 lines).
  - Collocation sampler to retrieve a word's most significant collocates.
- British National Corpus (BNC) - sample search in the BNC at the BNC website:
  - Search by word, phrase, wildcard, part of speech or a combination of these.
  - Sentence concordances (output restricted to 50 samples).
Also available for the BNC:
- PIE (Phrases in English) - web interface based on BNC phrases, by W.H. Fletcher:
  - Search for frequently co-occurring sequences of 2 to 8 words (word clusters).
  - Search all clusters of a particular length, or clusters containing a particular word, phrase or part of speech.
  - Cluster lists with frequency statistics, and KWIC concordances of the clusters.
- VIEW (Variation in English Words and Phrases) - web interface for the BNC, by M. Davies:
  - Search by word, phrase, wildcard, part of speech or a combination of these.
  - Search in the entire corpus as well as genre-specific searches.
  - Frequency statistics, collocates and KWIC concordances.
  - Compare quasi-synonyms or other related words and their collocates.
- Business Letter Corpus - search in business letters and some other texts, by S. Yasumasa:
  - Search by word, phrase or wildcard.
  - KWIC concordances of variable length.
- Compleat Lexical Tutor ('corpus-based concordance' section) - search in a range of corpora, in particular the Brown Corpus and a 2 million word subset of the BNC, as well as a range of smaller corpora:
  - Search by word, phrase or wildcard.
  - KWIC concordances of variable length, collocate frequencies.
  - Gapped KWIC concordances as a basis for exercises.
- Corpuseye - search in different types of corpora, especially Wikipedia as a corpus:
  - Search by words or phrases.
  - KWIC concordances, collocate frequency.
  - Morphosyntactic analysis of concordance lines.
- Edict Virtual Language Centre Web Concordancer - search in a range of corpora, especially the Brown Corpus and LOB, as well as literary and other texts (The Times, Hitchhiker's Guide to the Galaxy, King James Bible, Starr Report):
  - Search by word, phrase or wildcard
  - KWIC concordances of variable length, collocate frequencies, sentence concordances
  - Gapped KWIC concordances as a basis for exercises
  - Collocational frameworks
- ELISA - English Language Interview Corpus as a Second-Language Application - a small audiovisual corpus of spoken English developed with pedagogical goals:
  - Easy access to full interview text and videos
  - Browse corpus by topic index
  - Online concordancer (KWIC and sentence format, search by word, phrase or wildcard)
  - Ready-made concordance of all words in the whole corpus and in each interview
  - Ready-made frequency lists for the whole corpus and each interview
- MICASE - Michigan Corpus of Academic Spoken English - search according to a range of criteria:
  - Browse according to specified speaker and speech event attributes (file references)
  - Search by word or phrase in specified contexts (KWIC concordances)
- WebCorp - search in the entire Web as the corpus (basis: Google):
  - Search by word, phrase or wildcard
  - KWIC concordances, word lists, some good advanced features
  - Disadvantage: not language-specific
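Several of the interfaces listed above return word clusters and collocates. As a rough illustration of what these two terms mean, the following Python sketch counts contiguous word sequences and the words found near a search word. It is not how any of the listed websites work internally: the tokeniser is a naive regular expression, the function names and the sample text are made up for this example, and collocates are ranked by raw frequency only, whereas tools that report "significant" collocates apply an additional statistic that is not shown here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, English-oriented)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_clusters(tokens, n):
    """Count every contiguous sequence of n tokens (an n-word cluster)."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Made-up sample text, for illustration only.
sample = (
    "The cat sat on the mat. The cat saw the dog. "
    "The dog sat on the mat too."
)
tokens = tokenize(sample)
clusters = word_clusters(tokens, 3)
neighbours = collocates(tokens, "cat", span=1)

print(clusters.most_common(3))
print(neighbours.most_common(3))
```

Changing `n` between 2 and 8 gives clusters of the lengths mentioned for PIE above; widening `span` includes more distant neighbours of the search word.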
|
Online full-text search in books
|
|
|
The archives listed below offer a variety of texts and smaller corpora for
download. To search them with corpus analysis methods, you will normally need an
offline text/corpus analysis tool, i.e. a concordancer.
Alternatively, you may be able to carry out some simple analyses with online
text analysis tools.
- American Rhetoric project - media archive
  More than 5000 full-text, audio and (streaming) video versions of public speeches, sermons, legal proceedings, lectures, debates, interviews and other recorded media events.
- Internet Archive - media archive
  A digital library of Internet sites and other cultural artifacts in digital form (text, audio, video).
- Literary Web Concordances - literary texts
  Free online search (concordances and a range of interesting features).
- Online Books Page (University of Pennsylvania) - literary texts
  Free access to texts in different formats (meta search in a number of archives).
- Oxford Text Archive - literary texts
  Free download as well as online search (concordances), wide variety of languages.
- Project Gutenberg - literary texts
  Free download (e.g. complete works of Shakespeare).
- State of the Union Archive - media archive
  All State of the Union addresses, provided by c-span.org (transcripts, and since 1989 video clips as well).
- University of Virginia eBook Library - literary texts
  Approx. 2,000 literary texts in html format.
|
Online text/corpus analysis tools
|
|
|
This section lists a selection of simple text analysis tools that can be used online, i.e. without installation.
These tools allow you to create, for example, concordances, word lists and text profiles from
your own texts or from web pages of your choice.
- Compleat Lexical Tutor ('text-based concordances' section) - analyse your own text:
  - KWIC concordance for each word in the text.
  - See also 'phrase extractor' section to build concordance with word clusters.
- Edict Virtual Language Centre ('Word Frequency Text Profiler' section) - analyse your own text:
  - Compares the text against well-known word lists (1000/2000 most frequent English words and others).
  - Highlights words of different frequency bands in different colours.
  - See also 'Unique Words Text Profiler' (finds all words which occur only once in a text).
- Spaceless - analyse a text or web page of your choice:
  - Returns a variety of word lists.
- TurboLingo - analyse a text or web page of your choice:
  - KWIC concordance for all words in the text/web page
  - Frequency lists and other features
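To give an idea of what the tools in this section produce, here is a minimal Python sketch of a KWIC (keyword in context) concordance, a frequency list, and a list of words that occur only once in a text. It is an illustration under simplifying assumptions, not the method used by any of the tools listed: the tokeniser is a naive regular expression, context is measured in words rather than characters, and all function names and the sample text are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, English-oriented)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) tuple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so that the keyword forms one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  [{kw}]  {right}" for left, kw, right in lines]


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def hapaxes(tokens):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


# Made-up sample text, for illustration only.
sample = (
    "A corpus is a collection of texts. "
    "A concordance lists every occurrence of a word in a corpus."
)
tokens = tokenize(sample)
lines = kwic(tokens, "corpus", width=2)
formatted = format_kwic(lines)
freqs = frequency_list(tokens)
once_only = hapaxes(tokens)

for line in formatted:
    print(line)
print(freqs[:3])
print(once_only)
```

To analyse your own material with a sketch like this, replace `sample` with the contents of a text file; a web page would first need its HTML markup removed, which is not covered here.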
|
Offline text/corpus analysis tools
|
|
|
This section lists software packages that are commonly referred to as concordancers.
They provide a more comprehensive range of functions than the online analysis tools listed
above (usually creation of concordances, alphabetical and frequency word lists,
comparison of word lists and other statistical functions). Most packages can be
freely downloaded but require installation.
- AntConc - free; by L. Anthony
  - For Windows and Linux.
  - Reads text, html, and xml files.
  - Main functions: concordances, citation of search term in its co-text, collocates, word clusters, frequency lists, text profiling through keyword lists.
- ConcApp - free; by C. Greaves
  - For Windows.
  - Main functions: concordances, collocate search, frequency lists.
- Concordance - by R.J.C. Watts
  - For Windows.
  - Creates a complete concordance for each word in a corpus and supports its publication as a web concordance.
  - Other functions: individual concordances, citation of search term in its co-text, frequency lists, text profiling through keyword lists, and a range of other statistical functions.
- KwicFinder - by W.H. Fletcher
  - For Windows.
  - Different from the other packages in that it focusses on the analysis of web pages.
- MonoConc Pro - by Michael Barlow/Athelstan
  - For Windows.
  - Very comprehensive package.
- Simple Concordance Program - free; by A. Reed
  - For Windows and Mac.
  - Main functions: concordances, citation of search term in context, frequency lists.
- TextSTAT - free; by M. Huening
  - For Windows, Linux and Mac.
  - Reads text, html, Word and Open Office files.
  - Web spider facility for corpus creation directly from Internet sources.
  - Main functions: concordances, citation of search term in context, frequency lists.
- WordSmith Tools - by Mike Scott
  - For Windows.
  - Very comprehensive package.
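The "comparison of word lists" and "keyword lists" mentioned in this section can be illustrated with a short Python sketch. It compares the word frequencies of a study text with those of a reference text and ranks the words that are relatively more frequent in the study text. The log-likelihood statistic is used here because it is one commonly used keyness measure; this is an assumption made for the example and does not claim to reproduce what any particular package in the list computes. The function names and the tiny sample texts are likewise invented.

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) for one word observed in two corpora of given sizes."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    g2 = 0.0
    # A zero frequency contributes nothing to the sum.
    if freq_study > 0:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2


def keywords(study_tokens, ref_tokens, top=10):
    """Rank words that are relatively more frequent in the study text.

    Both token lists are assumed to be non-empty.
    """
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    scored = []
    for word, freq in study.items():
        # Keep only words over-represented in the study text.
        if freq / n_study > ref[word] / n_ref:
            score = log_likelihood(freq, n_study, ref[word], n_ref)
            scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Made-up token lists, for illustration only; real use needs far larger texts.
study_tokens = "tax tax tax budget the the".split()
ref_tokens = "the the the the a of tax".split()
result = keywords(study_tokens, ref_tokens)

for word, score in result:
    print(f"{word}\t{score:.2f}")
```

With texts this small the scores are not meaningful; the sketch only shows the mechanics of comparing two frequency lists.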
|
|
This section focusses on corpus-related resources for the learning and teaching
context.
|
Corpus linguistics websites
|
|
The following websites include resources and link collections generally
related to corpus linguistics.