In this six-part series, Jamie Keddie asks, 'What is a corpus?' and invites us to think about how we might use corpora in the classroom.

What is a corpus?

In order to answer this question, let’s go back to the year 1755. The great lexicographer, Samuel Johnson, has just completed the heroic task of writing the most influential dictionary in the history of the English language. One notable feature of the 42,773 entries in his work is that they are accompanied by both definitions and literary quotations. For example,


Opulence: Wealth; riches; affluence

“There in full opulence a banker dwelt,
Who all the joys and pangs of riches felt;
His sideboards glitter’d with imagin’d plate,
And his proud fancy held a vast estate.”
- Jonathan Swift

For Dr Johnson, it was these literary quotations – or ‘illustrations’ as they are usually referred to – that carried the weight: the specimens came first; analysis came later. This meant that the largest part of his 10-year task involved ploughing through huge quantities of texts.

Writer Henry Hitchings hints at the state of Johnson’s work area when he says, ‘The garret at 17 Gough Square [his study] … became a sort of backstreet abattoir specializing in the evisceration of books; traumatized volumes lay all around.’ These ‘traumatized volumes’ comprised Dr Johnson’s corpus – a large body of texts that exists primarily for linguistic research.

Modern corpora

250 years later, the corpus principle remains the same. Thankfully, however, technology has tidied things up for us. A modern corpus will generally have the following characteristics:

  • Texts are stored electronically. Databases can usually be accessed online.
  • Corpora can be very large. For example, the World English Corpus, which was used to compile the Macmillan English Dictionary, contains over 200 million words.
  • Users normally have to pay a subscription charge in order to use a corpus. However, this is not always the case (see below).
  • Unlike Dr Johnson’s corpus, an electronically stored corpus has a search facility. This is a very important feature that would have saved our 18th century lexicographer a lot of time.

One example of a modern corpus is the World English Corpus, a unique corpus of over 200 million words from spoken and written sources. This was used in the creation of the Macmillan Dictionary.

Uses of corpora

One very useful feature of the online Macmillan Dictionary is that the most frequent 7,500 words in English appear in red, along with a star rating. These are the target words that any learner who wants to succeed at advanced level should aim for. Three-star words are the most common 2,500 words in the language, two-star words are the next most common 2,500, and one-star words are again the next most common 2,500.

In order to decide upon these ‘red words’, the dictionary writers will have made use of word frequency data obtained from corpora. In fact, this is one of the most basic functions of a corpus – to identify the most common words and items that it contains within its texts.
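As a rough illustration of how frequency data of this kind might be obtained, the following Python sketch counts the words in a tiny invented corpus and maps a frequency rank onto the three star bands described above. The sample sentences and the function names (tokenize, frequency_list, star_rating) are assumptions made for this example only; it is not a description of how the Macmillan lexicographers actually processed their data.

```python
import re
from collections import Counter

# A tiny invented corpus; a real one would contain millions of words.
texts = [
    "The banker counted the money.",
    "The money was in the bank.",
    "A banker is a person who works in a bank.",
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(corpus):
    """Count how often each word occurs across all texts in the corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return counts

def star_rating(rank, band_size=2500):
    """Map a frequency rank (1 = most common word) to a star rating.

    Ranks 1-2,500 get three stars, 2,501-5,000 two stars,
    5,001-7,500 one star, and anything rarer gets none.
    """
    if rank <= band_size:
        return 3
    if rank <= 2 * band_size:
        return 2
    if rank <= 3 * band_size:
        return 1
    return 0

counts = frequency_list(texts)

# Print the five most frequent words with their counts.
for word, count in counts.most_common(5):
    print(word, count)
```

With a corpus this small every word would fall into the three-star band, so the star_rating function only becomes meaningful once the frequency list runs to many thousands of words.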

Another possibility for investigating language is to use a concordancer. This is a piece of software that searches the corpus and lines up contextualized instances of the word or item under investigation (as well as providing additional data about it). Concordancers are invaluable to lexicographers for a huge range of analytical studies including collocational analysis.
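To give a feel for what a concordancer does, here is a minimal Python sketch that lines up each instance of a search word with a fixed amount of context on either side, and then counts the words that appear immediately before it as a very crude form of collocational analysis. The sample sentences, the function names (concordance, left_neighbours) and the width parameter are all hypothetical; real concordancing software offers far more than this.

```python
import re
from collections import Counter

# Invented sample sentences standing in for a corpus.
texts = [
    "She made a quick decision.",
    "They made a difficult decision yesterday.",
    "He has to make a decision soon.",
]

def concordance(corpus, keyword, width=30):
    """Return one line per match, with the keyword aligned in a column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for text in corpus:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so that the keywords line up.
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

def left_neighbours(corpus, keyword):
    """Count the words that occur immediately before the keyword."""
    counts = Counter()
    for text in corpus:
        words = re.findall(r"[a-z']+", text.lower())
        for i, word in enumerate(words):
            if word == keyword.lower() and i > 0:
                counts[words[i - 1]] += 1
    return counts

for line in concordance(texts, "decision"):
    print(line)

print(left_neighbours(texts, "decision").most_common())
```

Reading down the aligned column is what lets a lexicographer (or a learner) spot patterns such as the adjectives that typically come before a noun.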

There are no limits to the type of linguistic research that can be carried out using a corpus. However, although the principle is simple, Michael Rundell points out that ‘you may have the best corpus and the best lexicographers in the world, but you won’t produce a good dictionary unless you apply well-thought-through principles to the complex process of analyzing corpus data and converting it into useful, relevant, and easy-to-use dictionary text.’

Types of corpora

The character of a corpus is determined by the type of texts that constitute it. Whereas Dr Johnson’s corpus consisted largely of works by Shakespeare, Milton, Dryden and other literary figures, a modern general corpus will contain both written and transcribed spoken material from a wide range of media such as:

  • Books
  • Magazines
  • Newspapers
  • Emails
  • Websites
  • Television
  • Radio
  • Conversations

Linguistic investigation will often require the analysis of specialised texts, and the corpora that a researcher uses or creates may reflect this. A corpus could, for example, consist entirely of any of the following:

  • Samples of written US English
  • Samples of spoken British English
  • Business correspondence
  • Legal contracts
  • Old English
  • Children’s speech

Here are three other types of corpora that are worth mentioning:

The learner corpus: Just as it sounds, this is a database of samples of English (or any language) that have been produced by learners. The writers of the Macmillan Dictionary used such a corpus in order to identify the most common problems that learners experience when using English.

The multilingual corpus: The two corpora that we have mentioned so far (Dr Johnson’s and the World English Corpus) are both monolingual in that they are made up entirely of texts in English. However, a corpus can consist of texts in two or more languages and provide translators with an effective tool for finding the equivalent ways in which different languages express similar ideas.

Non-conventional corpora: If we define a corpus as nothing more than a large database of texts with a search facility, then we suddenly realize that most of us use corpora every day. If you have ever used the search window on the home page of onestopenglish to look for specific articles and/or lesson plans, then you have used the corpus principle. Similarly, the Windows Live Hotmail email programme has a search window which allows me to quickly locate previously received or sent emails as and when I need to (this has recently changed my life, since my inbox can be compared to an electronic version of Dr Johnson’s garret). In fact, the biggest corpus of all is the World Wide Web itself, and although it has not been specifically created for linguistic investigation, its usefulness for this purpose should by no means be disregarded.

Corpora in the classroom: a look ahead at the series

So how does all this affect the humble language learner? Well, that is exactly what I would like to address throughout the next six articles in this series. We will see that, with a bit of thought, creativity and sensitivity to learners’ attitudes and needs, teachers can exploit corpora:

  • to create motivating classroom activities
  • to enhance their own and their learners’ linguistic understanding
  • to answer questions such as, ‘What is the difference between no and not?’
  • to create investigative homework activities
  • to promote learner autonomy

The first three articles in this series will focus on non-conventional corpora such as lyric search sites, film scripts and internet search engines. Later, we will examine the diverse possibilities for conventional corpora in the classroom. Finally, we will see how easy and advantageous it can be for your learners to develop their own personal corpus.