Introduction to corpora

Type: Article

In this new six-part series, Jamie Keddie asks, 'What is a corpus?' and invites us to think about how we might use corpora in the classroom.

The word corpus seems to be making more and more appearances in language teaching and language learning contexts. On the back cover of the new edition of the Macmillan English Dictionary for advanced learners, for example, you will find the following two logos:

[Logos: World English Corpus; Centre for English Corpus Linguistics]

Besides meeting the word in modern dictionaries and grammar books, teachers may also have become aware of corpora (the plural of corpus) through conferences, workshops and articles.


What is a corpus?

In order to answer this question, let’s go back to the year 1755. The great lexicographer, Samuel Johnson, has just completed the heroic task of writing the most influential dictionary in the history of the English language.

One notable feature of the 42,773 entries in his work is that they are accompanied by both definitions and literary quotations. For example,

Opulence

Wealth; riches; affluence

“There in full opulence a banker dwelt,
Who all the joys and pangs of riches felt;
His sideboards glitter’d with imagin’d plate,
And his proud fancy held a vast estate.”
- Jonathan Swift


For Dr Johnson, it was these literary quotations, or “illustrations” as they are usually called, that carried the weight: the specimens came first, analysis came later. This meant that the largest part of his 10-year task involved ploughing through huge quantities of texts.

Writer Henry Hitchings hints at the state of Johnson’s work area when he says, “The garret at 17 Gough Square [his study]… became a sort of backstreet abattoir specializing in the evisceration of books; traumatized volumes lay all around.”

[Image: Dr Samuel Johnson]

These “traumatized volumes” comprised Dr Johnson’s corpus: a large body of texts that exists primarily for linguistic research.


Modern corpora

250 years later, the corpus principle remains the same. Thankfully, however, technology has tidied things up for us. A modern corpus will generally have the following characteristics:

  • Texts are stored electronically. Databases can usually be accessed online.
  • Corpora can be very large. For example, the World English Corpus (mentioned above), which was used to compile the Macmillan English Dictionary, contains over 200 million words.
  • Users normally have to pay a subscription charge in order to use a corpus. However, this is not always the case (see below).
  • Unlike Dr Johnson’s corpus, an electronically stored corpus has a search facility. This is a very important feature that would have saved our 18th century lexicographer a lot of time.

If you have never done so before, why not try out a corpus search right now? The British National Corpus allows non-subscribers to carry out free 'simple searches'. Here is how to do it:

1) Go to http://www.natcorp.ox.ac.uk/.

2) Enter a word or item of your choice.
 
3) Click 'Solve it!'

4) The search will tell you the frequency of your item in the corpus. This is followed by up to 50 random examples of your item in context.
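
To make the idea of a 'simple search' concrete, here is a minimal Python sketch of what such a search does: it counts how often an item occurs and returns a random selection of examples in context. It is not the British National Corpus software; the four-sentence CORPUS, the function name simple_search and its parameters are all invented for illustration.

```python
import random
import re

# A tiny stand-in corpus (invented for illustration).
# A real corpus such as the BNC holds many millions of words.
CORPUS = [
    "The banker lived in great opulence.",
    "Opulence is no guarantee of happiness.",
    "She saved her money in the bank.",
    "The river bank was covered in reeds.",
]


def simple_search(item, texts, max_examples=50, seed=None):
    """Return (frequency, examples) for a whole-word, case-insensitive search."""
    pattern = re.compile(r"\b" + re.escape(item) + r"\b", re.IGNORECASE)
    # Frequency: total number of matches across all texts.
    frequency = sum(len(pattern.findall(text)) for text in texts)
    # Examples: texts containing at least one match, sampled at random.
    hits = [text for text in texts if pattern.search(text)]
    rng = random.Random(seed)
    examples = rng.sample(hits, min(max_examples, len(hits)))
    return frequency, examples


frequency, examples = simple_search("bank", CORPUS)
print("Frequency:", frequency)
for example in examples:
    print(example)
```

Because the search matches whole words only, "bank" is found in the last two sentences but not inside "banker".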


Uses of corpora

One very useful feature of the new edition of the Macmillan English Dictionary for advanced learners is that the most frequent 7,500 words in English are printed in red. These are the target words that any learner who wants to succeed at advanced level should aim for.

In order to decide upon these ‘red words’, the dictionary writers will have made use of word frequency data obtained from corpora. In fact, this is one of the most basic functions of a corpus – to identify the most common words and items that it contains within its texts.
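
As a rough illustration of how word frequency data can be obtained, the Python sketch below counts every word in a miniature corpus and lists the most common ones. The two sentences in TEXTS and the function name word_frequencies are invented for this example; the dictionary writers' actual tools and methods are not described here and are likely to be far more sophisticated.

```python
import re
from collections import Counter

# A miniature corpus, invented for illustration.
TEXTS = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]


def word_frequencies(texts):
    """Count how often each lowercased word occurs across all texts."""
    counts = Counter()
    for text in texts:
        # Treat runs of letters (and apostrophes) as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


frequencies = word_frequencies(TEXTS)
# The most frequent words come first.
for word, count in frequencies.most_common(3):
    print(word, count)
```

On a corpus of millions of words, taking the top 7,500 items of such a list would be one simple way to arrive at a set of target words.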

Another possibility for investigating language is to use a concordancer. This is a piece of software that searches the corpus and lines up contextualized instances of the word or item under investigation (as well as providing additional data about it). Concordancers are invaluable to lexicographers for a huge range of analytical studies including collocational analysis.
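
The 'lining up' that a concordancer performs can be sketched in a few lines of Python. This is a simplified illustration only: the sentences in TEXTS, the function name concordance and the width parameter are assumptions made for the example, and real concordancers offer much more (sorting, part-of-speech filters, collocation statistics).

```python
import re

# Example sentences, invented for illustration.
TEXTS = [
    "He made a strong case for reform.",
    "She drinks strong coffee every morning.",
    "A strong wind blew across the bay.",
]


def concordance(keyword, texts, width=20):
    """Return keyword-in-context lines with the keyword aligned in one column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-justify the left context so every keyword starts
            # in the same column.
            lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines


for line in concordance("strong", TEXTS):
    print(line)
```

With the keyword in a fixed column, the words immediately to its right (case, coffee, wind) are easy to scan, which is the starting point for collocational analysis.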

A corpus can support almost any kind of linguistic research. However, although the principle is simple, Michael Rundell points out that “you may have the best corpus and the best lexicographers in the world, but you won’t produce a good dictionary unless you apply well-thought-through principles to the complex process of analyzing corpus data and converting it into useful, relevant, and easy-to-use dictionary text.”


Types of corpora

The character of a corpus is determined by the type of texts that constitute it. Whereas Dr Johnson’s corpus consisted largely of works by Shakespeare, Milton, Dryden and other literary figures, a modern general corpus will contain both written and transcribed spoken material from a wide range of media such as:

  • Books
  • Magazines
  • Newspapers
  • Emails
  • Television
  • Radio
  • Conversations

Linguistic investigation will often require the analysis of specialised texts, and the corpora that a researcher uses or creates may reflect this. A corpus could, for example, consist entirely of any of the following:

  • Samples of written US English
  • Samples of spoken British English
  • Business correspondence
  • Legal contracts
  • Old English
  • Children’s speech

Here are three other types of corpora that are worth mentioning:

The learner corpus: Just as it sounds, this is a database of samples of English (or any language) that have been produced by learners. The writers of the Macmillan English Dictionary used such a corpus in order to identify the most common problems that learners experience when using English.

The multilingual corpus: The three corpora that we have mentioned so far (Dr Johnson’s, the World English Corpus and the British National Corpus) are all monolingual in that they are made up entirely of texts in English. However, a corpus can consist of texts in two or more languages and provide translators with an effective tool for finding the equivalent ways in which different languages express similar ideas.
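
To give a feel for how a translator might query a multilingual corpus, here is a minimal Python sketch. It assumes the corpus is stored as aligned sentence pairs; the three English-Spanish pairs in PARALLEL and the function name find_translations are invented for illustration, and real parallel corpora are aligned and searched with far more care.

```python
# Aligned English-Spanish sentence pairs, invented for illustration.
PARALLEL = [
    ("I am hungry.", "Tengo hambre."),
    ("I am twenty years old.", "Tengo veinte años."),
    ("She is a teacher.", "Es profesora."),
]


def find_translations(phrase, pairs):
    """Return the aligned pairs whose source sentence contains the phrase.

    This is plain substring matching, ignoring case; it does not check
    word boundaries.
    """
    phrase = phrase.lower()
    return [(source, target) for source, target in pairs
            if phrase in source.lower()]


for source, target in find_translations("I am", PARALLEL):
    print(f"{source}  ->  {target}")
```

Both matches render "I am" with a form of the Spanish verb tener ("to have"), the kind of equivalence that is hard to guess from a word-for-word translation.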

Non-conventional corpora: If we define a corpus as nothing more than a large database of texts with a search facility, then we suddenly realize that most of us use corpora every day. If you have ever used the search window on the home page of onestopenglish to look for specific articles and/or lesson plans, then you have used the corpus principle. Similarly, the new Windows Live Hotmail has a search window which allows me to quickly locate previously received or sent emails as and when I need to (this has recently changed my life since my Inbox can be compared to an electronic version of Dr Johnson’s garret). In fact, the biggest corpus of all is the World Wide Web itself and although it has not been specifically created for linguistic investigation, its usefulness for this purpose should by no means be disregarded.


Corpora in the classroom: a look ahead at the series

So how does all this affect the humble language learner? Well, that is exactly what I would like to address throughout the next six articles in this series. We will see that with a bit of thought, creativity and sensitivity to learners’ attitudes and needs, teachers can exploit corpora:

  • to create motivating classroom activities
  • to enhance their own and their learners’ linguistic understanding
  • to answer questions such as, “What is the difference between no and not?”
  • to create investigative homework activities
  • to promote learner autonomy

The first three articles in this series will focus on non-conventional corpora such as lyric search sites, film scripts and internet search engines. Later, we will examine the diverse possibilities for conventional corpora in the classroom (the British National Corpus, for example). Finally, we will see how easy and advantageous it can be for your learners to develop their own personal corpus.

References

Hitchings, Henry (2005) Dr Johnson’s dictionary: The extraordinary story of the book that defined the world. John Murray.

Michael Rundell describes how the Macmillan English Dictionary for advanced learners was written at: http://www.macmillandictionary.com/createhow.htm


Readers' comments (6)

  • The BNC has now removed the simple search function.

  • Hi there,

    Thanks for your feedback. As Jamie mentions in the article, the Macmillan English Dictionary for advanced learners has the 7,500 most frequently used words printed in red, so this will give you an idea of which words are more likely to be needed by a learner.

    Similarly, the Macmillan Online Dictionary contains the same red words. 90% of the time, speakers of English use just these 7,500 red words in speech and writing, which are graded in the dictionary with stars. One-star words are frequent, two-star words are more frequent, and three-star words are the most frequently used words in the language.

    So, if you want to know how frequently a word occurs and whether it should be included in your vocabulary lessons, why not look it up in the online dictionary?

    http://www.macmillandictionary.com/

    Hope that helps and good luck with your teaching.

    Best wishes,

    The onestopenglish team

  • I've been trying to get to grips with the BNC after reading about its usefulness as a language analysis tool. However, I do not find it very user-friendly. I can bring up a range of examples of a word just by typing it in the search field, but how, for instance, can I get information about the FREQUENCY of that word? Knowing how frequently a word occurs (and, therefore, how likely a learner is to hear and need to use it) should be the starting point for any vocabulary lesson.

  • Thanks im52! We have now updated this on the page.

  • I think I've found it. Go to: http://www.natcorp.ox.ac.uk/
    I hope I've been able to help you.

  • This is all very interesting, but I wish the link http://sara.natcorp.ox.ac.uk/lookup.html
    worked. It doesn't. Would somebody know where to find this corpus search site?
