Jamie Keddie provides a novel approach to using the Internet to research differences in language usage.

In this article I will look at some original ways in which internet search engines can be used to strengthen our language learners’ understanding of certain aspects of vocabulary, grammar and language in general.

The problem with choice

Recently I used the following joke with a class of particularly inquisitive learners:

Two campers were hiking in the forest when all of a sudden a bear jumps out of a bush and starts chasing them. Both campers start running for their lives, when one of them stops and starts to put on his running shoes. His partner says, 'What are you doing? You can't outrun a bear!' His friend replies, 'I don't have to outrun the bear, I only have to outrun you!'

Do you like the joke? My students did, but they seemed a bit confused about one aspect of the grammar, namely the inconsistent patterns that follow the verb to start:

starts chasing…   Verb followed by an -ing form
start running…   Verb followed by an -ing form
starts to put on…  Verb followed by an infinitive

The only thing I could do to explain the situation was to point out that the verb in question is flexible. I told them that there are a few other verbs like this – to begin, to continue and to like, for example. But then a new question arose: which is more common – start to do something or start doing something?

In my experience as a teacher, this is a common type of learner question. I have come to the conclusion that language learners don’t always like choice. I remember when I had been learning Spanish for a few months, I met the following two structures:

Lo estoy haciendo = I am doing it

Estoy haciéndolo = I am doing it

I wasn’t interested in learning both structures. That seemed like a waste of time to me. I wanted simplicity and that meant selecting one of the structures over the other, focusing on it and making it my own. Native language speakers surely have personal preferences for certain vocabulary and grammar. Why shouldn’t non-natives?

Google searches

The 'Which is more common?' question is more complex than it might seem. Among the many factors that may or may not have to be taken into account when considering it are:

  • Individual tendencies to use one item of vocabulary/grammatical structure over another
  • Different dialects
  • Register (formal/informal, written/spoken, etc.)
  • Genre (shopping list, song lyrics, mobile phone message, onestopenglish article, etc.)
  • Differences in meaning (this will be discussed later)

Despite the complexity, a Google search is a good, practical way of dealing with the question. Simply type in an item and take a note of the number of hits. In response to the bear joke, my students and I decided to compare the World Wide Web frequencies of the following word associations:

Table: Google Fight

If you have access to a computer, it will take approximately two minutes to make a chart for the results in Microsoft Excel.

Googlefight table

Results like these should be taken with a pinch of salt (i.e. not be taken too seriously) and we will see why shortly. But despite this, they do provide students with tangible evidence about their new language and sometimes it is important to see things for yourself rather than take someone else’s (i.e. the teacher’s) word for it.

Google fight

This site allows you to pair up words or phrases and let them contend against each other for Google hits (although not affiliated with Google, the Google fight site still makes use of the Google search engine to find its results).

Enter the rival items into the separate windows and click on 'make a fight'. Following a fight between a pair of matchstick men, you are given a graph which shows you the number of results obtained for each item. The item that returns the greatest number of hits is the winner.

The site itself has a number of suggestions such as:

Table: Google2

Because of its visual aspect and fun nature, this site is usually popular with students. By the way, when you type multi-word items into Google and Google fight, make sure you enclose them in double inverted commas (" "). They keep the words together, so that the search looks for the exact phrase.
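For readers who like to script their searches, the effect of the inverted commas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `search_url` is hypothetical, the double quotation marks are the ones Google uses for exact phrases, and the `https://www.google.com/search?q=` address format is an assumption about the search engine rather than anything the Google fight site prescribes.

```python
from urllib.parse import quote_plus

def search_url(item: str, exact: bool = True) -> str:
    """Build a search address for `item` (hypothetical helper).

    When `exact` is True and the item has more than one word, the item is
    wrapped in double quotation marks so that the words are kept together.
    """
    if exact and len(item.split()) > 1:
        item = f'"{item}"'
    # quote_plus encodes spaces as '+' and quotation marks as %22.
    return "https://www.google.com/search?q=" + quote_plus(item)

print(search_url("start to"))               # phrase kept together
print(search_url("start to", exact=False))  # two separate words
```

Without the quotation marks, the engine is free to match pages that contain the two words anywhere, which would inflate the number of hits for a multi-word item.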

Here are some more 'Which is more common?' questions that have arisen in my classes recently that were turned into Google search activities:

Table: Google3

If a 'Which is more common?' question arises in your class, why not prepare an exercise that your students can carry out in class (if you have Internet access) or do as homework? For example:

Type the following items into the Googlefight website or the Google search engine and take a note of the number of hits that you observe in each case:

Table: Google4

Differences in meaning

I know what you may be thinking at this stage: that in many of the above pairs of language items there is a specific difference in meaning. To think about something is certainly not always the same as to think of something. You can probably distinguish in some way between like doing something and like to do something. And, given half the chance, most of us will probably be more than happy to go into a lengthy explanation of the difference between I saw you play basketball and I saw you playing basketball.

Native speakers of any language are able to perceive slight differences in meaning such as these through years and years of constant exposure to their language. But we have to be aware that such differences are often intangible and too subtle to explain to learners, especially with no given context.  Sometimes, the language explanations we offer our students are no more useful than a description of the colour blue to a person who has been blind from birth.

Google fights may be of limited value to a learner. But, unlike teacher explanations, they are empirical and objective. They allow students to see things for themselves and form their own opinions. And once students have been shown how to carry them out, the teacher’s presence is not even necessary. This is good for promoting learner autonomy.

Using Google for language investigation is by no means an original idea. The Internet is what Michael Rundell has referred to as 'the biggest corpus of all'.  But what is a corpus?

Corpora and corpus linguistics

A corpus, or text corpus, is a huge database of millions of words of written and spoken language that has been compiled for the purpose of linguistic research.

A general corpus such as the British National Corpus will, for example, contain excerpts from newspapers, magazines, literature and the internet as well as transcribed conversations, radio presentations, advertisements, etc. The British National Corpus allows non-subscribers to carry out free 'simple searches'. To do this, go to the British National Corpus website, enter a word or item of your choice and click Solve it! The search will tell you the number of times your item was found in the corpus (database). This is followed by up to 50 random examples of the item in context (all taken from the corpus). Try this now.

To give you an example, I have just typed in the word 'funny', clicked Solve it! and been informed that 4,315 solutions (hits) were found in the corpus. Among the 50 contextualized examples of the word that I have been given are:

  … he choked, went a funny colour, ripped his collar open, waved his arms a bit, and dropped down dead.
  The preview for Dirty Rotten Scoundrels shows con man Steven Martin suavely pushing a hapless granny into the sea, but the final film doesn’t: perhaps it was too funny to be included.

These results could be used in class to demonstrate, for example, that the word 'funny' has two different meanings – funny peculiar and funny ha ha.
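The idea behind such a search (a hit count followed by examples in context) can be sketched in Python. This is a minimal illustration, not a description of how the British National Corpus search actually works: the function name `concordance` is hypothetical, and the 'corpus' is just the two example lines quoted above, the second one shortened.

```python
import re

# The two example lines quoted above, used as a tiny stand-in corpus.
SAMPLE = [
    "he choked, went a funny colour, ripped his collar open, "
    "waved his arms a bit, and dropped down dead.",
    "perhaps it was too funny to be included.",
]

def concordance(word, texts, width=20):
    """Return (number of hits, keyword-in-context lines) for `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so the keywords line up.
            lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return len(lines), lines

hits, lines = concordance("funny", SAMPLE)
print(hits, "hits")
for line in lines:
    print(line)
```

Lining the keyword up in a column like this makes it easier for students to compare the words that come before and after it, which is where the two meanings of 'funny' show themselves.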

You may have noticed that modern dictionaries (and many grammar books) are 'corpus informed'.  This means that dictionary writers are no longer sitting around large tables and arriving at word definitions based on their own internal and personal ideas and understandings. Instead, they are looking externally to language usage itself. They use corpora (plural) to examine language in use.  This is the principle behind corpus linguistics.

Potential problems and pitfalls

I am a big fan of the use of corpora in language learning and many of my students have come to feel the same way. But for others, the use of corpora in the classroom can all seem a bit academic and the enthusiasm is not shared. I have found, however, that practically all students enjoy the Google fights that were described previously.

However, at this point, a word of warning is needed. Unlike a real corpus, the Internet was never designed for language investigation. Speaking of the web, Michael Rundell says:

  '… some text-types are very well represented, and others are hardly present at all. Contemporary fiction, for instance, exists only in tiny amounts on the web, but any respectable general corpus would include a significant percentage of this important and influential text-type.'

Here are some more potential pitfalls:

1) US English
The language of the Internet is heavily weighted towards American English:

colour  [UK spelling]  128 million hits  
color   [US spelling]  539 million hits

2) Written English
Consider the following Google fight:

“I was arrested”  452,000 hits
“I got arrested”  278,000 hits

This result may cause us to believe that 'I was arrested' is a more common structure than 'I got arrested'.  But get-passives are more common in spoken English than in written English. Spoken English is underrepresented on the web and so the above results may be misleading.

3) Multiple word meanings
Another trap that we can fall into can be demonstrated by the following search:

sweater   20.9 million hits
jumper   20 million hits
jersey    242 million hits
pullover   14.3 million hits

These results cannot reflect the true frequency of use on the Internet of these four items of clothing.  The search will not distinguish between jersey as a jumper and Jersey as a channel island or jumper as a jersey and jumper as an athlete. If we were using a real corpus, created and designed for language research, we could get around this problem quite simply. But on the internet, there is no way of being sure that our results are not influenced by all sorts of unforeseen factors. Results can also be affected by the strong presence on the web of names of products, films and songs, slogans and general computer jargon.
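The problem can be reproduced on a very small scale. The sketch below assumes, as a simplification, that a hit counter matches whole words and ignores capital letters; the four sample sentences are invented for illustration and the function name `count_word` is hypothetical.

```python
import re

# Invented sample sentences, for illustration only.
SAMPLE = [
    "She knitted a warm jersey for the winter.",
    "We took the ferry to Jersey last summer.",
    "The jumper cleared the bar at her first attempt.",
    "He spilled coffee on his new jumper.",
]

def count_word(word, texts):
    """Count whole-word matches, ignoring case and ignoring meaning."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)

for item in ("jersey", "jumper"):
    print(item, count_word(item, SAMPLE))
```

Each word is counted twice, although only one sentence in each pair is about clothing. The counter has no way of knowing which meaning was intended, and neither does a simple web search.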

4) More unforeseen factors
One more pitfall that awaits us when we use search engines for language investigation can be exemplified by the following hit that was obtained when 'think about' was entered into Google:

  Today’s were average, 94 new cases I thinkAbout half of these come from suspected cases or those in quarantine, but that still leaves about…
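One plausible reading of this hit is that the search ignored a sentence or line break between 'think' and 'About'. The Python sketch below illustrates that kind of false match. It rests on assumptions: the text in `PAGE` is a hypothetical reconstruction with a full stop inserted, both function names are invented, and real search engines are far more elaborate than this.

```python
import re

# Hypothetical reconstruction of the page text: a sentence break is
# assumed between "think" and "About". The original page is not shown.
PAGE = ("Today's were average, 94 new cases I think. "
        "About half of these come from suspected cases.")

def naive_phrase_hits(phrase, text):
    """Count phrase matches after discarding punctuation and case."""
    words = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    return sum(words[i:i + n] == target for i in range(len(words) - n + 1))

def sentence_aware_hits(phrase, text):
    """Count phrase matches only when they fall inside one sentence."""
    sentences = re.split(r"[.!?]+", text)
    return sum(naive_phrase_hits(phrase, s) for s in sentences)

print(naive_phrase_hits("think about", PAGE))
print(sentence_aware_hits("think about", PAGE))
```

The first count treats the end of one sentence and the start of the next as the phrase 'think about'; the second, which respects sentence boundaries, does not.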

Make sure your learners are aware of these potential traps and, as has already been said, take all results with a pinch of salt. The result of a Google fight is nothing more than a learner’s rule of thumb (a rule that exists to guide but is not necessarily 100% true).

Another Google fight idea

Every now and again, a student will bring to everyone’s attention an example of maverick English that he/she has come across in a film, magazine, advert, etc. Rather than discard this as 'incorrect' language, it can be a good idea to introduce the idea of standard and non-standard English.

A Google fight can be a very good way of demonstrating the difference in frequency of use. Here are some examples:

themselves      340 million hits
theirselves      237,000 hits

'I did it'      13.6 million hits
'I done it'      182,000 hits

'he gave me it'     10,900 hits
'he gave it me'     9,730 hits

'if I were'      12.3 million hits
'if I was'      6.2 million hits

'it doesn’t'      141 million hits
'it don’t'      1.4 million hits

'if it happened I would'    634 hits
'if it would happen I would'    205 hits

Since the Internet is constantly changing, the results obtained for Google fights will also vary slightly from day to day.  The figures that I have given throughout this article are those that I obtained at the end of November, 2006.
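Raw hit counts of very different sizes are hard to compare at a glance, so it can help to turn each fight into percentages. The Python sketch below does this for some of the figures quoted above; the function name `share` is hypothetical, and the percentages are only as reliable as the hit counts themselves.

```python
# Hit counts quoted in this article (end of November 2006).
PAIRS = [
    ("themselves", 340_000_000, "theirselves", 237_000),
    ("I did it", 13_600_000, "I done it", 182_000),
    ("he gave me it", 10_900, "he gave it me", 9_730),
    ("if I were", 12_300_000, "if I was", 6_200_000),
    ("it doesn't", 141_000_000, "it don't", 1_400_000),
]

def share(hits_a, hits_b):
    """Percentage of the combined hits that belongs to the first item."""
    return 100 * hits_a / (hits_a + hits_b)

for item_a, hits_a, item_b, hits_b in PAIRS:
    first = share(hits_a, hits_b)
    print(f"{item_a!r}: {first:.1f}%   {item_b!r}: {100 - first:.1f}%")
```

A result close to 50%, as with 'he gave me it' and 'he gave it me', suggests that neither form clearly dominates on the web, whereas a very lopsided result points to one form being far less usual than the other.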

Further investigation

Humanising Language Teaching is a free online journal that is run by Pilgrims, a well-known UK language school. It comes out every two months and has a section called 'Corpora Ideas'.

The Michael Rundell article from which I have quoted twice ('the Biggest Corpus of all') can be seen in the May 2000 issue (year 2, issue 3).

