Normalizing Word Counts

One of the things we often do in corpus linguistics is compare one corpus (or one part of a corpus) with another. Let’s imagine an example. We have two corpora, Corpus A and Corpus B, and we’re interested in the frequency of the word boondoggle. We find 18 occurrences in Corpus A and 47 occurrences in Corpus B. So we make the following chart:

The problem here is that unless Corpus A and Corpus B are exactly the same size, this chart is misleading. It doesn’t accurately reflect the relative frequencies in each corpus. To compare corpora (or sub-corpora) of different sizes accurately, we need to normalize their frequencies.

Let’s say Corpus A contains 821,273 words and Corpus B contains 4,337,846 words. Our raw frequencies then are:

Corpus A = 18 per 821,273 words

Corpus B = 47 per 4,337,846 words

To normalize, we want to express each frequency per the same number of words. The convention is to calculate per 10,000 words for smaller corpora and per 1,000,000 for larger ones. The Corpus of Contemporary American English (COCA), for example, uses per-million calculations in its chart display for comparisons across text types.

Calculating a normalized frequency is fairly straightforward. The equation can be represented in this way:

18 / 821,273 = x / 1,000,000

We have 18 occurrences per 821,273 words, which is the same as x (our normalized frequency) per 1,000,000 words. We can solve for x with simple cross multiplication:

x = (18 × 1,000,000) / 821,273 ≈ 21.92

Generalizing then (normalizing per one-million words):

normalized frequency = (raw frequency × 1,000,000) / corpus size
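If you would rather script the calculation, here is a minimal Python sketch of the same cross multiplication. The function name normalize and its per parameter are our own choices for illustration, not part of any corpus tool; per defaults to one million and can be set to 10,000 for smaller corpora.

```python
def normalize(raw_count, corpus_size, per=1_000_000):
    """Return the number of occurrences per `per` words.

    `normalize` and `per` are illustrative names, not from any library.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    # Cross multiplication: raw_count / corpus_size = x / per
    return raw_count * per / corpus_size


# The boondoggle example from the text
corpus_a = normalize(18, 821_273)
corpus_b = normalize(47, 4_337_846)

print(f"Corpus A: {corpus_a:.2f} per million words")  # Corpus A: 21.92 per million words
print(f"Corpus B: {corpus_b:.2f} per million words")  # Corpus B: 10.83 per million words
```

Calling normalize(18, 821_273, per=10_000) gives the per-10,000 figure instead (about 0.22), which shows why the larger base is easier to read for rare words.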


For our example, we can see how this affects the representation of our data:

The raw frequencies seemed to suggest that boondoggle appeared more than 2.5 times as often in Corpus B. The normalized frequencies (about 21.92 per million words in Corpus A versus 10.83 in Corpus B), however, show that boondoggle is actually about twice as frequent in Corpus A.
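To check that reversal yourself, the short Python sketch below computes both ratios from the example figures. The variable names are ours, chosen only for illustration.

```python
# Raw counts of "boondoggle" and corpus sizes (in words) from the example
raw = {"Corpus A": 18, "Corpus B": 47}
sizes = {"Corpus A": 821_273, "Corpus B": 4_337_846}

# Normalize each count to occurrences per one million words
per_million = {name: raw[name] * 1_000_000 / sizes[name] for name in raw}

# Raw counts favor Corpus B; normalized frequencies favor Corpus A
raw_ratio = raw["Corpus B"] / raw["Corpus A"]
normalized_ratio = per_million["Corpus A"] / per_million["Corpus B"]

print(f"Raw counts, B to A: {raw_ratio:.2f}")               # Raw counts, B to A: 2.61
print(f"Per million words, A to B: {normalized_ratio:.2f}")  # Per million words, A to B: 2.02
```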