Normalizing Word Counts
One of the things we often do in corpus linguistics is to compare one corpus (or one part of a corpus) with another. Let's imagine an example. We have two corpora: Corpus A and Corpus B. And we're interested in the frequency of the word boondoggle. We find 18 occurrences in Corpus A and 47 occurrences in Corpus B. So we make a chart comparing the raw counts:
Corpus A = 18
Corpus B = 47
The problem here is that unless Corpus A and Corpus B are exactly the same size this chart is misleading. It doesn’t accurately reflect the relative frequencies in each corpus. In order to accurately compare corpora (or sub-corpora) of different sizes, we need to normalize their frequencies.
Let’s say Corpus A contains 821,273 words and Corpus B contains 4,337,846 words. Our raw frequencies then are:
Corpus A = 18 per 821,273 words
Corpus B = 47 per 4,337,846 words
To normalize, we want to calculate the frequencies for each per the same number of words. The convention is to calculate per 1,000 or perhaps 10,000 words for smaller corpora and per 1,000,000 for larger ones. The Corpus of Contemporary American English (COCA), for example, uses per-million figures in its chart display for comparisons across text-types.
Calculating a normalized frequency is a fairly straightforward process. We actually encounter normalized data all of the time. Percentages are simply data normalized per 100. Some corpora (like Google's Ngram Viewer) actually normalize by percentage (or per 100 words). When presenting corpus data, we don't usually (and I emphasize usually here) do that, because it can be difficult to get your head around what such a small normalized number is showing. Just look at the Ngram Viewer… What is the 0.000180% on the y-axis telling us? Is that large? Or small? (It works out to about 1.8 occurrences per million words.) So it is standard practice to normalize to a larger number:
We have 18 occurrences per 821,273 words, which is the same as x (our normalized frequency) per 1,000,000 words. Solving for x is easy. We divide the occurrences of the token (18) by the total number of tokens in the corpus (821,273). Then we multiply that number by our normalizing factor, in this case a million (for a percentage it would be a hundred).
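As a rough sketch, here is what that calculation might look like in a few lines of Python (the normalize function is just an illustrative name, not part of any corpus tool; the figures are the ones from our example):

```python
def normalize(raw_count, corpus_size, per=1_000_000):
    """Convert a raw count into a normalized frequency.

    raw_count   -- occurrences of the token in the corpus
    corpus_size -- total number of tokens in the corpus
    per         -- normalizing factor (1,000,000 here; 100 would give a percentage)
    """
    return raw_count / corpus_size * per

# The example figures from above
print(round(normalize(18, 821_273), 2))      # Corpus A: 21.92 per million words
print(round(normalize(47, 4_337_846), 2))    # Corpus B: 10.83 per million words
```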
For our example, we can see how this affects the representation of our data:
Corpus A = 21.92 per 1,000,000 words
Corpus B = 10.83 per 1,000,000 words
The raw frequencies seemed to suggest that boondoggle appeared more than two and a half times as often in Corpus B. The normalized frequencies, however, show that boondoggle is actually about twice as frequent in Corpus A.
Keep in mind, too, that if you are comparing your results to those in another corpus (say, COCA), you should normalize to the same factor (e.g., per one million words). That way, you're comparing apples to apples.
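As a quick, hypothetical sketch of what that means in practice, a rescale helper (again, just an illustrative name) can convert a frequency from one normalizing factor to another, so that, say, a per-10,000 figure can be set against COCA-style per-million figures:

```python
def rescale(freq, from_per, to_per):
    """Convert a normalized frequency from one base to another,
    e.g. per 10,000 words -> per 1,000,000 words."""
    return freq * to_per / from_per

# Corpus A's rate of ~0.2192 per 10,000 words is the same rate as ~21.92 per million
print(round(rescale(0.2192, 10_000, 1_000_000), 2))   # -> 21.92
```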