Understanding Keyness


In order to identify significant differences between 2 corpora or 2 parts of a corpus, we often use a statistical measure called keyness.

Imagine two highly simplified corpora. Each contains only 3 different words (cat, dog, and cow) and has a total of 100 words. The frequency counts are as follows:

  • Corpus A: cat 52; dog 17; cow 31
  • Corpus B: cat 9; dog 40; cow 31

Cat and dog would be key, as they are distributed differently across the corpora, but cow would not as its distribution is the same. Put another way, cat and dog are distinguishing features of the corpora; cow is not.

Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.

There are 2 common methods for calculating distributional differences: a chi-squared test (or χ² test) and log-likelihood. AntConc, for example, gives you the option of selecting one or the other under "Tool Preferences" in the Keyword settings.

For this post, I’m going to concentrate on the chi-squared test. Lancaster University has a nice explanation of log-likelihood here.

A chi-squared test measures the significance of an observed versus an expected frequency. To see how this works, I’m going to walk through an example from Rayson et al. (1997), summarized by Baker (2010).

For this example, we’ll consider the distribution by sex of the word lovely. Here is their data:

            lovely    All other words    Total words
Males       414       1714029            1714443
Females     1214      2592238            2593452
Total       1628      4306267            4307895
Table 1: Use of 'lovely' by males and females in the spoken demographic section of the BNC

What we want to determine is the statistical significance of this distribution. To do so, we’ll apply the same formula that they did: a Pearson’s chi-squared test. Here is the formula:

 \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

O is the observed frequency, and E is the expected frequency if the independent variable (in this case sex) had no effect on the distribution of the word. The ∑ is the mathematical symbol for ‘sum’. In our example, we need to calculate the sum of (or add together) four values: (observed minus expected) squared, divided by expected, for (1) lovely used by males; (2) other words used by males; (3) lovely used by females; and (4) other words used by females.

In other words, our main calculations are for the four inner cell values (the counts of lovely and of all other words for males and for females); we have a 2×2 contingency table. The totals (the peripheral table values) are the figures we use to determine our expected frequencies.

To find the expected frequency for a value in our 2×2 table, we use the following formula:

(R * C) / N

That is, the row total (R) times the column total (C) divided by the number of words in the corpus (N).

The expected value of lovely for males is:

(1714443 * 1628) / 4307895 = 647.91
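
If it helps to see this step as code, here is a minimal Python sketch of the expected-frequency calculation. The function and variable names are mine, for illustration only:

    def expected_frequency(row_total, column_total, corpus_total):
        # Expected cell frequency under the null hypothesis: (R * C) / N
        return (row_total * column_total) / corpus_total

    # Expected frequency of 'lovely' for males:
    # R = all words spoken by males, C = all tokens of 'lovely', N = corpus size
    print(round(expected_frequency(1714443, 1628, 4307895), 2))  # 647.91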

If we then fill out the expected frequencies for the rest of our table, it looks like this:

            lovely    All other words    Total words
Males       647.91    1713795.09         1714443
Females     980.09    2592471.91         2593452
Total       1628      4306267            4307895
Table 2: Expected use of 'lovely' in the spoken demographic section of the BNC

Now we can finish our calculations. For each of our table cells, we need to subtract the expected frequency from the observed frequency; multiply that value by itself; then divide the result by the expected frequency. The calculations for each cell would look like this:

            lovely                                          All other words                                                   Total words
Males       ((414 - 647.91) * (414 - 647.91)) / 647.91      ((1714029 - 1713795.09) * (1714029 - 1713795.09)) / 1713795.09   1714443
Females     ((1214 - 980.09) * (1214 - 980.09)) / 980.09    ((2592238 - 2592471.91) * (2592238 - 2592471.91)) / 2592471.91   2593452
Total       1628                                            4306267                                                          4307895
Table 3: Calculating the chi-square values for our frequencies

When we complete those calculations, our contingency table looks like this:

            lovely    All other words    Total words
Males       84.45     0.03               1714443
Females     55.83     0.02               2593452
Total       1628      4306267            4307895
Table 4: Chi-square values for our frequencies

Now we just add together those four values:

84.45 + 0.03 + 55.83 + 0.02 ≈ 140.3
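
For readers who prefer code, here is a short Python sketch of the whole calculation, using the observed counts from Table 1. The function and variable names are mine, for illustration only:

    def chi_square_2x2(a, b, c, d):
        # Pearson chi-square for a 2x2 contingency table laid out as:
        #   a  b   (e.g., males:   'lovely', all other words)
        #   c  d   (e.g., females: 'lovely', all other words)
        n = a + b + c + d
        chi2 = 0.0
        for observed, row_total, column_total in [
            (a, a + b, a + c),
            (b, a + b, b + d),
            (c, c + d, a + c),
            (d, c + d, b + d),
        ]:
            expected = (row_total * column_total) / n
            chi2 += (observed - expected) ** 2 / expected
        return chi2

    print(round(chi_square_2x2(414, 1714029, 1214, 2592238), 1))  # 140.3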

If you want to check our calculations, you can use an online Pearson chi-square calculator. The input fields in these calculators typically take the values that concordancing programs generate: word list counts and corpus sizes.

Now the question is: What does this number tell us? Typically, we determine the significance of keyness values in one of two ways. First, sometimes corpus linguists just look at the top key words (maybe the top 20) to explore the most marked differences between two corpora. Second, we can find the p-value.

The p-value is the probability that our observed frequencies are the result of chance. To find the p-value we need to look it up on a published table (or use an online calculator). In either case, we need to know the degrees of freedom. In simple terms, the degrees of freedom are the number of observations minus one. Or, even more simply for our case, the number of rows (not including totals) in our contingency table minus one. (More precisely, for a contingency table the degrees of freedom are the number of rows minus one multiplied by the number of columns minus one.)

For us, then, df = 1

We can look up the p-value on the simplified table below, where the degree of freedom is one.

p       .1      .05     .01     .005    .001    .0001
χ²      2.71    3.84    6.63    7.88    10.83   15.14
Table 5: Chi-square distribution table showing critical χ² values for df = 1

The critical cutoff point for statistical significance is usually at p<.01 (though it can also be p<.05). So a chi-square value above 6.63 would be considered significant. Our value is 140.3, so the distribution of lovely is highly significant (p<.0001). In other words, the probability that our observed distribution arose by chance is approaching zero.
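
If you work in Python, you can also get the p-value directly from the chi-square distribution rather than from a lookup table. This is just a sketch, and it assumes SciPy is installed:

    from scipy.stats import chi2

    # Probability of a chi-square value at least this large under the
    # null hypothesis (the survival function); second argument is the
    # degrees of freedom.
    p_value = chi2.sf(140.3, 1)
    print(p_value)  # roughly 2e-32, far below the p < .0001 cutoff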

I want to leave you with a few tips, questions, and resources:

  1. Under most circumstances, our concordancing software can calculate keyness values for us; however, if we are interested in multi-word units like phrasal verbs, we can use online chi-square or log-likelihood calculators to determine their keyness.
  2. Typically, our chi-square tests in corpus linguistics will involve a 2×2 contingency table (with one degree of freedom); however, this isn’t always the case. We might be interested in, for example, distributions of multiple spellings (e.g., because, cause, cuz, coz) that would involve higher degrees of freedom (see the sketch after this list).
  3. Why would comparing larger corpora tend to produce results with larger chi-square (or keyness) values?
  4. Anatol Stefanowitsch has an excellent entry on chi-square tests here.
  5. There is a nice chi-square calculator here.
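
On point 2, here is a hedged sketch of how a larger contingency table could be tested in Python with SciPy's chi2_contingency function. The spelling counts below are invented purely to show the shape of the calculation; they are not real corpus figures:

    from scipy.stats import chi2_contingency

    # Hypothetical counts of four spellings in two corpora
    # (rows = corpora, columns = because, cause, cuz, coz).
    observed = [
        [900, 120, 40, 15],   # corpus A (invented numbers)
        [700, 300, 160, 60],  # corpus B (invented numbers)
    ]

    chi2_value, p_value, dof, expected = chi2_contingency(observed)
    print(chi2_value, p_value, dof)  # dof = (rows - 1) * (columns - 1) = 3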