## Understanding Keyness

To identify significant differences between two corpora, or between two parts of a corpus, we often use a statistical measure called *keyness*.

Imagine two highly simplified corpora. Each contains only three different words (*cat*, *dog*, and *cow*) and has a total of 100 words. The frequency counts are as follows:

- Corpus A: *cat* 52; *dog* 17; *cow* 31
- Corpus B: *cat* 9; *dog* 40; *cow* 31

*Cat* and *dog* would be key, as they are distributed differently across the corpora, but *cow* would not, as its distribution is the same. Put another way, *cat* and *dog* are distinguishing features of the corpora; *cow* is not.

Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.

There are two common methods for calculating distributional differences: a **chi-squared test** (or **χ²** test) and **log-likelihood**. AntConc, for example, gives you the option of selecting one or the other under “Tool Preferences” for the Keyword settings.

For this post, I’m going to concentrate on the chi-squared test. Lancaster University has a nice explanation of log-likelihood here.

A chi-squared test measures the significance of an observed versus an expected frequency. To see how this works, I’m going to walk through an example presented in Rayson et al. (1997) and summarized by Baker (2010).

For this example, we’ll consider the distribution by sex of the word *lovely*. Here is their data:

|         | *lovely* | All other words | Total words |
|---------|----------|-----------------|-------------|
| Males   | 414      | 1714029         | 1714443     |
| Females | 1214     | 2592238         | 2593452     |
| Total   | 1628     | 4306267         | 4307895     |

**Table 1: Use of 'lovely' by males and females in the spoken demographic section of the BNC**

What we want to determine is the statistical significance of this distribution. To do so, we’ll apply the same formula that they did: Pearson’s chi-squared test. Here is the formula:

χ² = Σ (O − E)² / E

*O* is the **observed frequency**, and *E* is the **expected frequency** if the independent variable (in this case, sex) had no effect on the distribution of the word. The Σ is the mathematical symbol for ‘sum’. In our example, we need to calculate the sum of (or add together) four values: (**observed** minus **expected**) squared, divided by **expected**, for (1) *lovely* used by males; (2) other words used by males; (3) *lovely* used by females; and (4) other words used by females.

In other words, our main calculations are for the four interior cell values; we have a 2×2 contingency table. The totals, the peripheral table values, are what we use to determine our expected frequencies.

To find the expected frequency for a value in our 2×2 table, we use the following formula:

(R * C) / N

That is, the **row total** (R) times the **column total** (C) divided by the **number of words in the corpus** (N).

The **expected** value of *lovely* for males is:

(1714443 * 1628) / 4307895 = 647.91
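The same calculation can be applied to every cell at once. Here is a minimal Python sketch using the counts from Table 1:

```python
# Observed counts from Table 1.
# Rows: males, females; columns: 'lovely', all other words.
observed = [[414, 1714029],
            [1214, 2592238]]

row_totals = [sum(row) for row in observed]        # [1714443, 2593452]
col_totals = [sum(col) for col in zip(*observed)]  # [1628, 4306267]
n = sum(row_totals)                                # 4307895

# Expected frequency for each cell: (row total * column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(round(expected[0][0], 2))  # expected 'lovely' for males -> 647.91
print(round(expected[1][0], 2))  # expected 'lovely' for females -> 980.09
```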

If we then fill out the expected frequencies for the rest of our table, it looks like this:

|         | *lovely* | All other words | Total words |
|---------|----------|-----------------|-------------|
| Males   | 647.91   | 1713795.09      | 1714443     |
| Females | 980.09   | 2592471.91      | 2593452     |
| Total   | 1628     | 4306267         | 4307895     |

**Table 2: Expected use of 'lovely' in the spoken demographic section of the BNC**

Now we can finish our calculations. For each of our table cells, we need to subtract the **expected** frequency from the **observed** frequency; multiply that value by itself; then divide the result by the **expected** frequency. The calculations for each cell would look like this:

|         | *lovely* | All other words | Total words |
|---------|----------|-----------------|-------------|
| Males   | ((414 − 647.91) × (414 − 647.91)) / 647.91 | ((1714029 − 1713795.09) × (1714029 − 1713795.09)) / 1713795.09 | 1714443 |
| Females | ((1214 − 980.09) × (1214 − 980.09)) / 980.09 | ((2592238 − 2592471.91) × (2592238 − 2592471.91)) / 2592471.91 | 2593452 |
| Total   | 1628     | 4306267         | 4307895     |

**Table 3: Calculating the chi-square values for our frequencies**

When we complete those calculations, our contingency table looks like this:

|         | *lovely* | All other words | Total words |
|---------|----------|-----------------|-------------|
| Males   | 84.45    | 0.03            | 1714443     |
| Females | 55.83    | 0.02            | 2593452     |
| Total   | 1628     | 4306267         | 4307895     |

**Table 4: Chi-square values for our frequencies**

Now we just add together those four values:

84.45 + 0.03 + 55.83 + 0.02 = 140.33
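You can verify the arithmetic with a few lines of Python. Note that summing the unrounded cell values gives 140.32; any small difference from the figure above comes from rounding the intermediate values to two decimal places.

```python
# Pearson chi-square for the 2x2 'lovely' table, computed from scratch.
observed = [[414, 1714029],   # males: 'lovely', all other words
            [1214, 2592238]]  # females: 'lovely', all other words

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        e = r * c / n             # expected frequency
        o = observed[i][j]        # observed frequency
        chi2 += (o - e) ** 2 / e  # (O - E)^2 / E

print(round(chi2, 2))  # -> 140.32
```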

If you want to check our calculations, you can use an online chi-square calculator. Most are easy to use because they take values that concordancing programs typically generate: word list counts and corpus sizes.

Now the question is: What does this number tell us? Typically, we determine the significance of keyness values in one of two ways. First, sometimes corpus linguists just look at the top key words (maybe the top 20) to explore the most marked differences between two corpora. Second, we can find the **p-value**.

The p-value is the probability that our observed frequencies are the result of chance. To find the p-value, we need to look it up in a published table (or use an online calculator). In either case, we need to know the degrees of freedom. For a contingency table, the degrees of freedom are the number of rows minus one, multiplied by the number of columns minus one (not counting the totals). For our 2×2 table, that’s (2 − 1) × (2 − 1).

For us, then, *df* = 1

We can look up the p-value in the simplified table below, which gives the critical chi-square values for one degree of freedom.

| p < .1 | p < .05 | p < .01 | p < .005 | p < .001 | p < .0001 |
|--------|---------|---------|----------|----------|-----------|
| 2.71   | 3.84    | 6.63    | 7.88     | 10.83    | 15.14     |

**Table 5: Chi-square distribution table showing critical values for *df* = 1**

The critical cutoff point for statistical significance is usually p < .01 (though it can also be p < .05). So a chi-square value above 6.63 would be considered significant. Our value is 140.33, so the distribution of *lovely* is highly significant (p < .0001). In other words, the probability that our observed distribution occurred by chance is approaching zero.
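If you would rather compute the p-value than look it up, the *df* = 1 case has a closed form: a chi-square variable with one degree of freedom is a squared standard normal, so P(X > x) = erfc(√(x/2)). A short Python sketch using only the standard library:

```python
import math

def chi2_pvalue_df1(x):
    """P-value (survival function) for a chi-square statistic with df = 1."""
    # With one degree of freedom, X is a squared standard normal,
    # so P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2))

# The critical values from Table 5 recover the p-value thresholds:
print(round(chi2_pvalue_df1(3.84), 3))   # ~0.05
print(round(chi2_pvalue_df1(6.63), 3))   # ~0.01

# Our chi-square statistic is far beyond the p < .0001 cutoff:
print(chi2_pvalue_df1(140.32) < 0.0001)  # True
```

For higher degrees of freedom there is no such simple identity, which is where a published table or a statistics library comes in.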

I want to leave you with a few tips, questions, and resources:

- Under most circumstances, our concordancing software can calculate keyness values for us; however, if we are interested in multi-word units like phrasal verbs, we can use online chi-square or log-likelihood calculators to determine their keyness.
- Typically, our chi-square tests in corpus linguistics will involve a 2×2 contingency table (with one degree of freedom); however, this isn’t always the case. We might be interested in, for example, distributions of multiple spellings (e.g., *because*, *cause*, *cuz*, *coz*) that would involve higher degrees of freedom.
- Why would comparing larger corpora tend to produce results with larger chi-square (or keyness) values?
- Anatol Stefanowitsch has an excellent entry on chi-square tests here.
- There is a nice chi-square calculator here.
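For tables larger than 2×2, the same Pearson formula applies cell by cell. The sketch below generalizes the calculation to any r × c table; the spelling counts are invented purely for illustration and are not taken from any real corpus:

```python
def pearson_chi2(observed):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (observed[i][j] - r * c / n) ** 2 / (r * c / n)
        for i, r in enumerate(row_totals)
        for j, c in enumerate(col_totals)
    )

# Hypothetical counts for four spellings in two corpora.
# Columns: because, cause, cuz, coz
spellings = [[900, 40, 25, 10],   # corpus A
             [700, 90, 60, 30]]   # corpus B

chi2 = pearson_chi2(spellings)
# df = (rows - 1) * (columns - 1) = 1 * 3 = 3, so the critical
# values in Table 5 (which assume df = 1) no longer apply.
```

Applied to the 2×2 *lovely* table from earlier, `pearson_chi2` returns the same statistic of about 140.32.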