## Understanding Keyness


In order to identify significant differences between 2 corpora or 2 parts of a corpus, we often use a statistical measure called *keyness*.

Before we get to any mathematical explanations, here are a few important things to keep in mind:

- Whether calculated by log-likelihood or a chi-squared test, keyness is a significance measure, meaning that it tests the amount of evidence we have for the existence of an effect. It **does not** tell us the size of that effect.
- Chi-squared tests and log-likelihood tests are sensitive to the size of the corpora we are testing (our sample size). In essence, when we use really large corpora, we are likely to generate a lot of keywords with high keyness values. As Gries (2010) observes, “[M]any contemporary corpora basically guarantee that even minuscule effects will be highly significant.”
- For these reasons, it is important to recognize what keyness values **aren’t** telling us and to use effect sizes in combination with keyness values (particularly as your familiarity with corpus techniques advances). I am partial to Andrew Hardie’s **log ratio**. He has a nice explanation here. There are, however, other measures like Effect Size for Log-Likelihood (ELL), Odds Ratio, and Relative Risk. Paul Rayson at Lancaster has created a downloadable Excel spreadsheet with these functions.
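As a sketch of how such an effect-size measure works, here is Hardie's Log Ratio in Python, assuming the standard definition (the binary log of the ratio of the two relative frequencies); the function and variable names are my own, and in practice zero frequencies need smoothing before this can be applied:

```python
import math

def log_ratio(freq_a, size_a, freq_b, size_b):
    """Hardie's Log Ratio: binary log of the ratio of relative frequencies.
    A value of 1.0 means the word is twice as frequent in corpus A as in B;
    0.0 means the relative frequencies are identical."""
    rel_a = freq_a / size_a
    rel_b = freq_b / size_b
    return math.log2(rel_a / rel_b)

# 'cat' in the toy corpora below: 52 per 100 words vs 9 per 100 words
print(round(log_ratio(52, 100, 9, 100), 2))  # ~2.53: 'cat' is about 5.8x more frequent in A
```

Unlike a keyness value, this number does not grow just because the corpora get bigger; it describes only the magnitude of the frequency difference.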

Imagine two highly simplified corpora. Each contains only 3 different words (*cat*, *dog*, and *cow*) and has a total of 100 words. The frequency counts are as follows:

- Corpus A: *cat* 52; *dog* 17; *cow* 31
- Corpus B: *cat* 9; *dog* 40; *cow* 31

*Cat* and *dog* would be key, as they are distributed differently across the corpora, but *cow* would not, as its distribution is the same. Put another way, *cat* and *dog* are distinguishing features of the corpora; *cow* is not.
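That intuition can be checked with a quick sketch in Python: a 2×2 chi-squared test for each word against the rest of its corpus, using the same observed-versus-expected logic developed below (the helper name is my own):

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared for a 2x2 table [[a, b], [c, d]],
    without continuity correction. Each cell contributes (O - E)^2 / E,
    where E = row total * column total / grand total."""
    n = a + b + c + d
    chi = 0.0
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        exp = row * col / n
        chi += (obs - exp) ** 2 / exp
    return chi

# Rows: word freq and all-other-words in Corpus A; same for Corpus B.
# 'cat' and 'dog' come out well above the 3.84 critical value; 'cow' is 0.0.
for word, fa, fb in [("cat", 52, 9), ("dog", 17, 40), ("cow", 31, 31)]:
    print(word, round(chi_squared_2x2(fa, 100 - fa, fb, 100 - fb), 2))
```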

Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.

There are 2 common methods for calculating distributional differences: a **chi-squared test** (or **χ²** test) and **log-likelihood**. AntConc, for example, gives you the option of selecting one or the other under “Tool Preferences” for Keyword settings.

For this post, I’m going to concentrate on the chi-squared test. Lancaster University has a nice explanation of log-likelihood here.

A chi-squared test measures the significance of an observed versus an expected frequency. To see how this works, I’m going to walk through an example from Rayson et al. (1997), summarized by Baker (2010).

For this example, we’ll consider the distribution by sex of the word *lovely*. Here is their data:

|         | lovely | All other words | Total words |
|---------|--------|-----------------|-------------|
| Males   | 414    | 1714029         | 1714443     |
| Females | 1214   | 2592238         | 2593452     |
| Total   | 1628   | 4306267         | 4307895     |

**Table 1: Use of 'lovely' by males and females in the spoken demographic section of the BNC**

What we want to determine is the statistical significance of this distribution. To do so, we’ll apply the same formula that they did: a Pearson’s chi-squared test. Here is the formula:

χ² = Σ ( (O − E)² / E )

*O* is the **observed frequency**, and *E* is the **expected frequency** if the independent variable (in this case sex) had no effect on the distribution of the word. The Σ is the mathematical symbol for ‘sum’. In our example, we need to calculate the sum of (or add together) four values: **observed** minus **expected**, squared, divided by **expected**, for (1) *lovely* used by males; (2) other words used by males; (3) *lovely* used by females; and (4) other words used by females.

In other words, our main calculations are for the four inner cell values; we have a 2×2 contingency table. The totals, the peripheral table values, are what we use to determine our expected frequencies.

To find the expected frequency for a value in our 2×2 table, we use the following formula:

(R * C) / N

That is, the **row total** (R) times the **column total** (C) divided by the **number of words in the corpus** (N).

The **expected** value of lovely for males is:

(1714443 * 1628) / 4307895 = 647.91

If we then fill out the expected frequencies for the rest of our table, it looks like this:

|         | lovely | All other words | Total words |
|---------|--------|-----------------|-------------|
| Males   | 647.91 | 1713795.09      | 1714443     |
| Females | 980.09 | 2592471.91      | 2593452     |
| Total   | 1628   | 4306267         | 4307895     |

**Table 2: Expected use of 'lovely' in the spoken demographic section of the BNC**
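The expected-frequency step is easy to check in code. Here is a minimal sketch (the function and variable names are mine):

```python
def expected(row_total, col_total, n):
    """Expected cell frequency under independence: (R * C) / N."""
    return row_total * col_total / n

N = 4307895  # total words in the corpus

print(round(expected(1714443, 1628, N), 2))  # expected 'lovely' for males: 647.91
print(round(expected(2593452, 1628, N), 2))  # expected 'lovely' for females: 980.09
```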

Now we can finish our calculations. For each of our table cells, we need to subtract the **expected** frequency from the **observed** frequency; multiply that value by itself; then divide the result by the **expected** frequency. The calculations for each cell would look like this:

|         | lovely | All other words | Total words |
|---------|--------|-----------------|-------------|
| Males   | ((414 - 647.91) * (414 - 647.91)) / 647.91 | ((1714029 - 1713795.09) * (1714029 - 1713795.09)) / 1713795.09 | 1714443 |
| Females | ((1214 - 980.09) * (1214 - 980.09)) / 980.09 | ((2592238 - 2592471.91) * (2592238 - 2592471.91)) / 2592471.91 | 2593452 |
| Total   | 1628 | 4306267 | 4307895 |

**Table 3: Calculating the chi-square values for our frequencies**

When we complete those calculations, our contingency table looks like this:

|         | lovely | All other words | Total words |
|---------|--------|-----------------|-------------|
| Males   | 84.45  | 0.03            | 1714443     |
| Females | 55.83  | 0.02            | 2593452     |
| Total   | 1628   | 4306267         | 4307895     |

**Table 4: Chi-square values for our frequencies**

Now we just add together those four values:

84.45 + 0.03 + 55.83 + 0.02 ≈ 140.3
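The whole calculation can be reproduced in a few lines of Python using only the standard library (the function and variable names are mine):

```python
def chi_squared(freq_a, total_a, freq_b, total_b):
    """Pearson chi-squared for a word's frequency in two (sub)corpora.
    Builds the 2x2 table [word, all other words] x [corpus A, corpus B]
    and sums (O - E)^2 / E over its four cells."""
    n = total_a + total_b
    word_total = freq_a + freq_b
    chi = 0.0
    for obs, row, col in [
        (freq_a, total_a, word_total),                # 'lovely' / males
        (total_a - freq_a, total_a, n - word_total),  # other words / males
        (freq_b, total_b, word_total),                # 'lovely' / females
        (total_b - freq_b, total_b, n - word_total),  # other words / females
    ]:
        exp = row * col / n
        chi += (obs - exp) ** 2 / exp
    return chi

print(round(chi_squared(414, 1714443, 1214, 2593452), 1))  # ~140.3
```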

If you want to check our calculations, you can use an online chi-square calculator. Most make this easy, as they take the values that concordancing programs typically generate: word-list counts and corpus sizes.

Now the question is: What does this tell us? Sometimes corpus linguists just look at the top key words (maybe the top 20). Alternatively, we might be interested in keywords from a particular lexical class (like modal verbs or pronouns), words that have a shared rhetorical function (like hedges or boosters), or from a lexical field (like words related to the body).

As for the number itself, we need to pay attention to the **p-value**.

The p-value is the probability that our observed frequencies are the result of chance. To find the p-value we need to look it up on a published table (or use an online calculator). In either case, we need to know the degrees of freedom. For a contingency table like ours, the degrees of freedom are the number of rows minus one, times the number of columns minus one (not counting the totals).

For us, then, *df* = 1

We can look up the p-value on the simplified table below, where the degree of freedom is one.

| p-value        | .1   | .05  | .01  | .005 | .001  | .0001 |
|----------------|------|------|------|------|-------|-------|
| critical value | 2.71 | 3.84 | 6.63 | 7.88 | 10.83 | 15.14 |

**Table 5: Chi-square distribution table showing critical values for *df* = 1**

The critical cutoff point for statistical significance is usually p < 0.05 (though it can also be p < .01). So a chi-square value above 3.84 would be considered significant. Our value is 140.3, so the distribution of *lovely* is significant (p < 0.0001). In other words, the probability that our observed distribution arose by chance is approaching zero.

The plot below illustrates how p-values are calculated. The keyness values are along the x-axis. Probabilities (for *df* = 1) are along the y-axis. The p-values are derived from the area under the curve at specific points. At 3.84, the area under the probability curve (shaded in red) represents 5% of the total area under that curve. So the p-value = 0.05 at that point. At 6.63, the area under the curve is 1% of the total. So the p-value = 0.01. And so on. Our value for *lovely* is 140.3, so it is way off to the right along the x-axis. The area under the curve at that point is approaching 0%.
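For *df* = 1, that area under the curve has a closed form, so you can check the table's critical values yourself. Here is a sketch using only the standard library (for other degrees of freedom you would want something like `scipy.stats.chi2.sf`):

```python
import math

def chi2_pvalue_df1(x):
    """Upper-tail p-value of the chi-square distribution with df = 1.
    For df = 1, P(X >= x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2))

print(round(chi2_pvalue_df1(3.84), 3))   # ~0.05
print(round(chi2_pvalue_df1(6.63), 3))   # ~0.01
print(chi2_pvalue_df1(140.3) < 0.0001)   # True: effectively zero
```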

I want to leave you with a few tips and resources:

- Under most circumstances, our concordancing software can calculate keyness values for us; however, if we are interested in multi-word units like phrasal verbs, we can use online chi-square or log-likelihood calculators to determine their keyness or use other tools like R.
- Typically, our chi-square tests (or log-likelihood tests) in corpus linguistics will have one degree of freedom. We are typically comparing one corpus (our target corpus) to another (our reference corpus), and this process is a built-in function of the most popular concordancers. However, we can (and do) make comparisons with higher degrees of freedom. Remember that the degrees of freedom change the p-value thresholds.
- There is a nice chi-square calculator here.