Corpus of Presidential Speeches (CoPS) and a Clinton/Trump Corpus

August 21, 2016

In this election season, both linguists and the popular press have shown an interest in candidates’ speaking styles, particularly Donald Trump’s (examples here, here, here, and here). In light of that interest, I have compiled a couple of corpora for instructors and students to use.

The first is a historical corpus. It contains data scraped from archives of speeches made by U.S. Presidents from George Washington to Barack Obama. The total word count for the corpus is about 3.5 million words. The files are in folders organized by President, making it relatively easy to do comparisons of individuals, by party, or by period. The precise breakdown by President is included in the table below, and there is a link for you to download the corpus at the end of this post.

Num. | President | Term | Word Count
01 | George Washington | 1789-1797 | 31643
02 | John Adams | 1797-1801 | 14672
03 | Thomas Jefferson | 1801-1809 | 40149
04 | James Madison | 1809-1817 | 36049
05 | James Monroe | 1817-1825 | 49960
06 | John Quincy Adams | 1825-1829 | 36472
07 | Andrew Jackson | 1829-1837 | 157535
08 | Martin Van Buren | 1837-1841 | 64747
09 | William Henry Harrison | 1841 | 8465
10 | John Tyler | 1841-1845 | 69471
11 | James K. Polk | 1845-1849 | 104267
12 | Zachary Taylor | 1849-1850 | 11368
13 | Millard Fillmore | 1850-1853 | 39392
14 | Franklin Pierce | 1853-1857 | 63448
15 | James Buchanan | 1857-1861 | 80883
16 | Abraham Lincoln | 1861-1865 | 95643
17 | Andrew Johnson | 1865-1869 | 98806
18 | Ulysses S. Grant | 1869-1877 | 103060
19 | Rutherford B. Hayes | 1877-1881 | 67474
20 | James A. Garfield | 1881 | 2980
21 | Chester Arthur | 1881-1885 | 49590
22, 24 | Grover Cleveland | 1885-1889, 1893-1897 | 155553
23 | Benjamin Harrison | 1889-1893 | 76363
25 | William McKinley | 1897-1901 | 92318
26 | Theodore Roosevelt | 1901-1909 | 196692
27 | William Howard Taft | 1909-1913 | 117594
28 | Woodrow Wilson | 1913-1921 | 80123
29 | Warren G. Harding | 1921-1923 | 28752
30 | Calvin Coolidge | 1923-1929 | 74333
31 | Herbert Hoover | 1929-1933 | 87888
32 | Franklin D. Roosevelt | 1933-1945 | 132082
33 | Harry S. Truman | 1945-1953 | 36954
34 | Dwight D. Eisenhower | 1953-1961 | 18097
35 | John F. Kennedy | 1961-1963 | 136196
36 | Lyndon B. Johnson | 1963-1969 | 231949
37 | Richard Nixon | 1969-1974 | 66482
38 | Gerald Ford | 1974-1977 | 40446
39 | Jimmy Carter | 1977-1981 | 70388
40 | Ronald Reagan | 1981-1989 | 196553
41 | George Bush | 1989-1993 | 71160
42 | Bill Clinton | 1993-2001 | 144580
43 | George W. Bush | 2001-2009 | 107737
44 | Barack Obama | 2009-2016 | 199211
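If you want to verify totals like these, or recompute them for a subset of the corpus, a short script can walk the per-President folders. Here is a minimal Python sketch; the corpus root, a one-folder-per-President layout, and plain .txt files are assumptions, and the naive whitespace tokenization means the numbers won't match the table exactly:

```python
import os

def word_counts_by_president(corpus_root):
    """Total the whitespace-separated words in every .txt file,
    grouped by the President folder that contains them."""
    counts = {}
    for president in sorted(os.listdir(corpus_root)):
        folder = os.path.join(corpus_root, president)
        if not os.path.isdir(folder):
            continue
        total = 0
        for name in os.listdir(folder):
            if name.endswith(".txt"):
                with open(os.path.join(folder, name), encoding="utf-8") as f:
                    total += len(f.read().split())
        counts[president] = total
    return counts
```

Summing the returned dictionary's values gives the overall corpus size; comparing subsets of folders gives quick by-party or by-period totals.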

If you are planning on using this corpus, take note of a few things. First, each file has a heading, like this one:

<title="State of the Union Address">
<date="January 23, 1979">

As in the heading, angle brackets are used to isolate the speech of the president named in each file. For example, this excerpt comes from a debate between George Bush and Michael Dukakis, moderated by Jim Lehrer:

<LEHRER: Mr. Vice President, a rebuttal.>
<BUSH:> Well, I don’t question his passion. I question — and I don’t question his concern about the war in Vietnam.

This sort of “tagging” was accomplished by a simple script and may not be 100% accurate. So be sure to continually check your data. It also means that if you are using a concordancer like AntConc, you will need to set the “Tags” option under “Global Settings” to “Hide tags.”
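Alternatively, if your tool of choice doesn't have a tag-hiding option, you can strip the angle-bracketed material yourself before analysis. A minimal Python sketch (assuming, as in the format shown above, that tags never nest):

```python
import re

def strip_tags(text):
    """Remove angle-bracketed spans: file headers and the tagged
    turns of speakers other than the file's President."""
    return re.sub(r"<[^>]*>", "", text)
```

Run on the debate excerpt above, this leaves only Bush's own words, which is exactly what you want for a per-President frequency count.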


I did this so that the context of some of these communicative events is preserved. But don’t forget to set your tag option; otherwise, your data will be quite skewed.

It is also important to be aware that this corpus is fine for learning purposes and for exploratory analysis. However, if you plan on doing more statistically rigorous work, you will need to account for a few things. First, as with all historical corpora, the earlier stuff comes from written texts. (We, of course, don’t have any recordings of George Washington.) But for the later Presidents, the corpus includes things like transcripts from debates and other more interactional occasions. So if you want to do some comparisons from different time periods, you’ll want to think through how register might influence (or not) the results.

Also, I assembled this corpus quickly. If you look through the table above, you’ll see that there are differing numbers of words for different Presidents. Some of this is a byproduct of how long Presidents served in office. (William Henry Harrison served just a month, and Franklin Roosevelt served just a little over 12 years.) However, some Presidents just had more data available than others, irrespective of time. I have not attempted to compensate for these imbalances. Again, if you want to produce more statistically rigorous work, you may need to address this issue, depending upon what exactly you’re looking at.
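One standard way to handle those size differences is to compare normalized rates (hits per million words) rather than raw counts. A quick sketch, with invented hit counts purely for illustration:

```python
def per_million(raw_count, corpus_size):
    """Convert a raw frequency into a rate per million words."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical: 150 hits in Washington's 31,643 words vs. 900 hits in
# Obama's 199,211 words -- the normalized rates are directly comparable.
washington_rate = per_million(150, 31643)
obama_rate = per_million(900, 199211)
```

Normalization makes sub-corpora of different sizes comparable, though it does not address the register differences discussed above.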

The second corpus I’m releasing is a collection of speeches delivered at campaign events by Hillary Clinton and Donald Trump, beginning with their acceptance speeches at their respective party conventions. As with CoPS, this one uses angle brackets in headers and as tags, so be sure to set your preferences accordingly.

Because not much time has elapsed since the conventions, the corpus is still relatively small (about 30,000 words from Clinton and 125,000 words from Trump). But check back, because I will be updating the corpus between now and the election.

Good luck and enjoy.

Can we quantify “bad” writing?

July 27, 2016

In one of my classes, students were talking about Sean Penn’s article in which he interviewed the Mexican drug lord and escaped convict Joaquín Guzmán. For those unfamiliar with the piece, it was widely ridiculed after its publication. One target of that ridicule was the style of the writing itself. As one outlet put it:

If you know only one thing about the Rolling Stone article itself, it should be this: The writing is terrible. Penn apparently fancies himself some combination of Bob Woodward and Hunter S. Thompson; he’ll tell you he farted when El Chapo shook his hand, but he calls it “expel[ling] a minor traveler’s flatulence.” It is the sort of self-serious dreck that no one but a celebrity could publish in a major American magazine, and that always inspires self-righteous grumbling from bona fide professional journalists.

So one of my students posed the question: If we did some basic corpus analysis, could we tell the writing is bad? This is a really interesting question. So I put together a small corpus for the purposes of comparison. Genre matters, of course. Comparing Penn’s article to famous works of fiction, for example, won’t tell us much, since the conventions of fiction are different from the conventions of magazine journalism. We need to compare apples to apples. I found a list of recommended pieces of journalism that included a set of celebrity profiles. Admittedly, this is hardly an unimpeachably objective list. But for the purposes of our small experiment, it works as a functional benchmark.

From that list, I put together the corpus below. (Note there is an additional folder with versions of the files with the quotations extracted, for checking their potential effect on the data.)

When we ran a keyword analysis with Penn’s article as the target corpus and the other profiles as the reference corpus, the pattern that jumped out was Penn’s relative overuse of first-person pronouns.


Penn spends a good portion of his narrative describing his own thoughts and actions, or the combined actions of himself and a companion. This is clear in the following passage (first-person pronouns in bold):

We quietly make our plans, sensitive to the paradox that also in our hotel is President Enrique Peña Nieto of Mexico. Espinoza and I leave the room to get outside the hotel, breathe in the fall air and walk the five blocks to a Japanese restaurant, where we’ll meet up with our colleague El Alto Garcia. As we exit onto 55th Street, the sidewalk is lined with the armored SUVs that will transport the president of Mexico to the General Assembly. Paradoxical indeed, as one among his detail asks if I will take a selfie with him. Flash frame: myself and a six-foot, ear-pieced Mexican security operator.

Alternatively, if we use the profiles as the target and Penn’s article as the reference, we find a very different pattern.


Penn’s article appears to significantly under-use third-person pronouns, as well as what are called reporting verbs, particularly the most common verb of journalistic attribution, said. (The gender skew of these pronouns is also eyebrow-raising.)

The keywords, therefore, reveal an unconventional focus: away from the words and actions of the purported profile subject and toward the author himself. This doesn’t necessarily measure “badness.” However, it certainly points to a piece of writing that significantly strays from the norms of its genre.

Building a Tweet Sentiment Corpus

May 13, 2014

I’ve had quite a few students who are interested in using Twitter data for corpus analysis. I understand why. Twitter is widespread, and it’s a relatively new communicative technology. New technologies have often been drivers of language change. So this is an inviting source of data.

But even if we decide, “Yes! Twitter is for me!” collecting the data and putting it into a format that is friendly for commonly used tools like AntConc presents some very real challenges.

Given both the possibilities and the potential hurdles, I thought I’d walk through a coffee break experiment I did this morning. First, I want to share the results I generated and what I think they tell us. Then, I’ll show you how I got there.

One of the areas of corpus research that has really taken off lately is sentiment analysis of social media data (like Twitter). Sentiment analysis attempts to measure attitudes: How do we feel about that movie? Was it good? Lousy? To do so, a positive or negative score is assigned to a text. And it’s easy to understand why this kind of analysis is popular. It provides people with an interest in how things are being talked about in the wider world (like marketers) a number with an air of scientific objectivity that says, “Hey, this is getting great buzz!” or maybe, “We had better rethink this…”

So I thought I would give this a try (again, more about the specifics in a minute). Given all of the news regarding the draft of Michael Sam, the first openly gay player to be drafted by the NFL, I thought that I would build a corpus of tweets containing the hashtag #MichaelSam and apply sentiment analysis. These are my results.


The technique I used assigns a score of 1 to 5 for positive sentiment (5 being the most positive) and -1 to -5 for negative sentiment (-5 being the most negative). What I got was this rather perfect bell curve. I was a bit surprised by the number of neutral tweets, but many of them are like this one:

2014 #NFL Draft: Decision day for Michael Sam

But others that scored a 0 were like these:

Men kiss men every day, all over the world.  Get. Over. It. #MichaelSam

I’m Not Forcing You To Be A Christian Don’t Force Me To Be Okay With Homosexuality Stop Shoving it down People’s Throats #MichaelSam #ESPN

These examples raise some interesting questions about how sentiment analysis works.

The particular technique I used is lexical (as most sentiment analysis is, though some researchers are working out alternatives). It applies lists of positive and negative words to the texts in the corpus and calculates a score for each. This raises a number of problems.
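Before turning to those problems, it helps to see how simple the core mechanism is. A minimal Python sketch of lexicon-based scoring (the word lists here are tiny stand-ins for the full opinion lexicon):

```python
def sentiment_score(text, pos_words, neg_words):
    """Count positive word matches minus negative word matches."""
    words = text.lower().split()
    return sum(w in pos_words for w in words) - sum(w in neg_words for w in words)

# Tiny stand-in lexicons; the real opinion lexicon has thousands of entries.
pos = {"good", "great", "love"}
neg = {"bad", "hate", "terrible"}

print(sentiment_score("what a great great day", pos, neg))   # 2
print(sentiment_score("that was bad really bad", pos, neg))  # -2
```

Notice that the scorer has no idea what bad means in context, who the negativity is aimed at, or what punctuation is doing, which is exactly where the problems below come from.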

The first is polysemy. Words, of course, have multiple meanings and treating those meanings as a unified class can erase significant ambiguities. This is particularly true with contranyms (words that have opposite meanings). Contranyms are common in slang, where negative words (like bad) take on new, positive associations. This is one of my favorite examples from the Guardian.


Another issue arises from the creative respelling and punctuation that are important in technologically-mediated communication. In one of the examples above the author’s stance is communicated through the phrase “get over it,” and that stance is amplified by the insertion of periods between each word. That emphasis is going to be overlooked if we are only measuring words.

Finally, there is the difficulty of locating the object of sentiment. While a tweet may express a negative sentiment, and that tweet may contain #MichaelSam, that negative sentiment may be directed at something else. Here are a couple of examples:

People used to discriminate blacks and now people want to discriminate gays. Everyone’s welcome in the future. Just let it be. #MichaelSam

Why are parents struggling to explain #MichaelSam to their kids? In 2014 are we that afraid of homosexuality still? #GrowUpAmerica

In both of these, it’s easy to see which words are being counted as negative. However, the negativity is directed at a perceived discriminatory culture rather than at Michael Sam or the events at the draft. In that way, we might be inclined to classify them as positive rather than negative.

Sentiment analysis clearly presents challenges if you’re interested in this kind of work.

Now, how did I build my corpus? It’s not too difficult, but you’ll need to use R — a programming language that is often used in corpus linguistics. The good news is that you DO NOT need to know how to write code. You just need to follow the steps and make use of resources that have already been developed.

I based my analysis on resources provided by the Information Research and Analysis Lab at the University of North Texas. The link has tutorial videos and an R script that you can download.

You will also need RStudio (which is a programming interface), the opinion lexicon from the University of Illinois at Chicago, and you will need to sign up with the Twitter API. This last bit may sound technical, but it isn’t really. Twitter doesn’t let unidentified users grab its data the way it used to. You just need to sign in with a Twitter account and click a button to create an App. (You can call it anything; the name doesn’t matter.) Then you are provided keys, a series of letters and numbers that give you access to the Twitter stream. Don’t worry; the video tutorials will walk you through the process.

Finally, a couple of tips if you’ve never seen or worked with code before. First, the videos will have you assemble three corpora about baseball. A snippet of the code looks like this:

Rangers.list <- searchTwitter('#Rangers', n=1000, cainfo="cacert.pem")
Rangers.df = twListToDF(Rangers.list)
write.csv(Rangers.df, file='C:/temp/RangersTweets.csv', row.names=F)

See #Rangers between the single quotes in line 1? That is telling your code to search the Twitter stream for tweets with that hashtag. Once you go through the process and get the hang of it, you can change what’s between the quotation marks to whatever you want. Be aware, however, that you must have something there. That argument is required.

Also, you can’t just run the code without doing anything. You will need to tell R where to find things on your computer and where to put them. This requires defining what is called the “path.” The path just identifies the location of a file. You can see a path in line 3 above between the single quotation marks after “file=”. You can also see it in the example below, in lines 2 and 3, between the single quotation marks starting with “C:/”.

#Load sentiment word lists
hu.liu.pos = scan('C:/temp/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('C:/temp/negative-words.txt', what='character', comment.char=';')

You are going to have to edit these paths in order to successfully execute the code. Fortunately, this is easy. To find the location of a file on Windows, right-click on the file and select “Properties.” The path is listed there, and I’ve highlighted it in yellow.


On a Mac, control+click on the file and select “Get Information.” Again, I’ve highlighted the path in yellow.


Note that with Windows, the path (as long as the file is on your hard drive) will start with “C:”. That is just the letter designation that Windows assigns to your main drive. On a Mac, this is different. There is no letter.

In either case, you can just copy the path and paste it into the code. The copied path, however, will not contain the name of the file itself. So you will need to type in a forward slash (/) and the name of the file: RangersTweets.csv, positive-words.txt, or whatever.

Otherwise, you should be able to just follow along with the tutorial. If something gets messed up, you can re-download the code and start fresh.


What Counts as “Good” Writing in High School?

May 9, 2014

The question posed for this post is not an easy one to answer. But I want to propose an at least partial answer, based on some data that a colleague (Laura Aull at Wake Forest) and I have been working on. The data come from Advanced Placement Literature and Advanced Placement Language exams. Altogether, we have about 400 essays (~300 Language essays and ~100 Literature essays) totaling about 150,000 words.

We digitized the essays and used a tagger to code them for part-of-speech. After tagging the corpus, we ran some statistical analyses on some of the differences between high scoring exams (those receiving a score of 4 or 5) and low scoring exams (those receiving a 1 or 2). I’ve put some of the results in the data visualization below. If you use the mouse to hover over the noun and verb bubbles, you will see their frequencies and their statistical significance.
[Interactive visualization: AP Bubble Chart]

Okay. So what are we looking at? The parts-of-speech I’ve put on the graph (the preposition of, pronouns, adjectives, articles, verbs, and nouns) are all unequally distributed in high and low scoring exams. In other words, some are more frequent in high scoring exams; some more frequent in lower scoring exams.

The x-axis shows the frequency of the part-of-speech  in high scoring exams, and the y-axis shows the frequency in low scoring exams. The red bubbles are features that distinguish low scoring exams. The blue bubbles are features that distinguish high scoring exams.

For example, nouns are more common in higher scoring exams than in lower scoring ones. By contrast, verbs are more common in lower scoring exams.

Statistical significance is measured by a chi-square test. The greater the significance, the larger the bubble. Be aware, though, that all of the features shown on this graph are statistically significant. So even though the noun bubble looks relatively small, nouns still reliably differentiate higher scoring exams from lower scoring ones.

Now that we have a handle on what the graph is showing, what does it mean? What does it tell us about what distinguishes writing that gets rewarded on AP exams from writing that doesn’t?

One really interesting pattern is how these features follow the patterns of involved vs. informational production that Biber has described. Basically, involved language is less formal and more conversation-like. It is about personal interaction and often relies on a shared context. In that kind of discourse, we tend to be less precise; clauses tend to be shorter; and we use lots of pronouns: “That’s cool, am I right?”

Informational production is very different. It tends to be more carefully planned and, rather than promoting personal interaction, its purpose is to explain and analyze. This is what we often think of when we imagine academic prose: “This juxtaposition of radically different approaches in one book anticipated the coexistence of Naturalism and Realism in American literature of the latter half of the 19th century and the beginning of the 20th century.”

Here is Biber’s list of features for the two dimensions. The red highlights are the features that are more common in lower scoring exams, and the blue highlights the ones more common in higher scoring exams:


The patterns are very clear. The lower scoring exams look much more like involved discourse and the high scoring ones more like informational discourse.

This is not entirely a surprise. Mary Schleppegrell, for example, has argued repeatedly that information management is important to successful academic writing for students. What she means is that students need to control how they grammatically put together their ideas.

Let’s take a look at a sample sentence from a high scoring exam:


In the example, the head nouns are all in red. Information that is added before the noun (like the adjective simple) is in green. Information that is added after the noun (like the prepositional phrase of the cake) is in blue. The pronoun they and the predicative adjective simple are in orange.

Information is communicated through nouns, and is elaborated using adjectives, prepositional phrases, and other structures. This is why we see these features to a much greater degree in those higher scoring essays. In the example, the student is able to communicate a lot of information! That information connects specific textual features from the short story (capitalization and simple description) to the student’s reading of a character’s emotional state.

We can also begin to see why there might be more verbs in the lower scoring essays. The subjects and objects of their clauses are less elaborated; in other words, their clauses are shorter. So the frequency of verbs is much higher.

Another interesting result from our research is the relationship between essay length and score. This is an issue that has received some coverage particularly regarding the (soon to be retired) essay portion of the SAT. We found that there is a correlation between essay length and score (0.255), but it is not a particularly strong one. My guess is that this correlation arises not only because the more successful essays have more to say, but also because they say it in a more elaborated way.
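For readers who want to check a correlation like this on their own data, Pearson's r can be computed directly. A minimal Python sketch; the essay lengths and scores below are invented purely for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical word counts and scores for five essays:
lengths = [210, 340, 415, 500, 620]
scores = [1, 2, 3, 4, 5]
r = pearson_r(lengths, scores)
```

Values near 0 indicate no linear relationship; a value like our 0.255 is positive but weak, which is the point being made here.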

If you are interested, you can see all of our data here.


November 13, 2013

Radiolab recently had an interesting interview with Daniel Engber. In it, Engber discusses the rise and fall of quicksand as a device in movies. (You can also read his Slate article on the same subject.) His data show a clear peak in the percentage of movies using quicksand in the 1960s (chart via Radiolab):


Engber suggests that the decline in quicksand in movies coincides with a decline in children’s fear of it. It has lost its allure as an image of terror. I was curious, then, if a similar pattern is evident in language. COCA shows the following distribution for quicksand over time:


The data seem to show an earlier peak (in the 1900s and 1910s), followed by a dip and then a steady decline after 1940. One potentially complicating factor here is that quicksand is a relatively rare word, which could affect some of the fluctuations. So I thought this would be a good opportunity to try out BYU’s version of the Google Books data, which Mark Davies discussed at this year’s Studies in the History of the English Language conference. The results are similar, but with a more clearly defined rise and fall:


As with the COCA distribution, the Google data suggest a peak in language use that precedes the peak in movies. The data also show a move toward metaphorization in recent decades. Early in the twentieth century, examples are typically like this one:

There was a ford directly opposite the cantonment, and another, more dangerous, and known to only a few, three miles farther up stream. Keeping well within the water’s edge, so as to thus completely obscure their trail, yet not daring to venture deep for fear of striking quicksand, the plainsman sent his pony struggling forward, until the dim outline of the bank at his right rendered him confident that they had attained the proper point for crossing.

Later, however, we find examples like this one:

Only their husbands believed the kids were their own. He missed his mother, too, whose quicksand love he’d wanted so badly to escape.

Or this one:

It may well be that in responding to recent Congressional language the N.E.A. has begun to have a chilling effect on art in the United States and it may be entering the quicksand of censorship.

Not that quicksand as a physical entity disappears completely in more recent discourse:

Georgie was not popular in the swamp; in fact, the other members of the colony that inhabited this stretch of muck and quicksand, black water and scum, had banished him to the very fringe of the community.

While such examples exist, they are far fewer than the type in which some physical sensation is compared to moving through quicksand, or something (like love or censorship) is compared to the ensnaring effects of quicksand itself. Thus, the use of quicksand is not only declining; it is also undergoing a shift in meaning.

Which brings us back to our earlier question: Why does quicksand peak earlier in language than in film? One possibility might be related to the relative durability of particular genres in different media. Adventure stories (like dime novels) had their greatest popularity in the early twentieth century. This genre appears to be an enthusiastic employer of quicksand as a conventional obstacle and threat, and the decline of the genre coincides with the decline of the word. How the rise of quicksand as a cinematic device relates to the rise of particular kinds of movies, I’m not sure, but the relationships among Engber’s data and the linguistic data pose some compelling questions.

Dropping the “I-Word”

April 4, 2013

The Associated Press announced that they are changing their style guide to drop the phrase “illegal immigrant” while retaining phrases like “illegal immigration” and “entering the country illegally.”

Immigrant advocates have been fighting for this change for a long time. Part of the thinking is that the phrase illegal immigrant is essentializing; it links the concept of illegality to the people themselves.

A quick look at COCA shows that in American discourse generally, immigrants are conventionally represented as illegal.

collocate | frequency | mutual information
Collocates for lemmatized immigrant on COCA, using a span of 4 to the left and right

It is telling not only that the frequency of illegal is so much higher, but also that its Mutual Information value is so high. Mutual Information is a measure of association. In a corpus, two words might co-occur frequently simply because both words are themselves frequent. Less frequent words would have a lower co-occurrence frequency but a high degree of association: when they appear, they appear together. Mutual Information accounts for these variations and gives us a measure of association: How likely are these words to appear in the same neighborhood? The usual cut-off for significance is MI > 3.
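To make the measure concrete, here is a minimal sketch of the pointwise MI calculation. (Interfaces like COCA also adjust for the size of the collocation span, so treat this as the simplified textbook version, not COCA's exact formula.)

```python
import math

def mutual_information(co_freq, freq_x, freq_y, corpus_size):
    """Pointwise MI: log2 of the observed co-occurrence count over the
    count expected if the two words were distributed independently."""
    expected = freq_x * freq_y / corpus_size
    return math.log2(co_freq / expected)

# A pair that co-occurs 8 times where independence predicts just once
# scores exactly 3 -- the usual significance cut-off.
print(mutual_information(8, 100, 100, 10000))  # 3.0
```

The log scale is why even modest-looking ratios of observed to expected co-occurrence translate into MI values well above the cut-off.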

So illegal and immigrant (MI=8.30) have a very high degree of association. This is somewhat surprising, given that illegal doesn’t seem like a particularly specialized modifier. Undocumented has a higher degree of association (MI=9.42). Its use, however, is far more restricted. It appears only with nouns related to the movement of people across borders: workers, students, aliens, people, migrants, etc.

Another interesting feature of this debate is how recent the practice of representing immigrants as illegal is. Despite a long history of contentious discourse around immigration in the US, the results from COHA show that this specific construction is quite new.


For an analysis of representations of immigrants and immigration in Britain, see Gabrielatos and Baker (2008).

Identity and the Synthesized Voice

December 22, 2012

On the BBC’s “Click” podcast last week, the hosts discussed evolving voice technologies and their increasing ability to incorporate complex prosodic and phonological properties like emotional emphasis and accent. It’s an interesting conversation and worth listening to, but the segment that caught my attention touched on Stephen Hawking and the preservation of legacy technologies related to his voice:

[Audio: Stephen Hawking’s Synthesized Voice]

A 2001 Deborah Cameron article on “designer voices” provides some interesting context for Hawking’s choices regarding his synthesized voice:

Hawking faces an unusual choice here. It may seem obvious that a British accent would be more ‘authentic’ than the American one he has had to make do with up until now; this is a question of what linguists call ‘social indexicality’, the ability of the speaking voice to point to social characteristics like age, gender, class, ethnicity or, most saliently in this case, national origin. Yet voices are also privileged markers of individual identity. It would not be unreasonable for Hawking to take the view that his synthesised American voice essentially is his voice, in the sense that it is instantly recognisable as the voice of Stephen Hawking (indeed, so recognisable that Hawking can earn money doing advertising voice-overs). Some media reports noted that the scientist was having difficulty deciding whether to make the shift: he was quoted as saying that ‘it will bring a real identity crisis’.

There are a couple of issues here that I think are compelling. The first is one that both the podcast and Cameron note: Hawking’s choice speaks to the complex and powerful connection between our voices and our identities. In particular, it suggests that such a relationship can be an evolving one. In his early encounters with his synthesized voice, Hawking experienced a kind of distance from its American-ness. However, he ultimately chose to retain it in the face of other possibilities. And there is no question how recognizable his voice is, having inspired a raft of amateur and professional parodies.

The other interesting issue is what that choice has meant technically. It has necessitated the preservation of legacy technologies in order to maintain a powerful signifier of self. Which makes me wonder: as technologies emerge that connect ever more closely to our biologies, our identities and the signals we use to advertise those identities will change, too. Those inter-related changes, technological and indexical, may not move at the same pace, however. What might this mean? Could a voice, an identity, face a forced obsolescence?

Understanding Keyness

November 30, 2012

In order to identify significant differences between two corpora, or two parts of a corpus, we often use a statistical measure called keyness.

Imagine two highly simplified corpora. Each contains only three different words (cat, dog, and cow) and has a total of 100 words. The frequency counts are as follows:

  • Corpus A: cat 52; dog 17; cow 31
  • Corpus B: cat 9; dog 40; cow 31

Cat and dog would be key, as they are distributed differently across the corpora, but cow would not, as its distribution is the same. Put another way, cat and dog are distinguishing features of the corpora; cow is not.

Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.

There are two common methods for calculating distributional differences: a chi-squared (or χ²) test and log-likelihood. AntConc, for example, gives you the option of selecting one or the other under “Tool Preferences” in the Keyword settings:

For this post, I’m going to concentrate on the chi-squared test. Lancaster University has a nice explanation of log-likelihood here.

A chi-squared test measures the significance of an observed versus an expected frequency. To see how this works, I’m going to walk through an example from Rayson et al. (1997), summarized by Baker (2010).

For this example, we’ll consider the distribution by sex of the word lovely. Here is their data:

 | lovely | All other words | Total words
Males | 414 | 1714029 | 1714443
Females | 1214 | 2592238 | 2593452
Total | 1628 | 4306267 | 4307895
Table 1: Use of 'lovely' by males and females in the spoken demographic section of the BNC

What we want to determine is the statistical significance of this distribution. To do so, we’ll apply the same formula that they did: Pearson’s chi-squared test. Here is the formula:

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

O is the observed frequency, and E is the expected frequency if the independent variable (in this case sex) had no effect on the distribution of the word. The Σ is the mathematical symbol for ‘sum’. In our example, we need to calculate the sum of (or add together) four values: observed minus expected, squared, divided by expected, for (1) lovely used by males; (2) other words used by males; (3) lovely used by females; and (4) other words used by females.

In other words, our main calculations are for the four cell values of a 2×2 contingency table. The totals (the peripheral values in the table) are what we use to determine our expected frequencies.

To find the expected frequency for a value in our 2×2 table, we use the following formula:

(R * C) / N

That is, the row total (R) times the column total (C) divided by the number of words in the corpus (N).

The expected value of lovely for males is:

(1714443 * 1628) / 4307895 = 647.91

If we then fill out the expected frequencies for the rest of our table, it looks like this:

             lovely   All other words   Total words
Males        647.91      1,713,795.09     1,714,443
Females      980.09      2,592,471.91     2,593,452
Total         1,628         4,306,267     4,307,895

Table 2: Expected use of 'lovely' in the spoken demographic section of the BNC

Now we can finish our calculations. For each of our table cells, we need to subtract the expected frequency from the observed frequency; multiply that value by itself; then divide the result by the expected frequency. The calculations for each cell would look like this:

  • Males, lovely: ((414 - 647.91) * (414 - 647.91)) / 647.91
  • Males, all other words: ((1714029 - 1713795.09) * (1714029 - 1713795.09)) / 1713795.09
  • Females, lovely: ((1214 - 980.09) * (1214 - 980.09)) / 980.09
  • Females, all other words: ((2592238 - 2592471.91) * (2592238 - 2592471.91)) / 2592471.91

Table 3: Calculating the chi-square values for our frequencies

When we complete those calculations, our contingency table looks like this:

             lovely   All other words
Males         84.45              0.03
Females       55.83              0.02

Table 4: Chi-square values for our frequencies

Now we just add together those four values:

84.45 + 0.03 + 55.83 + 0.02 ≈ 140.3
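The whole procedure can be collapsed into a few lines of Python. This is a minimal sketch (the function name and argument layout are my own, not from any particular concordancing tool):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 contingency table.

    a = target word in group 1, b = all other words in group 1,
    c = target word in group 2, d = all other words in group 2.
    """
    n = a + b + c + d  # total words in the combined corpus
    chi2 = 0.0
    # For each cell: expected = (row total * column total) / n,
    # then add (observed - expected)^2 / expected.
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        expected = row * col / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2

# 'lovely' by males and females in the spoken BNC (Rayson et al. 1997)
print(round(chi_square_2x2(414, 1714029, 1214, 2592238), 1))  # 140.3
```

Note that the function only needs the four observed cell values; the row totals, column totals, and expected frequencies are derived inside the loop, just as in the hand calculation above.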

If you want to check our calculations, you can use an online chi-square calculator. Most are designed to make this easy, as they take the values that concordancing programs typically generate: word list counts and corpus sizes.

Now the question is: What does this number tell us? Typically, we determine the significance of keyness values in one of two ways. First, sometimes corpus linguists just look at the top key words (maybe the top 20) to explore the most marked differences between two corpora. Second, we can find the p-value.

The p-value is the probability that our observed frequencies are the result of chance. To find the p-value we need to look it up on a published table (or use an online calculator). In either case, we need to know the degrees of freedom. For a contingency table, the degrees of freedom are (number of rows − 1) × (number of columns − 1), not counting totals. For a 2×2 table like ours, that is (2 − 1) × (2 − 1).

For us, then, df = 1

We can look up the p-value on the simplified table below, where the degree of freedom is one.

p-value:          0.05    0.01    0.001
Critical value:   3.84    6.63    10.83

Table 5: Chi-square distribution table showing critical p-values for df = 1

The critical cutoff point for statistical significance is usually at p<.01 (though it can also be p<.05). So a chi-square value above 6.63 would be considered significant. Our value is 140.3, so the distribution of lovely is highly significant (p<.0001). In other words, the probability that our observed distribution was by chance is approaching zero.
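For the common df = 1 case, you don’t even need a lookup table. A chi-square variable with one degree of freedom is the square of a standard normal, so its survival function reduces to erfc(√(x/2)), which Python’s standard library provides. A minimal sketch (the function name is my own):

```python
import math

def chi2_pvalue_df1(x):
    """p-value for a chi-square statistic with df = 1.

    Since a chi-square variable with one degree of freedom is the
    square of a standard normal, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))

print(round(chi2_pvalue_df1(6.63), 3))   # ~0.010, the p < .01 cutoff
print(chi2_pvalue_df1(140.3) < 0.0001)   # True: the 'lovely' result
```

Plugging in the critical values from Table 5 recovers the familiar cutoffs (3.84 gives roughly .05, 6.63 roughly .01), which is a handy sanity check.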

I want to leave you with a few tips, questions, and resources:

  1. Under most circumstances, our concordancing software can calculate keyness values for us; however, if we are interested in multi-word units like phrasal verbs, we can use online chi-square or log-likelihood calculators to determine their keyness.
  2. Typically, our chi-square tests in corpus linguistics will involve a 2×2 contingency table (with a degree of freedom of one); however, this isn’t always the case. We might be interested in, for example, distributions of multiple spellings (e.g., because, cause, cuz, coz) that would involve higher degrees of freedom.
  3. Why would comparing larger corpora tend to produce results with larger chi-square (or keyness) values?
  4. Anatol Stefanowitsch has an excellent entry on chi-square tests here.
  5. There is a nice chi-square calculator here.
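On the first point: log-likelihood can also be computed by hand. A minimal sketch of the two-corpus formula described on Lancaster’s UCREL site (variable names are my own; the guards handle zero frequencies, where the log term is taken as zero):

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word across two corpora.

    a = word's frequency in corpus 1, b = its frequency in corpus 2,
    c = size of corpus 1 in words,    d = size of corpus 2 in words.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# 'lovely' again: 414 of 1,714,443 male words vs. 1,214 of 2,593,452 female words
print(round(log_likelihood(414, 1214, 1714443, 2593452), 1))
```

For the lovely example this gives a value in the same neighborhood as the chi-square statistic, comfortably above the p < .01 cutoff either way.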

Normalizing Word Counts

16 Nov
November 16, 2012

One of the things we often do in corpus linguistics is to compare one corpus (or one part of a corpus) with another. Let’s imagine an example. We have 2 corpora: Corpus A and Corpus B. And we’re interested in the frequency of the word boondoggle. We find 18 occurrences in Corpus A and 47 occurrences in Corpus B. So we make the following chart:

The problem here is that unless Corpus A and Corpus B are exactly the same size, this chart is misleading. It doesn’t accurately reflect the relative frequencies in each corpus. In order to accurately compare corpora (or sub-corpora) of different sizes, we need to normalize their frequencies.

Let’s say Corpus A contains 821,273 words and Corpus B contains 4,337,846 words. Our raw frequencies then are:

Corpus A = 18 per 821,273 words

Corpus B = 47 per 4,337,846 words

To normalize, we want to calculate the frequencies for each per the same number of words. The convention is to calculate per 10,000 words for smaller corpora and per 1,000,000 for larger ones. The Corpus of Contemporary American English (COCA), for example, uses per-million calculations in the chart display for comparisons across text-types.

Calculating a normalized frequency is a fairly straightforward process. The equation can be represented in this way:

18 / 821,273 = x / 1,000,000

We have 18 occurrences per 821,273 words, which is the same as x (our normalized frequency) per 1,000,000 words. We can solve for x with simple cross multiplication:

x = (18 * 1,000,000) / 821,273 ≈ 21.92

Generalizing then (normalizing per one-million words):

normalized frequency = (raw frequency * 1,000,000) / corpus size

You can plug your own numbers into this formula to see how it works. (If you use an online calculator, enter the values without comma separators.)

For our example, normalizing gives about 21.92 occurrences per million words in Corpus A and about 10.83 per million in Corpus B. The raw frequencies seemed to suggest that boondoggle appeared more than 2.5 times more often in Corpus B. The normalized frequencies, however, show that boondoggle is actually about twice as frequent in Corpus A.
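The normalization itself is a one-liner. A minimal sketch (the function name and default are my own):

```python
def normalize(raw_count, corpus_size, per=1_000_000):
    """Convert a raw frequency to a rate per `per` words
    (per million by default; pass per=10_000 for smaller corpora)."""
    return raw_count / corpus_size * per

print(round(normalize(18, 821273), 2))    # Corpus A: 21.92 per million
print(round(normalize(47, 4337846), 2))   # Corpus B: 10.83 per million
```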

A Twitter Post

18 Oct
October 18, 2012

I came across an interesting use of Twitter as a modifier in an article in the New York Times. In describing a proposed development of some very small apartments in San Francisco, the article states:

Opponents of the legislation have even taken to derisively calling the micro-units “Twitter apartments.”

Of course the process by which brand names become common nouns or verbs is well known: frisbee, xerox, band-aid, kleenex, hoover, google, etc. It is possible that this is the leading edge of a similar semantic shift whereby Twitter becomes a modifier meaning “very small” or “tiny.” I tried to find similar uses of Twitter but had no success. I first tried checking common nouns that collocate with tiny (apartment, by the way, is the 8th most frequent collocate). Kitchen and house seem likely possibilities, but I have yet to find any constructions in which Twitter kitchen means tiny kitchen, for example. As of yet, this use of Twitter appears to be an isolated neologism.