Corpus of Presidential Speeches (CoPS) and a Clinton/Trump Corpus (Updated!)
In this election season, both linguists and the popular press have shown an interest in candidates’ speaking styles, particularly Donald Trump’s (examples here, here, here, and here). In light of that interest, I have compiled a couple of corpora for instructors and students to use.
The first is a historical corpus. It contains data scraped from archives of speeches made by U.S. Presidents from George Washington to Barack Obama. The total word count for the corpus is about 3.5 million words. The files are in folders organized by President, making it relatively easy to do comparisons of individuals, by party, or by period. The precise breakdown by President is included in the table below, and there is a link for you to download the corpus at the end of this post.
Num. | President | Term | Word Count |
---|---|---|---|
01 | George Washington | 1789-1797 | 31643 |
02 | John Adams | 1797-1801 | 14672 |
03 | Thomas Jefferson | 1801-1809 | 40149 |
04 | James Madison | 1809-1817 | 36049 |
05 | James Monroe | 1817-1825 | 49960 |
06 | John Quincy Adams | 1825-1829 | 36472 |
07 | Andrew Jackson | 1829-1837 | 157535 |
08 | Martin Van Buren | 1837-1841 | 64747 |
09 | William Henry Harrison | 1841 | 8465 |
10 | John Tyler | 1841-1845 | 69471 |
11 | James K. Polk | 1845-1849 | 104267 |
12 | Zachary Taylor | 1849-1850 | 11368 |
13 | Millard Fillmore | 1850-1853 | 39392 |
14 | Franklin Pierce | 1853-1857 | 63448 |
15 | James Buchanan | 1857-1861 | 80883 |
16 | Abraham Lincoln | 1861-1865 | 95643 |
17 | Andrew Johnson | 1865-1869 | 98806 |
18 | Ulysses S. Grant | 1869-1877 | 103060 |
19 | Rutherford B. Hayes | 1877-1881 | 67474 |
20 | James A. Garfield | 1881 | 2980 |
21 | Chester Arthur | 1881-1885 | 49590 |
22, 24 | Grover Cleveland | 1885-1889, 1893-1897 | 155553 |
23 | Benjamin Harrison | 1889-1893 | 76363 |
25 | William McKinley | 1897-1901 | 92318 |
26 | Theodore Roosevelt | 1901-1909 | 196692 |
27 | William Howard Taft | 1909-1913 | 117594 |
28 | Woodrow Wilson | 1913-1921 | 80123 |
29 | Warren G. Harding | 1921-1923 | 28752 |
30 | Calvin Coolidge | 1923-1929 | 74333 |
31 | Herbert Hoover | 1929-1933 | 87888 |
32 | Franklin D. Roosevelt | 1933-1945 | 132082 |
33 | Harry S. Truman | 1945-1953 | 36954 |
34 | Dwight D. Eisenhower | 1953-1961 | 18097 |
35 | John F. Kennedy | 1961-1963 | 136196 |
36 | Lyndon B. Johnson | 1963-1969 | 231949 |
37 | Richard Nixon | 1969-1974 | 66482 |
38 | Gerald Ford | 1974-1977 | 40446 |
39 | Jimmy Carter | 1977-1981 | 70388 |
40 | Ronald Reagan | 1981-1989 | 196553 |
41 | George Bush | 1989-1993 | 71160 |
42 | Bill Clinton | 1993-2001 | 144580 |
43 | George W. Bush | 2001-2009 | 107737 |
44 | Barack Obama | 2009-2016 | 199211 |
Total | | | 3587525 |
If you are planning on using this corpus, take note of a few things. First, each file has a heading, like this one:
<title="State of the Union Address">
<date="January 23, 1979">
As in the heading, angle brackets enclose everything other than the words spoken by the President named in each file, which isolates that President’s speech. For example, this comes from a debate between George Bush and Michael Dukakis, moderated by Jim Lehrer:
<LEHRER: Mr. Vice President, a rebuttal.>
<BUSH:> Well, I don’t question his passion. I question — and I don’t question his concern about the war in Vietnam.
This sort of “tagging” was accomplished by a simple script and may not be 100% accurate. So be sure to continually check your data. It also means that if you are using a concordancer like AntConc, you will need to set the “Tags” option under “Global Settings” to “Hide tags.”
I did this so that the context of some of these communicative events is preserved. But don’t forget to set your tag option; otherwise, your data will be quite skewed.
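If you are working in a script rather than a concordancer, you can approximate “Hide tags” by deleting everything between angle brackets before you count anything. What follows is a minimal Python sketch for illustration, not the script that produced the tags. It assumes that tags are never nested and that every “<” is closed by a “>”. The function names (`strip_tags`, `read_header`) and the sample text, which is stitched together from the two excerpts above, are made up for this example.

```python
import re

# Matches anything between "<" and the next ">", including across line breaks.
# Assumption: tags are never nested and every "<" has a closing ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

# Quote characters that may surround header values:
# straight quote, curly quotes, and the double prime.
QUOTES = '"\u201c\u201d\u2033'

# Matches header lines such as <title="..."> or <date="...">.
HEADER_PATTERN = re.compile(r"<(\w+)=[" + QUOTES + r"]([^>]*?)[" + QUOTES + r"]>")


def read_header(text):
    """Return header fields (for example title and date) as a dict."""
    return {key: value for key, value in HEADER_PATTERN.findall(text)}


def strip_tags(text):
    """Remove all angle-bracket material and collapse leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


# Hypothetical sample combining the header and debate excerpts shown above.
sample = (
    '<title="State of the Union Address">\n'
    '<date="January 23, 1979">\n'
    "<LEHRER: Mr. Vice President, a rebuttal.>\n"
    "<BUSH:> Well, I don't question his passion.\n"
)

print(read_header(sample))
print(strip_tags(sample))
```

Read the header before stripping, since `strip_tags` removes the title and date along with the moderator’s turn. Because the tagging itself may not be fully accurate, it is still worth spot-checking the stripped output against the original file.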
It is also important to be aware that this corpus is fine for learning purposes and for exploratory analysis. However, if you plan on doing more statistically rigorous work, you will need to account for a few things. First, as with all historical corpora, the earlier stuff comes from written texts. (We, of course, don’t have any recordings of George Washington.) But for the later Presidents, the corpus includes things like transcripts from debates and other more interactional occasions. So if you want to do some comparisons from different time periods, you’ll want to think through how register might influence (or not) the results.
Also, I assembled this corpus quickly. If you look through the table above, you’ll see that there are differing numbers of words for different Presidents. Some of this is a byproduct of how long Presidents served in office. (William Henry Harrison served just a month, while Franklin Roosevelt served a little over 12 years.) However, some Presidents just had more data available than others, irrespective of time. I have not attempted to compensate for these imbalances. Again, if you want to produce more statistically rigorous work, you may need to address this issue, depending upon what exactly you’re looking at.
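One common first step when sub-corpora differ this much in size is to compare normalized rates rather than raw counts. The Python sketch below converts raw counts to frequencies per million words. The corpus sizes come from the table above, but the raw counts are invented purely for illustration, and `per_million` is a hypothetical helper name. If you recount words after hiding tags, your denominators may differ from the table.

```python
def per_million(raw_count, corpus_size):
    """Convert a raw count to a frequency per million words."""
    return raw_count / corpus_size * 1_000_000


# Word counts taken from the table above.
sizes = {"William Henry Harrison": 8465, "Franklin D. Roosevelt": 132082}

# Hypothetical raw counts of some word of interest (NOT real corpus figures).
raw_counts = {"William Henry Harrison": 10, "Franklin D. Roosevelt": 100}

for president, size in sizes.items():
    rate = per_million(raw_counts[president], size)
    print(f"{president}: {rate:.1f} per million words")
```

Normalizing makes counts comparable across sub-corpora of different sizes, but it does not fix the underlying sampling problem: a rate based on a few thousand words is far less stable than one based on a few hundred thousand.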
The second corpus I’m releasing is a collection of speeches delivered at campaign events by Hillary Clinton and Donald Trump, beginning with their acceptance speeches at their respective party conventions and continuing up to the election. As with CoPS, this one uses angle brackets in headers and as tags, so be sure to set your preferences accordingly.
The corpus contains approximately 114,000 words from Clinton and 440,000 words from Trump. In addition to these files, I have posted the data on a GitHub repository (here). That repository also contains R code that calculates both log-likelihood (keyness) values and effect sizes (by log ratio), then compiles the results into a table that can be downloaded. You can see the results from the Clinton-Trump comparison here.
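For readers who want to see what those two measures involve, here is a rough Python sketch. It is not a port of the R code in the repository. It uses the common two-term form of the log-likelihood statistic and the binary log of the ratio of relative frequencies. The function names, the `zero_adjust` value of 0.5 for words missing from one corpus, and the example counts are all assumptions made for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term log-likelihood (G2) for one word across two corpora."""
    total = freq_a + freq_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def log_ratio(freq_a, size_a, freq_b, size_b, zero_adjust=0.5):
    """Binary log of the ratio of relative frequencies (an effect size)."""
    # Assumption: replace a zero count with a small constant to avoid log(0).
    freq_a = freq_a or zero_adjust
    freq_b = freq_b or zero_adjust
    return math.log2((freq_a / size_a) / (freq_b / size_b))


# Approximate corpus sizes from the post; the word counts are hypothetical.
clinton_size, trump_size = 114_000, 440_000
clinton_count, trump_count = 50, 600

print(log_likelihood(trump_count, trump_size, clinton_count, clinton_size))
print(log_ratio(trump_count, trump_size, clinton_count, clinton_size))
```

A log ratio of 1 means the word is twice as frequent, relative to corpus size, in the first corpus as in the second, and a negative value means it is more frequent in the second. Log-likelihood indicates how much evidence there is for a difference, and the log ratio indicates how large the difference is, which is why it helps to report both.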
To cite:
Brown, D. W. (2017) Clinton-Trump Corpus. Retrieved from http://www.thegrammarlab.com
Brown, D. W. (2016) Corpus of Presidential Speeches. Retrieved from http://www.thegrammarlab.com
Good luck and enjoy.