In this election season, both linguists and the popular press have shown an interest in candidates’ speaking styles, particularly Donald Trump’s (examples here, here, here, and here). In light of that interest, I have compiled a couple of corpora for instructors and students to use.
The first is a historical corpus. It contains data scraped from archives of speeches made by U.S. Presidents from George Washington to Barack Obama. The total word count for the corpus is about 3.5 million words. The files are in folders organized by President, making it relatively easy to compare individuals, parties, or periods. The precise breakdown by President is included in the table below, and there is a link for you to download the corpus at the end of this post.
|No.||President||Term||Word count|
|06||John Quincy Adams||1825-1829||36472|
|08||Martin Van Buren||1837-1841||64747|
|09||William Henry Harrison||1841||8465|
|11||James K. Polk||1845-1849||104267|
|18||Ulysses S. Grant||1869-1877||103060|
|19||Rutherford B. Hayes||1877-1881||67474|
|20||James A. Garfield||1881||2980|
|22, 24||Grover Cleveland||1885-1889, 1893-1897||155553|
|27||William Howard Taft||1909-1913||117594|
|29||Warren G. Harding||1921-1923||28752|
|32||Franklin D. Roosevelt||1933-1945||132082|
|33||Harry S. Truman||1945-1953||36954|
|34||Dwight D. Eisenhower||1953-1961||18097|
|35||John F. Kennedy||1961-1963||136196|
|36||Lyndon B. Johnson||1963-1969||231949|
|43||George W. Bush||2001-2009||107737|
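Because each President has his own folder, a short script can tally the material in a local copy of the corpus. The following is a minimal Python sketch, not the script used to build the table. It assumes one subfolder per President containing plain-text `.txt` files; the file extension and the function name `tokens_per_folder` are assumptions made for illustration. It counts whitespace-separated tokens, including any material inside angle brackets, so its totals will not necessarily match the figures above.

```python
from pathlib import Path


def tokens_per_folder(corpus_root):
    """Count whitespace-separated tokens in each subfolder of corpus_root.

    Assumption: every subfolder holds one President's speeches as .txt files.
    """
    totals = {}
    for folder in sorted(Path(corpus_root).iterdir()):
        if not folder.is_dir():
            continue
        total = 0
        for path in folder.glob("*.txt"):
            # errors="replace" keeps the loop going if a file has odd bytes
            text = path.read_text(encoding="utf-8", errors="replace")
            total += len(text.split())
        totals[folder.name] = total
    return totals
```

Calling `tokens_per_folder` with the path to your unzipped copy returns a dictionary that maps each folder name to its token count, which makes the imbalances between Presidents easy to see at a glance.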
If you are planning on using this corpus, take note of a few things. First, each file has a heading, like this one:
<title="State of the Union Address">
<date="January 23, 1979">
As in the heading, angle brackets enclose everything that was not spoken by the President named in the file, so that his own speech can be isolated. For example, this comes from a debate between George Bush and Michael Dukakis, moderated by Jim Lehrer:
<LEHRER: Mr. Vice President, a rebuttal.>
<BUSH:> Well, I don’t question his passion. I question — and I don’t question his concern about the war in Vietnam.
This sort of “tagging” was done by a simple script and may not be 100% accurate, so be sure to check your data regularly. It also means that if you are using a concordancer like AntConc, you will need to set the “Tags” option under “Global Settings” to “Hide tags.”
I kept the tagged material in the files so that the context of some of these communicative events is preserved. But don’t forget to set your tag option; otherwise, your data will be quite skewed.
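If you are working with a script rather than a concordancer, the same effect can be approximated by deleting everything inside angle brackets before counting. Here is a minimal Python sketch, assuming that tags never nest and that the header fields use straight double quotes; the function names are illustrative and are not part of the corpus or of AntConc.

```python
import re

# Assumption: a tag is any run of text between "<" and ">", with no nesting.
TAG = re.compile(r"<[^<>]*>")
# Assumption: header lines look like <title="..."> and <date="...">.
HEADER = re.compile(r'<(title|date)="([^"]*)">')


def parse_header(text):
    """Return the title and date fields found in a file's heading."""
    return {key: value for key, value in HEADER.findall(text)}


def strip_tags(text):
    """Drop all tagged material and collapse the leftover whitespace."""
    return " ".join(TAG.sub(" ", text).split())


header_sample = '<title="State of the Union Address">\n<date="January 23, 1979">'
debate_sample = (
    "<LEHRER: Mr. Vice President, a rebuttal.>\n"
    "<BUSH:> Well, I don't question his passion."
)

print(parse_header(header_sample))
print(strip_tags(debate_sample))
```

Since the tagging itself may contain errors, it is worth reading a few stripped files by eye before trusting any counts based on them.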
It is also important to be aware that this corpus is fine for learning purposes and for exploratory analysis. However, if you plan on doing more statistically rigorous work, you will need to account for a few things. First, as with all historical corpora, the earlier material comes from written texts. (We, of course, don’t have any recordings of George Washington.) But for the later Presidents, the corpus includes things like transcripts from debates and other more interactional occasions. So if you want to compare different time periods, you’ll want to think through how register might (or might not) influence the results.
Also, I assembled this corpus quickly. If you look through the table above, you’ll see that there are differing numbers of words for different Presidents. Some of this is a byproduct of how long Presidents served in office. (William Henry Harrison served just a month, while Franklin Roosevelt served a little over 12 years.) However, some Presidents simply had more data available than others, irrespective of time in office. I have not attempted to compensate for these imbalances. Again, if you want to produce more statistically rigorous work, you may need to address this issue, depending upon what exactly you’re looking at.
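One common way to soften the size imbalance is to report relative frequencies, such as occurrences per million words, rather than raw counts. The sketch below shows the arithmetic in Python. The corpus sizes are copied from the table above, but the raw counts are invented purely for illustration, and normalizing by size does nothing about the differences in register.

```python
# Word totals copied from the table above.
CORPUS_SIZES = {
    "William Henry Harrison": 8465,
    "Lyndon B. Johnson": 231949,
}


def per_million(raw_count, corpus_size):
    """Convert a raw count into a rate per million words."""
    return raw_count / corpus_size * 1_000_000


# Hypothetical raw counts of some word, made up for this example only.
raw_counts = {"William Henry Harrison": 10, "Lyndon B. Johnson": 100}

for name, count in raw_counts.items():
    rate = per_million(count, CORPUS_SIZES[name])
    print(f"{name}: {rate:.1f} per million words")
```

With these invented numbers, Johnson has ten times as many raw hits as Harrison, yet Harrison’s rate per million words is the higher of the two, which is exactly the kind of reversal that raw counts hide.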
The second corpus I’m releasing is a collection of speeches delivered at campaign events by Hillary Clinton and Donald Trump, beginning with their acceptance speeches at their respective party conventions. As with CoPS, this one uses angle brackets in headers and as tags, so be sure to set your preferences accordingly.
Because not much time has elapsed since the conventions, the corpus is still relatively small (about 30,000 words from Clinton and 125,000 words from Trump). But check back, because I will be updating the corpus between now and the election.
Good luck and enjoy.