This walk-through introduces a couple of AntConc’s basic functions: word lists, keyword lists, and concordances. Bear in mind that AntConc can do a lot more than what is shown here; what follows is intended to get you started. (I would recommend Laurence Anthony’s tutorials, as well.)
AntConc is a concordancer, a tool that helps us to process corpora. It was developed by Laurence Anthony (hence Ant[hony] Conc[ordancer]). A popular alternative is Mike Scott’s WordSmith Tools, which is more powerful, but also requires a license.
So before we begin, you’ll need to download and install the version of AntConc that is appropriate for your operating system. Unzip the application and move it to your Applications folder. Now you can open it. (Note that if you have a Mac, the first time you do this it will likely prompt a warning. You will need to go to your Security & Privacy settings in your System Preferences and authorize the application.)
Now we’ll need some data. For this exercise we’re going to use a corpus of Shakespeare’s plays compiled by Mike Scott. You can download it here. After you download it, you can unzip it. (On a Mac just double-click it; on a PC right-click and choose Extract files.)
The first thing we need to do is to load what is called our “target corpus.” The target corpus is simply the collection of texts that we want to analyze — the target of our investigation.
In AntConc, we can load files one at a time. Or we can point AntConc to a folder and tell it to load whatever text files are in there (potentially saving us a lot of time if our corpus has a lot of individual files).
So that’s where we need to start.
1. Go to the File menu and select Open Dir…
2. Next, navigate to the folder called comedies and select it.
You will notice that the list of files (in this case comedies) shows up under Corpus Files on the left.
Now we need to load in our “reference corpus.” When we do corpus analysis, one common technique is to compare one corpus to another. Those comparisons produce what are called keywords. Keywords are words that are statistically more common in one set of texts (our target corpus) when compared to another (our reference corpus). So the next step is to load in our reference corpus so we can produce a keyword list.
3. Go to Tool Preferences (under Settings for the Mac version) and select Keyword List on the left.
4. Select Add Directory, navigate to the folder called historical, and select it.
5. Hit the Load button. When that process completes, click Apply.
Now one final step before we start to analyze our data. If we were to open one of our corpus files, it would be formatted like this:
Now, fair Hippolyta, our nuptial hour
Draws on apace: four happy days bring in
Another moon; but O! methinks how slow
This old moon wanes; she lingers my desires,
Like to a step dame, or a dowager
Long withering out a young man’s revenue.
Note the mark-up between the angle brackets. We want to tell AntConc to ignore that mark-up, because we don’t want any of those words to affect our overall results. We want to focus on the dialogue, rather than character names, stage directions, etc.
6. Go to Global Settings. Under Tags select Hide Tags and then click Apply.
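If you are curious what hiding tags amounts to, here is a rough Python sketch of the idea. It assumes that mark-up is anything enclosed in a single pair of angle brackets; the function name `hide_tags` and the tagged sample line are my own inventions for illustration, and this is not a description of how AntConc itself handles tags internally.

```python
import re

# Assumption: a "tag" is any run of characters between < and >.
TAG_PATTERN = re.compile(r"<[^>]*>")


def hide_tags(text: str) -> str:
    """Drop angle-bracket mark-up and tidy up the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


# Hypothetical tagged line, made up for this example (not copied from the corpus).
sample = "<THESEUS> Now, fair Hippolyta, our nuptial hour <Exit>"
cleaned = hide_tags(sample)
print(cleaned)
```

Only the dialogue survives, which is the effect we want before counting words.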
Now we’re in business. Whatever we do in AntConc, we always have to start with a word list in order to process the texts into the concordancer.
7. Go to the Word List tab and click Start.
You might notice a few things about a basic word list. First, it’s difficult to recognize anything distinctive about a basic word list (unless you’ve seen a lot of word lists). Most of the very common words like pronouns (e.g., I, you, me), determiners (e.g., the), coordinators (e.g., and) and prepositions (e.g., to, in) appear at or near the top of most any word list we generate.
Second, we have some numbers. The rank tells the order: the is the most frequent word, i (note all tokens are converted to lower case) the second, etc. The frequency (Freq) is a count of the occurrences of that token. So the occurs 10,898 times in the corpus. In order to make any sense of that number we would probably want to normalize it. To do that we divide the frequency by the total number of words in the corpus (Word Tokens), which in this case is 357,281. Then we multiply that by a normalizing factor like one million. So the occurs 30,502.60 per million words.
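The normalization described above is simple arithmetic, and it can help to see it written out. The short Python sketch below uses the two numbers from the word list (10,898 occurrences of the, 357,281 word tokens); the function name `per_million` is just a label I chose.

```python
def per_million(freq: int, total_tokens: int, factor: int = 1_000_000) -> float:
    """Normalize a raw frequency to a rate per `factor` words."""
    return freq / total_tokens * factor


# Values taken from the word list: frequency of "the" and total Word Tokens.
rate = per_million(10_898, 357_281)
print(f"{rate:,.2f}")  # prints 30,502.60
```

Changing `factor` to 10,000 or 1,000 gives the same comparison on a smaller scale, which can be easier to read for small corpora.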
Another interesting number is the Word Types. This is the number of different words in the corpus. We can divide that number by the total number of tokens to get a type/token ratio (TTR). That number gives us a rough idea of the diversity of vocabulary (or lexical variation) in a corpus. And I emphasize rough, because TTRs are affected by the lengths of texts that we’re dealing with. Shorter texts will have higher TTRs than longer ones. So if we wanted to compare texts of varying lengths, a TTR wouldn’t be particularly helpful, and we’d need to get fancier with our calculations (with something like a mean segmental type-token ratio).
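To make the difference between a plain TTR and a segmental one concrete, here is a minimal Python sketch. It assumes the text has already been split into lower-case tokens, and it drops any leftover tokens that do not fill a complete segment, which is one common convention but not the only one. The function names and the toy sentence are mine, and the tiny segment size is only there so the example is readable; in practice segments are usually much longer.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


def msttr(tokens, segment_size=100):
    """Mean TTR over consecutive, equally sized segments.

    Assumption: tokens left over after the last full segment are ignored.
    """
    last_start = len(tokens) - segment_size
    segments = [
        tokens[i:i + segment_size]
        for i in range(0, last_start + 1, segment_size)
    ]
    if not segments:
        raise ValueError("text is shorter than one segment")
    return sum(type_token_ratio(seg) for seg in segments) / len(segments)


# Toy sentence, invented for illustration.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
ttr = type_token_ratio(tokens)
segmental = msttr(tokens, segment_size=4)
print(round(ttr, 3), round(segmental, 3))
```

Because every segment has the same length, the segmental measure can be compared across texts of different sizes in a way the plain TTR cannot.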
Okay, now let’s generate a keyword list.
8. Click on the Keyword List tab and click Start.
The keyness value (for example 919.51 for you) tells us how much evidence we have for a particular effect (though not the size of that effect). By default, these are calculated in AntConc by log-likelihood. The threshold for significance is conventionally set at LL = 6.63 (and sometimes at LL = 3.84). So tokens with keyness values above that threshold would be considered significant. The threshold of 6.63 corresponds to the point at which the p-value is less than 0.01; 3.84 corresponds to p < 0.05. (If you’ve never heard of a p-value before, this is a good introduction.) In short, we have strong evidence that the difference between our observed frequency (in the comedies) and what we would expect (given a dataset that also includes the historical plays) is significant.
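For anyone who wants to see where a keyness value comes from, here is a sketch of the log-likelihood calculation as it is commonly presented in corpus linguistics: each observed frequency is compared with the frequency we would expect if the word were spread evenly across both corpora. I am assuming this standard two-corpus formulation; AntConc’s exact implementation and settings may differ, and the counts in the example are invented rather than taken from the Shakespeare corpus.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness for one word across two corpora.

    Assumption: the common formulation in which expected frequencies
    come from pooling the target and reference corpora.
    """
    total_size = size_target + size_ref
    combined_freq = freq_target + freq_ref
    expected_target = size_target * combined_freq / total_size
    expected_ref = size_ref * combined_freq / total_size

    ll = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: 100 hits in a 1,000-word target corpus,
# 50 hits in a 1,000-word reference corpus.
keyness = log_likelihood(100, 1_000, 50, 1_000)
print(round(keyness, 2), keyness > 6.63)
```

When the word is equally frequent in both corpora, the observed and expected values match and the statistic is zero; the further apart they are, the larger the value.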
Okay, but what do we do with that information? This is when we need to ask ourselves if there are any recognizable patterns. For example, it looks like a number of pronouns have high keyness values. We also have a number of tokens that seem to have thematic implications like love and gods. The next question we need to ask is: Why? What is it about the thematic focus of these comedies or the structure and function of their dialogue that would potentially explain these differences?
To begin positing an explanation, we can click on a keyword, which takes us to the Concordance tab.
Note that at the bottom you have a sorting option. Those drop-down lists determine how we want our lines to be sorted. I selected love. And because love is a verb, I was interested in the subject of the verb (who is doing the loving) and the object of the verb (who is being loved). So I sorted by 1L (one to the left) followed by 1R (one to the right) and 2R (two to the right). Then, I just clicked the Sort button.
From here, we can try to make some sense out of these patterns. For example, there are quite a few first-person declarations of love directed at thee. And there are surely other patterns here, too. To explore this further, we might use the Clusters/N-Grams function to look for common groupings of words (like I love thee, a trigram or three-word cluster), or we could use the Collocates function to see which words collocate with (or appear around) love.
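The 1L, 1R, 2R sort described above is easy to picture in code. The Python sketch below builds simple concordance lines for a node word and sorts them by the word one to the left, then one and two to the right. It is an illustration of the idea only, with a made-up sentence and a function name (`concordance`) of my own; it is not how AntConc is implemented.

```python
def concordance(tokens, node, sort_offsets=(-1, 1, 2), span=3):
    """Return (left context, node, right context) lines for `node`,
    sorted by the words at `sort_offsets` relative to each hit.

    Offsets: -1 is 1L, 1 is 1R, 2 is 2R. Positions that fall outside
    the text are treated as empty strings when sorting.
    """
    def word_at(index, offset):
        position = index + offset
        return tokens[position] if 0 <= position < len(tokens) else ""

    hits = [i for i, token in enumerate(tokens) if token == node]
    hits.sort(key=lambda i: tuple(word_at(i, off) for off in sort_offsets))

    lines = []
    for i in hits:
        left = " ".join(tokens[max(0, i - span):i])
        right = " ".join(tokens[i + 1:i + 1 + span])
        lines.append((left, tokens[i], right))
    return lines


# Invented sentence for illustration; tokens are already lower-cased.
tokens = "i love thee well but you love me not and i love her".split()
lines = concordance(tokens, "love")
for left, node, right in lines:
    print(f"{left:>15} | {node} | {right}")
```

Sorting this way groups together lines that share a subject (the word at 1L) and then orders them by object, which is what makes patterns like I love thee stand out.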
There are a couple of time saving features that I want to point out before concluding here. One is that you can easily switch your target and your reference corpus.
If you go back to the Keyword List options under Tool Preferences, you can switch corpora.
9. Select Swap with Target Files. Then, click Clear followed by Load. Finally, hit Apply.
Now you simply start the process again. First, generate a word list, then a keyword list. The latter will show you which tokens are significantly more frequent in the historical plays (which are now our target corpus) when compared to the comedies (which are now our reference corpus).
Finally, be aware that anything you generate in AntConc can easily be saved to a text file, whether that’s a keyword list, concordance lines, whatever. Just go to the File menu and select Save Output to Text File.