Workshop: Introduction to Text Analysis with R
The following resources are for my workshop at Wake Forest University. If you are planning to attend the workshop, please read the introductory materials and follow the instructions for installing R and downloading the materials (data and code in a zip file) that we will be using.
About the materials
The materials and instructions that follow are for (what I hope will be) a very gentle introduction to text analysis using R. They are meant to help those with no (really!) background in coding generally and R specifically.
If you do have some experience coding, there will be sections (especially if you are accessing this on your own) that you’ll find pretty basic. But if you’re new to coding or have very little experience, I want these materials to help you understand R’s syntax, so that you’ll have a fundamental grasp of how to read and adapt snippets of code and, thus, feel encouraged to strike out on your own.
Also, we have a very limited time for the workshop, so I’m not sure how far we’ll get with the 3 scripts that I’ve written for the occasion. But even if we don’t get through them all, you should have enough expertise and confidence when we’re done to run them on your own regardless.
Before getting into some specifics about using R, I want to briefly address the issue of why we would bother using it in the first place. After all, if one is unfamiliar with coding, learning even the basics represents an investment of time and effort, which can be a barrier to entry. So going in, we might want to know what the payoffs are going to be.
For many tasks, an off-the-shelf concordancer may be all that you need. Laurence Anthony and his collaborators have generously made AntConc and other corpus tools freely available. Moreover, those tools have continued to be upgraded — not only keeping pace with changes to operating systems, but also adding more robust statistical operations like effect sizes.
I still often use AntConc to generate word lists, keyword lists, ngrams, and collocations when I’m working with a relatively small corpus and want to explore it using those well-established techniques. That said, off-the-shelf software is limited in the analytical tools it provides and the amount of data it can process.
Though R, too, is limited by the power of your computer, its ceiling is far higher. (I have processed corpora with far more than a billion words on my desktop without resorting to parallel processing or elastic computing, both of which you can do, by the way.)
Moreover, R allows you to process and analyze your data in any way that you can imagine, affording the researcher extraordinary flexibility. R is very good at pushing around data. Thus, sub-setting and transforming your data within the R environment is relatively easy. This eliminates the need to manipulate data in programs like Excel or Google Sheets.
Finally, R gives you control over your graphical outputs. The most widely used graphics package is ggplot2, which enables a wide range of visualizations and fine-grained options for manipulating the details of those visualizations.
With R, then, a single script can take us from a bunch of text files sitting in a folder on our desktop to a beautiful chart representing a pattern or finding we’ve discovered. In my opinion, it’s unmatched in its power and flexibility.
Tools of the trade
I’m going to give you some instructions about setting up R for the workshop, but I’d like to point out a couple of additional tools that (if you’re not already using them) you should be familiar with.
- A text editor
Whenever you are opening, saving, or preparing files to be processed either by a concordancer (like AntConc) or a programming environment (like R), you should NEVER do so in Word. Microsoft products like Word and Excel want to do Word-y and Excel-y things to the files, which can make them inhospitable to text processing. That processing often requires *.txt files with specific kinds of encoding. (For more information on *.txt files, see here and for more information on text encoding, see here.)
- Regular expressions
Regular expressions are ways of finding and replacing patterns in text. Rather than searching for an exact text string or word, we can search for complex patterns. For example, we can search for classes of things like a word character (using the symbol combination \w), a non-word character (\W), a number (\d), or all punctuation marks ([[:punct:]]). Combining these allows us to manipulate texts in useful ways. We often use regular expressions in R. But they can also be used in TextWrangler/BBEdit, Sublime Text, and AntConc.
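As a rough illustration of the character classes mentioned above, here is a minimal sketch using the base R functions gsub, grepl, and strsplit. The sample sentence is made up for the example. Note that inside an R string each backslash has to be doubled, so \d is typed as "\\d".

```r
# A made-up sentence, for illustration only
txt <- "In 2019, the workshop had 25 attendees!"

# Remove all punctuation marks
no_punct <- gsub("[[:punct:]]", "", txt)
print(no_punct)

# Replace every digit with the symbol #
masked <- gsub("\\d", "#", txt)
print(masked)

# Check whether the text contains a run of four digits (like a year)
has_year <- grepl("\\d{4}", txt)
print(has_year)

# Split on runs of non-word characters to get rough word tokens
words <- strsplit(txt, "\\W+")[[1]]
print(words)
```

You don’t need to follow every detail yet. The point is only that a short pattern can stand in for a whole class of characters.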
- Stack Overflow
Stack Overflow is a community site where coding questions get asked and answered. It is the place I go to when I need to look up how to do something like sort a data frame by multiple columns. (Don’t worry if you don’t know what these terms mean right now.) Just use your Google skills (search something like: sort data frame multiple columns r), and scan the results for something from stackoverflow.com (like this). You’ll generally find multiple answers, and one with a big green check-mark next to it. That is the answer that has been voted the best (though there are usually other good answers, too). If you understand basic R syntax, you can make one or two adjustments to the relevant lines of code, and you’re off to the races. I can’t recall a single question that I’ve had for which I couldn’t find a posted response. If you have a question, chances are someone’s had the same one. And the coding community is very generous with its knowledge. It is really through this process that I (and many, many others) have learned to code. Another good resource is the Cookbook for R.
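To give a sense of what an adapted answer to the search above might look like, here is a minimal sketch of sorting a data frame by multiple columns with the base R function order. The data frame and its column names are invented for the example.

```r
# A small made-up data frame, for illustration only
df <- data.frame(
  genre  = c("news", "fiction", "news", "fiction"),
  year   = c(2005, 1999, 2001, 2010),
  tokens = c(1200, 3400, 900, 2800)
)

# Sort by genre (A to Z), then by year within each genre (oldest first)
sorted <- df[order(df$genre, df$year), ]
print(sorted)

# Same grouping, but most recent year first
# (the minus sign reverses the order of a numeric column)
sorted_desc <- df[order(df$genre, -df$year), ]
print(sorted_desc)
```

Adapting a snippet like this usually means changing only the name of the data frame and the names of the columns.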
- Vignettes
Vignettes are walk-throughs of packages (more on what these are in a moment). Sometimes they’re websites. Sometimes they’re PDFs. And they take you through the functionalities of a package by introducing some sample tasks and using some sample data. In addition to the documentation that is available for every R package (in PDF form), vignettes are very helpful in providing model code that can be adapted for your own data, as well as for understanding a package’s potential applications.
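Vignettes that are installed along with a package can also be listed from within R. The following is a minimal sketch using the base function vignette. The grid package is used here only because it comes bundled with R; that choice is mine, and once a package like quanteda is installed you could substitute its name.

```r
# Ask R which vignettes are installed for one package
v <- vignette(package = "grid")

# The result stores a table of vignette names ("Item") and titles
print(v$results[, c("Item", "Title")])
```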
To do before the workshop
Before the workshop, you’ll need to get R loaded and ready on your computer and download a zip file containing some code and data that we’ll be using. You can do all of these without any code — just a little typing and clicking. It should take you no more than 15 minutes or so. Follow these steps.
- Download and install R (following this link) for your operating system.
- Download and install the free version of R Studio (following this link) for your operating system.
- Download the wf_workshop zip file.
- Unzip the wf_workshop file (by double clicking for a Mac, by right clicking and extracting all for Windows).
- Place the unzipped folder in your “Documents” folder. Don’t rename anything or create any new folders. Just drag it to your “Documents”. I’ll explain why shortly.
- Open R Studio. (If you have a Mac, you’ll need to allow the application to open, as it hasn’t been installed through the App Store. Depending on your security settings this may simply mean clicking a button that appears after a warning or opening the “Security and Privacy” settings in your System Preferences.)
It should look something like this:
The bottom right quadrant has a tab showing a list of the packages that are installed in your library. Packages are collections of pre-compiled functions that we can install in our library and call when we want to use them. The code that they contain is often complex. But the wonderful thing about the R community is that others have taken care of that for us. All we need to do is install them (and understand how to use them). There are a bunch that folks have written for text processing and corpus analysis. Among them:
We’re going to be using quanteda, as well as a couple of other packages, which you’ll need to install. Again, no code is required.
- Click on the install button on the “Packages” tab (circled in red in the above image). A dialogue box will open like this:
- Start typing “quanteda” and the package installer will autofill. Select “quanteda” and click install. The installer will go to work. In your workspace, you will see R going through its routine. Just wait for it to complete. You’ll know it has finished when the greater-than symbol (>) appears on the left with the cursor blinking next to it. Something like this:
- Install the “readtext” package, repeating the process you used to install quanteda.
- Install the “syuzhet” package, repeating the process you used to install quanteda.
- Install the “ggplot2” package, repeating the process you used to install quanteda.
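If you would rather type than click, the same installs can be done from the console. This is entirely optional. What follows is a minimal sketch using the base R functions requireNamespace and install.packages; it checks which of the four packages are already in your library and installs only the missing ones (which requires an internet connection). The interactive() check is my addition, so that installation only happens when you run this yourself in a session like R Studio.

```r
# The four packages used in the workshop
pkgs <- c("quanteda", "readtext", "syuzhet", "ggplot2")

# For each package, check whether it is already installed (TRUE or FALSE)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Install whatever is missing
to_install <- pkgs[!installed]
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```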
You should be all ready for the workshop, having done the following:
- Installed R and R Studio
- Installed the packages quanteda, readtext, syuzhet, and ggplot2
- Downloaded the wf_workshop files, unzipped them, and placed them in your “Documents” folder.
I am now going to go over some basics of R scripts. If you’ve never coded before, it might be helpful to read before we meet. But it’s not necessary.
Basics of R scripts
A script is just a series of commands that we write and execute. When we’re working with text, they take us through loading our data, processing it, running our statistical analyses, creating our visualizations, and saving our outputs. They can be just a few lines of code or many.
In our workshop, we’ll be working with 3 (all in the “code” folder):
If you’ve never written (or maybe even closely looked at) code, it’s helpful to know a few basics.
- You’re not going to break anything.
For the uninitiated, running code can sometimes seem magical and, by implication, capable of unforeseen consequences. Don’t worry. If you get some sort of error message, find the problem, fix it, and try again. The only times you need to be really careful are when you are moving, deleting, or writing files. What you delete through R doesn’t go in the Trash; it is gone. The same is true if you overwrite a file (which is why you need to be careful what you name things and where you put them when you save them).
We’re not going to be doing any of those things (aside from saving a couple of tables and images). So if any of this makes you anxious, you can relax.
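For those curious about what being careful with file names can look like in practice, here is a minimal sketch that checks whether a file already exists before saving a table. It works in R’s temporary folder, so nothing on your computer is touched; the file name and the tiny table are made up for the example.

```r
# Build a path inside R's temporary folder
out_file <- file.path(tempdir(), "example_table.csv")

# A tiny made-up table to save
example_table <- data.frame(word = c("the", "of"), count = c(10, 7))

# Only write the file if nothing with that name exists yet
if (file.exists(out_file)) {
  message("Not saving: a file with this name already exists: ", out_file)
} else {
  write.csv(example_table, out_file, row.names = FALSE)
  message("Saved: ", out_file)
}
```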
- There are many ways of accomplishing the same thing.
Whatever our goal with a piece of code, there are generally many ways of getting there. Some solutions are elegant. Some are faster and more efficient. I generally opt for clarity when I code. I reuse and re-purpose bits of code all the time, and I like it to be clear (at least to me) what the code is doing at each step.
I generally don’t worry too much about speed (and we won’t with the small data sets we’re using in the workshop). The exception, of course, is when I work with a very large corpus (hundreds of millions or billions of words). If you have plans to work with that amount of data eventually, you can think about benchmarking as you develop your expertise.
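As a small illustration of two routes to the same result, and of a very simple kind of benchmarking, here is a sketch that adds up some numbers with a loop and with the built-in sum function. The helper name sum_with_loop is hypothetical, written just for this example; system.time is a base R function that reports how long an expression takes to run.

```r
x <- 1:10000

# Way 1: an explicit loop, adding one number at a time
sum_with_loop <- function(values) {
  total <- 0
  for (value in values) {
    total <- total + value
  }
  total
}

# Way 2: the built-in function
total_loop <- sum_with_loop(x)
total_sum  <- sum(x)

# Both routes arrive at the same answer
print(total_loop == total_sum)

# Compare how long each takes (the numbers will vary by computer)
print(system.time(sum_with_loop(x)))
print(system.time(sum(x)))
```

With data this small, the difference in timing is negligible, which is why clarity is a reasonable priority.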
- Start small even if your eventual goals are big.
When executing code for the first time (whether code you’ve written yourself or borrowed from someone else), test it on a small subset of your data. Copy some data into a test folder that you can operate on until you’ve got your code perfected. Then, you can point it to the full data set. If your data set is small to begin with, of course, this doesn’t much matter. But as your code gets more complex and the data larger, you save a lot of time by testing it first.
On a related note, always keep a copy of your original data separate from the data you’re processing. I usually have a project folder with one copy and an archive/repository folder with the original. That way, you don’t have to be concerned if you make changes to the data that you later realize are not what you want.
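Copying a few files into a test folder can itself be done in R. The sketch below is one way of doing it, under the assumption that your texts are *.txt files in a single folder. To keep it self-contained, it first creates a pretend corpus of ten made-up files in R’s temporary folder; the folder and file names are invented for the example.

```r
# Set up a pretend corpus and an empty test folder in R's temporary folder
corpus_dir <- file.path(tempdir(), "full_corpus")
test_dir   <- file.path(tempdir(), "test_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
dir.create(test_dir, showWarnings = FALSE)
for (i in 1:10) {
  writeLines(paste("This is text number", i),
             file.path(corpus_dir, paste0("text_", i, ".txt")))
}

# List the text files in the full corpus and pick the first three
all_files  <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)
test_files <- head(all_files, 3)

# Copy (not move) them into the test folder; the originals stay where they are
copied <- file.copy(test_files, test_dir)
print(list.files(test_dir))
```

Because file.copy leaves the originals in place, the full corpus is unchanged after the test folder is filled.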
- Opening an R Script.
To open a script, simply go to the “File” menu, select “Open File” and navigate to the script you want to open. To start a new script, go to the “File” menu and select “New File” and “R Script”:
Your workspace will look something like this:
Now you can start writing code (at the place circled in red).
- Syntax basics.
Let’s start with a simple example.
c("a", "b", "c")
The function that we’re using is c — which is a method for combining things. In R, functions take arguments, which are the elements in parentheses. Some functions take a single argument, some many. In addition to those that are required, in some cases you can have optional arguments, too. (Some basic functions are listed here and here.) The function c will take as many objects as we put in and concatenate them.
Arguments are separated by commas, and spaces don’t matter. Neither do returns. Sometimes, with elaborate arguments, you’ll see them written with returns after the commas, like this:
c("a",
  "b",
  "c")
This kind of formatting is just to make the code more legible to humans. What matters to the computer is that we have the correct arguments in the correct order, that they are all separated by commas, and that for every opening parenthesis (or bracket or square bracket) we have a closing one. (Also, note that the comma comes after the quotation mark.)
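To see that the layout makes no difference to R, here is a minimal sketch that stores both versions and compares them. The object names one_line and multi_line are made up for the example; identical and length are base R functions.

```r
# The same three items, written on one line...
one_line <- c("a", "b", "c")

# ...and written with returns after the commas
multi_line <- c("a",
                "b",
                "c")

# identical() checks whether two objects are exactly the same
print(identical(one_line, multi_line))  # TRUE

# length() confirms that c combined three separate items
print(length(one_line))                 # 3
```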
Here is another simple example:
sum(1, 2, 3)
This function is even more obvious. It sums the values that appear in its arguments. Notice that for numeric values, there are no quotation marks. Quotation marks appear around character values.
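The difference between numeric and character values can be checked directly. Here is a minimal sketch using the base R functions class and as.numeric; the object name total is made up for the example.

```r
# Summing three numeric values
total <- sum(1, 2, 3)
print(total)         # 6

# Without quotation marks, 3 is a number; with them, it is a character
print(class(3))      # "numeric"
print(class("3"))    # "character"

# sum(1, 2, "3") would stop with an error, because "3" is a character.
# Converting it to a number first makes it usable again.
print(sum(1, 2, as.numeric("3")))  # 6
```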
- Executing code in R Studio.
Executing code is easy. Let’s take this last example. We can copy that snippet of code into our script in R Studio. Then, we highlight or select that line of code and click the “Run” button:
Doing so results in this output in the console (bottom left):
We can execute one line or many (whatever we select). Just be sure to select all of what you want to run. If we wanted to execute the second function above (with the returns after the commas), we would need to highlight all 3 lines including the closing parenthesis.
- Creating data objects and doing stuff with them.