Workshop: Introduction to Text Analysis with R
If you are planning on attending the workshop at Wake Forest University, follow the 11 steps under the “To do before the workshop” heading prior to our meeting. The steps walk you through a few simple installation instructions and the downloading of the files we will be using.
For this process (and for the workshop) you don’t need to know anything about coding or analyzing text. It should only take about 15-30 minutes.
In addition to the set-up process, I’m including sections covering:
- Why R
- Tools of the trade
- Basics of R scripts
- Workshop overview
- Walk-throughs of the 3 scripts we will be using
These are to help anyone who hasn’t coded before, who can’t attend the workshop in person, or who wants some reminders of what we covered. You may also want to look over the material in preparation for the workshop, though that’s not necessary.
The goal of the workshop is to introduce you to some of the basic workings of R, as well as to some of the possibilities that R offers the would-be text analyst. Using some sample data sets that include both academic writing and fiction, we will learn how to import and export text-as-data, how to use that data to generate counts and make comparisons, and ultimately how to export our results either as tables or as various kinds of plots. In addition to foundational corpus techniques like keywords, the workshop will touch on sentiment analysis and cluster analysis.
Given the time constraints, we may not make it through all of the material. But the hope is that you will come out of the workshop equipped to proceed through the rest at your own pace, and more importantly to explore R for your own purposes and to develop your own expertise.
To do before the workshop
Before the workshop, you’ll need to get R loaded and ready on your computer and download a zip file containing some code and data that we’ll be using. You can do all of these without any code — just a little typing and clicking. It should take you no more than 15 minutes or so. Follow these steps.
- Download and install R (following this link) for your operating system.
- Download and install the free version of R Studio (following this link) for your operating system.
- Download the wf_workshop zip file.
- Unzip the wf_workshop file (by double clicking for a Mac, by right clicking and extracting all for Windows).
- Place the unzipped folder in your “Documents” folder. Don’t rename anything or create any new folders. Just drag the “wf_workshop” folder to your “Documents” folder. It should look like this:
Note that “Extract All…” in Windows sometimes creates a duplicate folder. Move only the one containing the “code”, “data”, and “output” folders.
- Open R Studio. (If you have a Mac, you’ll need to allow the application to open, as it hasn’t been installed through the App Store. Depending on your security settings this may simply mean clicking a button that appears after a warning or opening the “Security and Privacy” settings in your System Preferences.)
It should look something like this:
The bottom right quadrant has a tab showing a list of the packages that are installed in your library. Packages are collections of pre-compiled functions that we can install in our library and call when we want to use them. The code that they contain is often complex. But the wonderful thing about the R community is that others have taken care of that for us. All we need to do is install them (and understand how to use them). There are a variety of packages for text processing and corpus analysis. Among them:
- tm (blog post here)
- koRpus (vignette here)
- quanteda (documentation here)
We’re going to be using quanteda, as well as a couple of other packages, which you’ll need to install. Again, no code is required.
- Click on the install button on the “Packages” tab (circled in red in the above image). A dialogue box will open like this:
- Start typing “quanteda” and the package installer will autofill. Select “quanteda” and click install. The installer will go to work. In your workspace, you will see R going through its routine. Just wait for it to complete. You’ll know it’s finished when the greater-than symbol appears on the left with the cursor blinking next to it. Something like this:
- Install the “readtext” package, repeating the process described in 7 & 8.
- Install the “syuzhet” package, repeating the process described in 7 & 8.
- Install the “ggplot2” package, repeating the process described in 7 & 8.
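Alternatively, if you are comfortable pasting a single line of code into the R Studio console, the same four packages can be installed with the install.packages() function (it accomplishes the same thing as the point-and-click steps above):

install.packages(c("quanteda", "readtext", "syuzhet", "ggplot2"))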
You should be all ready for the workshop, having done the following:
- Installed R and R Studio
- Installed the packages quanteda, readtext, syuzhet, and ggplot2
- Downloaded the wf_workshop files, unzipped them, and placed them in your “Documents” folder.
I am now going to go over some basics of R scripts. If you’ve never coded before, it might be helpful to read before we meet. But it’s not necessary.
Why R?
Before getting into some specifics about using R, I want to briefly address the issue of why we would bother using it in the first place. After all, if one is unfamiliar with coding, learning even the basics presents an investment of time and effort, which can be a barrier to entry. So going in, we might want to know what the payoffs are going to be.
For many tasks, an off-the-shelf concordancer may be all that you need. Laurence Anthony and his collaborators have generously made AntConc and other corpus tools freely available. Moreover, those tools have continued to be upgraded — not only keeping pace with changes to operating systems, but also adding more robust statistical operations like effect sizes.
I still often use AntConc to generate word lists, keyword lists, ngrams, and collocations when I’m working with a relatively small corpus and want to explore it using those well-established techniques. That said, off-the-shelf software is limited in the analytical tools it provides and the amount of data it can process.
Though R, too, is limited by the power of your computer, its ceiling is far higher. (I have processed corpora with far more than a billion words on my desktop without resorting to parallel processing or elastic computing — both of which you can do, by the way.)
Moreover, R allows you to process and analyze your data in any way that you can imagine, affording the researcher extraordinary flexibility. R is very good at pushing around data. Thus, sub-setting and transforming your data within the R environment is relatively easy. This eliminates the need to manipulate data in programs like Excel or Google Sheets.
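As a small illustration (using a made-up data frame rather than the workshop data), sub-setting in base R can be as simple as:

word_counts <- data.frame(word = c("a", "of", "wife"), count = c(4, 2, 1))
subset(word_counts, count > 1) # keep only the rows with a count greater than 1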
Finally, R gives you control over your graphical outputs. The most widely used graphics package is ggplot2, which enables a wide range of visualizations and fine-grained options for manipulating the details of those visualizations.
With R, then, we can go from a bunch of text files sitting in a folder on our desktop to a beautiful chart representing a pattern or finding we’ve discovered in a single script. In my opinion, it’s unmatched in its power and flexibility.
Tools of the trade
If you are interested in experimenting with R for text/corpus analysis, there are a couple of additional tools that (if you’re not already using them) you should be familiar with.
– A text editor
- For Mac users, either TextWrangler or BBEdit (both of which are free through the App Store)
- For Windows users, Sublime Text
Whenever you are opening, saving, or preparing files to be processed either by a concordancer (like AntConc) or a programming environment (like R), you should NEVER do so in Word. Microsoft products like Word and Excel want to do Word-y and Excel-y things to the files, which can make them inhospitable to text processing. That processing often requires *.txt files with specific kinds of encoding. (For more information on *.txt files, see here and for more information on text encoding, see here.)
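If you ever need to read or write plain text from within R itself, base functions like readLines() and writeLines() handle *.txt files directly (the file names below are just placeholders):

my_text <- readLines("my_file.txt", encoding = "UTF-8") # read a plain-text file, treating it as UTF-8
writeLines(my_text, "my_file_copy.txt", useBytes = TRUE) # write it back out without re-encoding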
– Regular expressions
Regular expressions are ways of finding and replacing patterns in text. Rather than searching for an exact text string or word, we can search for complex patterns. For example, we can search for classes of things like a word character (using the symbol combination \w), a non-word character (\W), a number (\d), or all punctuation marks ([[:punct:]]). Combining these allows us to manipulate texts in useful ways. We often use regular expressions in R. But they can also be used in TextWrangler/BBEdit, Sublime Text, and AntConc.
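Here is a small taste of what that looks like in R (the sentence is invented, and gsub() is the base find-and-replace function that we will meet again in the scripts):

x <- "The 3 cats, the 2 dogs, and the 1 bird."
gsub("\\d", "", x) # strip the digits
gsub("[[:punct:]]", "", x) # strip the punctuation
gsub("\\W+", " ", x) # collapse every run of non-word characters into a single space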
– Stack Overflow
Stack Overflow is a community site where coding questions get asked and answered. It is the place I go to when I need to look up how to do something like sort a data frame by multiple columns. (Don’t worry if you don’t know what these terms mean right now.) Just use your Google skills (search something like: sort data frame multiple columns r), and scan the results for something from stackoverflow.com (like this). You’ll generally find multiple answers, and one with a big green check-mark next to it. That is the answer that has been voted the best (though there are usually other good answers, too). If you understand basic R syntax, you can make one or two adjustments to the relevant lines of code, and you’re off to the races. I can’t recall a single question that I’ve had for which I couldn’t find a posted response. If you have a question, chances are someone’s had the same one. And the coding community is very generous with its knowledge. It is really through this process that I (and many, many others) have learned to code. Another good resource is the Cookbook for R.
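For instance, the accepted answer to that particular question usually boils down to a single line of base R. Here is a sketch with an invented data frame (not part of the workshop scripts):

df <- data.frame(discipline = c("BIO", "ENG", "BIO"), word_count = c(10, 25, 40))
df[order(df$discipline, -df$word_count), ] # sort by discipline, then by descending word count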
– Vignettes
Vignettes are walk-throughs of packages (more on what these are in a moment). Sometimes they’re websites. Sometimes they’re PDFs. And they take you through the functionalities of a package by introducing some sample tasks and using some sample data. In addition to the documentation that is available for every R package (in PDF form), vignettes are very helpful in providing model code that can be adapted for your own data, as well as for understanding a package’s potential applications.
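Once a package is installed, R can also tell you whether it ships any vignettes. For example (using quanteda here purely as an illustration; what you see depends on what the package includes):

vignette(package = "quanteda") # list any vignettes included with the package
browseVignettes("quanteda") # open them in your web browser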
Basics of R scripts
A script is just a series of commands that we write and execute. When we’re working with text, they take us through loading our data, processing it, running our statistical analyses, creating our visualizations, and saving our outputs. They can be just a few lines of code or many.
In our workshop, we’ll be working with 3 (all in the “code” folder):
- text_intro.R
- corpus_intro.R
- sentiment_intro.R
If you’ve never written (or maybe even closely looked at) code, it’s helpful to know a few basics.
– You’re not going to break anything.
For the uninitiated, running code can sometimes seem magical, and therefore capable of having unforeseen consequences. Don’t worry. If you get some sort of error message, find the problem, fix it, and try again. The only times you need to be really careful are when you are moving, deleting, or writing files. What you delete through R doesn’t go in the Trash; it is gone. The same is true if you over-write a file (which is why you need to be careful about what you name things and where you put them when you save them).
We’re not going to be doing any of those things (aside from saving a couple of tables and images). So if any of this makes you anxious, you can relax.
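If you do eventually write scripts that save files, one cautious habit is to check whether a file already exists before writing to it. A minimal sketch (the data frame and file name here are made up for illustration):

my_results <- data.frame(word = c("a", "of"), count = c(4, 2))
if (!file.exists("output/my_results.csv")) {
  write.csv(my_results, file = "output/my_results.csv", row.names = FALSE)
}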
– There are many ways of accomplishing the same thing.
Whatever our goal with a piece of code, there are generally many ways of getting there. Some solutions are elegant. Some are faster and more efficient. I generally opt for clarity when I code. I reuse and re-purpose bits of code all the time, and I like it to be clear (at least to me) what the code is doing at each step.
I generally don’t worry too much about speed (and we won’t with the small data sets we’re using in the workshop). The exception, of course, is when I work with a very large corpus (hundreds of millions or billions of words). If you have plans to work with that amount of data eventually, you can think about benchmarking as you develop your expertise.
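When speed does start to matter, base R’s system.time() function is the simplest way to see how long a piece of code takes to run. For example (a toy expression, not workshop code):

system.time(strsplit(rep("it is a truth universally acknowledged", 100000), " "))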
– Start small even if your eventual goals are big.
When executing code for the first time (whether code you’ve written yourself or borrowed from someone else), test it on a small subset of your data. Copy some data into a test folder that you can operate on until you’ve got your code perfected. Then, you can point it to the full data set. If your data set is small to begin with, of course, this doesn’t much matter. But as your code gets more complex and the data larger, you save a lot of time by testing it first.
On a related note, always keep a copy of your original data separate from the data you’re processing. I usually have a project folder with one copy and an archive/repository folder with the original. That way, you don’t have to be concerned if you make changes to the data that you later realize are not what you want.
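One easy way to do this in R, using the list.files() function that appears later in the workshop scripts, is simply to take the first few files from your file list (the directory name below is just a placeholder):

all_files <- list.files("data/my_corpus/", pattern = "*.txt", full.names = TRUE)
test_files <- head(all_files, 10) # develop and debug your code on 10 files before running it on everything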
– Opening an R Script.
To open a script, simply go to the “File” menu, select “Open File” and navigate to the script you want to open. To start a new script, go to the “File” menu and select “New File” and “R Script”:
Your workspace will look something like this:
Now you can start writing code (at the place circled in red).
– Syntax basics.
Let’s start with a simple example.
c("a", "b", "c")
The function that we’re using is c — which is a method for combining things. In R, functions take arguments, which are the elements in parentheses. Some functions take a single argument, some many. In addition to those that are required, in some cases you can have optional arguments, too. (Some basic functions are listed here and here.) The function c will take as many objects as we put in and concatenate them.
Arguments are separated by commas, and spaces don’t matter. Neither do returns. Sometimes, with elaborate arguments, you’ll see them written with returns after the commas, like this:
c("a",
"b",
"c")
This kind of formatting is just to make the code more legible to humans. What matters to the computer is that we have the correct arguments in the correct order, that they are all separated by commas, and that for every opening parenthesis (or bracket or square bracket) we have a closing one. (Also, note that the comma comes after the quotation mark.)
Here is another simple example:
sum(1, 2, 3)
This function is even more obvious. It sums the values that appear in its arguments. Notice that for numeric values, there are no quotation marks. Quotation marks appear around character values.
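Some functions also take optional, named arguments (as mentioned above). These two lines aren’t part of the workshop scripts, but they illustrate the idea: sum() has an optional na.rm argument for ignoring missing values (NA), and round() takes an optional number of digits.

sum(1, 2, NA, na.rm = TRUE) # returns 3, ignoring the missing value
round(3.14159, digits = 2) # returns 3.14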
– Executing code in R Studio.
Executing code is easy. Let’s take this last example. We can copy that snippet of code into our script in R Studio. Then, we highlight or select that line of code and click the “Run” button:
Doing so results in this output in the console (bottom left):
We can execute one line or many (whatever we select). Just be sure to select all of what you want to run. If we wanted to execute the second function above (with the returns after the commas), we would need to highlight all 3 lines including the closing parenthesis.
– Creating data objects and doing stuff with them.
Often, we don’t simply want to execute a function. We want to save the result so that we can do something more with it. For example, we don’t generally want to just have R read a text. We want to save that text so we can do things to it like count words.
To create a data object is easy. Here we save the result of our sum:
my_result <- sum(1, 2, 3)
We are naming our object “my_result”. The name is followed by a less-than sign and a hyphen (creating a kind of arrow). Selecting and running that line of code produces a new object in our environment (upper right):
Once we’ve created the object, we can do things with it. This would simply multiply it by 2:
my_result*2
This would create a duplicate object with a new name:
new_result <- my_result
Or we could name our concatenated characters (rather than the numeric value). We could then check the properties of our new object using “summary” and count the elements in our object using “length” (which, of course, is 3).
my_characters <- c("a", "b", "c")
summary(my_characters)
length(my_characters)
In R, we can create data objects of many different types. (Here’s a list of some of the most common ones.) Some are more list-like (lists and vectors), while others are more table-like (data frames and matrices). Not all functions work with all types of objects. So some of our work is often converting one type of object (like a “text” or character string) into another (like a “table” or data frame of word counts).
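As a quick sketch (not part of the workshop scripts), here is what creating a few of those object types looks like, along with the "class" function for checking what kind of object you have:

my_vector <- c(1, 2, 3) # a vector: a simple sequence of values of one type
my_list <- list("a", c(1, 2), TRUE) # a list: elements can be of different types and lengths
my_df <- data.frame(word = c("a", "of"), count = c(4, 2)) # a data frame: a table with named columns
my_matrix <- matrix(1:6, nrow = 2) # a matrix: a table of values of a single type
class(my_df) # check an object's type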
Workshop overview
For the workshop, we’ll be going through 3 scripts:
- text_intro.R
- corpus_intro.R
- sentiment_intro.R
Given our time constraints, I doubt that we’ll make it through all 3. However, they contain a fair bit of commentary that is meant to explain at each step what the code is doing. The hope is that you’ll be able to proceed through whatever we don’t complete on your own.
A couple of things to note. At the beginning of each script, you’ll see the following lines:
# Run this line if you have a Mac OS to set your working directory to the "wf_workshop" folder.
setwd("~/Documents/wf_workshop")

# Run this line if you have a Windows OS to set your working directory to the "wf_workshop" folder.
setwd("~/wf_workshop")
Only run the line for your operating system (line 2 for a Mac, line 5 for Windows). These lines set the working directory, telling R where to find and save files later in the script. If you get an error, be sure that you’ve placed the unzipped “wf_workshop” folder in “Documents” as I explained above.
You’ll also note the lines beginning with the pound sign (#). That symbol “comments out” a line, so that R essentially skips over it. Even if you run it, nothing will happen.
text_intro.R
The first script introduces some basics of using R to process text:
- Working with character strings.
- Defining word boundaries (or text segmentation).
- Calculating word frequencies.
- Normalizing word frequencies.
- Using regular expressions to transform text.
- Creating our own user-defined functions.
The purpose here is just to get you familiar with how R works, and some of the ways we can use R to manipulate text and make calculations.
# Run this line if you have a Mac OS to set your working directory to the "wf_workshop" folder.
setwd("~/Documents/wf_workshop")

# Run this line if you have a Windows OS to set your working directory to the "wf_workshop" folder.
setwd("~/wf_workshop")

# Let's begin by creating an object consisting of a character string.
# In this case, the first sentence from Pride and Prejudice.
text <- ("It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.")

# Using "summary" we can access the object's ("text") values.
summary(text)

# We can also call up the object itself.
text

# Doing things like counting words requires us to split a long string of text into individual constituents or "words".
# To do this, we can use the function "strsplit", which requires us to specify the object we want to split ("text") and where to split it (at spaces).
text_split <- strsplit(text, " ")

# If we call up the object, you'll notice that the result of this is a list, with elements that include punctuation (like "acknowledged,").
text_split

# If we want to specify a different word boundary (say at anything that isn't a letter), we can use a regular expression.
text_split <- strsplit(text, "\\W")

# Now you'll note we've gotten rid of the punctuation, but we have created a couple of blank elements where we had sequences of punctuation marks followed by spaces.
text_split

# One way to create a cleaner list is to prep our original text before splitting it.
# First we can eliminate punctuation using the gsub function.
# This asks us to specify what we want to replace (punctuation), what we want to replace it with (nothing), and the object we want the function to operate on ("text").
# And we'll create a new object ("text_edited").
text_edited <- gsub("[[:punct:]]", "", text)

# Now we can convert that output into all lower case using "tolower".
text_edited <- tolower(text_edited)

# Let's take a look at the result.
text_edited

# And now let's split that at each space.
text_split <- strsplit(text_edited, " ")

# The result is a list with 23 elements.
text_split

# Note in the data space in the upper right of your workspace or in the summary output that we have a "List of 1".
summary(text_split)

# To easily manipulate and extract values, we can change our data into a different type or class of object.
# Here we use the function "unlist".
# After unlisting it, note how its position has changed in the "Global Environment".
text_split <- unlist(text_split)

# In the summary, you'll see that our object now has a length of 23.
# This is a useful value to access, as it tells us the number of constituent elements in our object.
summary(text_split)

# Let's calculate the total number of words in our sentence using the function "length".
total_words <- length(text_split)

# We can also identify each distinct word using the "unique" function.
# This is data we'd want for a word list, for example.
word_types <- unique(text_split)

# How many different types do we have?
# Again, we can use "length".
total_types <- length(word_types)

# The number should be 18.
total_types

# For a type-to-token ratio, we can now perform a basic mathematical operation, dividing our "total_types" by our "total_words".
total_types/total_words

# Now let's work with a different kind of data object.
# Data frames are like tables, allowing us to sort and calculate across rows and down columns.
# We can coerce one from our list of unique words using "as.data.frame".
corpus_data <- as.data.frame(word_types)

# Next let's look at the output from the "match" function.
# The function sequences along the first argument ("text_split"), which is our original sentence broken up into its component words.
# Here we're looking for matches to the 3rd row from the first column in our data frame ("a").
match(text_split, corpus_data[3,1])

# The output says: no, no, yes!...
# You can see how we can use this to count words.
# We simply want to sum all of the matches (which for "a" would be 4).

# To do that, let's write a simple function.
# Our function (which we'll call "counts") will take one argument ("x").
# First, we want to find matches between our text ("text_split") and "x".
# Then, we want to take that result ("word_matches") and find its sum, ignoring or removing all NAs ("na.rm = TRUE").
# Finally, we want to return that sum ("matches_total").
# The whole function needs to be enclosed in brackets "{}".
counts <- function(x){
  word_matches <- match(text_split, x)
  matches_total <- sum(word_matches, na.rm = TRUE)
  return(matches_total)
}

# Now we want to iterate through our data row by row. We want to "apply" the function.
# And R has a variety of methods for doing this: among them "apply", "sapply", and "mapply".
# Here, we use "mapply", which is a version of "sapply" for multiple variables.
# We simply tell it to apply the "counts" function to the "word_types" column in our "corpus_data" data frame.
# We can access any column by name using the "$" operator.
word_count <- mapply(counts, corpus_data$word_types)

# To attach the result to our data frame we can use the function "cbind", which joins data by column.
# As you might guess, "rbind" joins data by row.
corpus_data <- cbind(corpus_data, word_count)

# To check our result, we can use the "$" operator again, this time to sum our column of word counts.
# It should match "total_words", which we calculated earlier using "length".
sum(corpus_data$word_count)

# Let's create another function to normalize our counts.
# Again, we'll assign only one argument to our function, "x".
# We want to divide "x" by the total number of words in our text, which we've already stored as "total_words".
# Then, we'll multiply that by a normalizing factor of 100, giving us the percent (or frequency per 100 words).
# Finally we'll use the "round" function to round the result to 2 decimal places.
normalize <- function (x) {
  normal <- (x/total_words)*100
  normal <- round(normal, 2)
  return(normal)
}

# Again, we use "mapply" to iterate down a column (the "word_count" column) in our data frame.
frequency_norm <- mapply(normalize, corpus_data$word_count)

# We'll attach the result using a different technique this time.
# Rather than "cbind", we can use the column operator ("$") to add a column with the same name as our result ("frequency_norm").
corpus_data$frequency_norm <- frequency_norm

# We can sum the column to check our calculations.
# Looks good with a small rounding error.
sum(corpus_data$frequency_norm)

# To conclude our brief introduction, let's take a look at how we can order our data frame.
# You might have noted that our data frame remains in the order that it was originally created.
# You can open it by clicking on the grid icon to the right in the "Global Environment".
# And once open, there are arrows that enable you to sort the various columns.
# However, sorting in this way is only temporary.
# So let's learn one way of sorting our data frame to prepare it for output.
# We're going to use the "order" function.
# If we execute the "order" function on the "word_types" column, you'll see that it produces just a sequence of numbers.
order(corpus_data$word_types)

# These are our row indexes.
# If you open the data frame by clicking on it, you'll see the numbers on the left.
# Row 3 is "a" in the "word_types" column; 6 is "acknowledged"; and so on.
# The number sequence is simply the indexes arranged by "word_types" (in alphabetical order).

# So what if we want to see not just the indexes, but the entire data frame?
# To understand how this works, you need to know how to specify rows and columns in a data frame.
# Square brackets "[ ]" after a data frame indicate rows and columns, which are separated by a comma.
# So to see row 3, column 2, we can do this, which shows the 2nd column ("word_count") for the 3rd row.
corpus_data[3,2]

# If we don't put any number before the comma, we return all rows of the second column.
corpus_data[,2]

# This is identical to using the column operator ("$") to specify a column by name.
corpus_data$word_count

# Likewise, we could put nothing after the comma to return all columns of the 3rd row.
corpus_data[3,]

# Knowing that syntax, we can understand what we're doing here.
# We want to order the rows by "word_types" and return all columns of our data frame.
# Thus, the order function goes before the comma, and nothing goes after it.
corpus_data[order(corpus_data$word_types),]

# We can use the same principle to order the data frame by "word_count".
corpus_data[order(corpus_data$word_count),]

# Note, however, that this only produces but does not preserve the reordering.
# It also orders from low-to-high.
# To preserve the order, we assign the ordering to our "corpus_data" object.
# To reverse the order, we use the minus sign.
corpus_data <- corpus_data[order(-corpus_data$word_count),]

# Let's check the result.
View(corpus_data)
You should see something that looks like this table:
word_types | word_count | frequency_norm |
---|---|---|
a | 4 | 17.39 |
in | 2 | 8.7 |
of | 2 | 8.7 |
it | 1 | 4.35 |
is | 1 | 4.35 |
truth | 1 | 4.35 |
universally | 1 | 4.35 |
acknowledged | 1 | 4.35 |
that | 1 | 4.35 |
single | 1 | 4.35 |
The last bit of our script simply saves our table to the “output” folder, which we can then open as a spreadsheet.
# Finally, we can save our data frame as a table using "write.csv".
# Because the names of the rows are simply numbers, we don't want to include them.
# We can specify this with "row.names".
write.csv(corpus_data, file="output/practice_data.csv", row.names = FALSE)

# Now you see how text processing works with R's basic functions.
# You can also see how this could get rather tedious and labor-intensive to create from scratch.
# Thankfully, the little we've done here (and much, much more) is built into packages like quanteda and tm.
# That is where we go in the next script: "corpus_intro.R".
Before opening and beginning the next script, use the broom icon (circled in red) to clear out our work environment:
corpus_intro.R
Now that you know a little about how R works and some of the things we can do with text, let’s help ourselves by using a package.
Packages contain pre-compiled functions that we can load and use. In the next script we’ll be using 3: readtext, quanteda, and ggplot2.
The first simply facilitates the loading of multiple texts (or a corpus). For some types of analysis, we might focus on a single text, but for others we might want to look for patterns or make comparisons across many texts. Here, we’ll be doing some of the latter.
The second, quanteda, is a very powerful and flexible suite of functions. This script makes use of only a few, but I would encourage you to visit the website to learn more about its potential applications. As I noted above, there are other similar packages (notably tm). They are also worth checking out.
The third, ggplot2, is a widely used and very flexible package for generating plots. It is also very well supported and documented.
In this script, we will learn how to:
- Load in a collection of texts
- Create and edit metadata about our texts
- Make some calculations based on our data summary
- Generate a list of word counts
- Compare subsets of our data to find keywords (or words more common in one set of texts versus another)
- Create and save different kinds of plots
- Calculate similarities among texts using cluster analysis
For our data, we will be using a small set of texts from the Michigan Corpus of Upper-Level Student Papers (MICUSP).
# To load the packages we'll be using, simply run all 3 lines.
library(readtext)
library(quanteda)
library(ggplot2)

# Run this line if you have a Mac OS to set your working directory to the "wf_workshop" folder.
setwd("~/Documents/wf_workshop")

# Run this line if you have a Windows OS to set your working directory to the "wf_workshop" folder.
setwd("~/wf_workshop")

# The first thing we are going to do is gather a list of files that we want to load into our corpus.
# We could point the "readtext" function to our "academic" directory, so why do it this way?
# In our current file structure there are no subfolders, which is easy for "readtext".
# However, if we did have subfolders, recursively iterating through those folders becomes increasingly complicated.
# Starting from a files list is a simple solution, no matter the underlying file structure of our corpus.
# And note that the "list.files" function allows us to specify the type of file we want to load, as well as whether we want to locate files in subfolders ("recursive").
data_dir <- list.files("data/academic/", pattern="*.txt", recursive = TRUE, full.names = TRUE)

# From the files list we can create our corpus object by combining the "corpus" and "readtext" functions.
# There are some advantages in separating these steps.
micusp_corpus <- corpus(readtext(data_dir))

# Now check the summary of our corpus object.
summary(micusp_corpus)

# Our data comes from multiple disciplines, which are indicated in the file names.
# What if we want to compare the type-to-token ratios from the disciplines?
# There are a variety of ways to subset data in R. For columns containing numeric and categorical variables, "subset" is easy and useful.
# Here, however, we are going to work on subsetting within our corpus object using quanteda's framework.
# First, we can assign new metadata to our corpus using the "docvars" (document variables) function.
# In this case we tell it to add the field "Discipline" to "micusp_corpus".
# The next part is based on regular expressions. The expression says find the first three letters in the "doc_id" and copy those into the new column.
docvars(micusp_corpus, "Discipline") <- gsub("(\\w{3})\\..*?$", "\\1", rownames(micusp_corpus$documents))

# Checking the summary again, we see the new "Discipline" column.
summary(micusp_corpus)

# Let's begin with some of the things we did using basic functions in "text_intro.R".
# With quanteda, however, we can use built-in functions to make our work much easier.
# We'll start by creating a "document-feature matrix".
# This function creates a data object that stores information we can then extract in various ways.
# Notice the various arguments we can use.
# These give us control over what exactly we want to count.
# We'll call our data object "micusp_dfm".
micusp_dfm <- dfm(micusp_corpus, groups = "Discipline", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# Let's look at the 10 most frequent words in "micusp_dfm" by calling "textstat_frequency".
textstat_frequency(micusp_dfm, n=10)

# We can also coerce those frequency counts into a data frame.
micusp_counts <- as.data.frame(textstat_frequency(micusp_dfm))

# We can also calculate a normalized frequency just as we did in "text_intro.R".
# First we calculate and store the total number of words in our corpus.
total_words <- sum(micusp_counts$frequency)

# We can use the same "normalize" function.
# Though this one normalizes per 1 million words.
normalize <- function (x) {
  normal <- (x/total_words)*1000000
  normal <- round(normal, 2)
  return(normal)
}

# Use "mapply" to iterate down the frequency column.
frequency_norm <- mapply(normalize, micusp_counts$frequency)

# Append the result to our data frame.
micusp_counts$frequency_norm <- frequency_norm

# Check the result.
View(micusp_counts)

# We might be interested in knowing what words distinguish one group from another.
# In our corpus, groups are defined by discipline.
# Using the "textstat_keyness" function, we can, for example, compare English to the rest of the corpus.
# Here we can generate the top 10 keywords.
# We can select a variety of measures, but "lr" is log-likelihood (with a Williams correction as a default).
head(textstat_keyness(micusp_dfm, target = "ENG", measure = "lr"), 10)

# Or we could use biology as our target.
head(textstat_keyness(micusp_dfm, target = "BIO", measure = "lr"), 10)

# What if we wanted to compare English and biology specifically?
# To do that, we would create a separate sub-corpus using the "corpus_subset" function.
micusp_compare <- corpus_subset(micusp_corpus, Discipline %in% c("BIO", "ENG"))

# Again, we need to create a document-feature matrix.
compare_dfm <- dfm(micusp_compare, groups = "Discipline", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# Let's look at the top 10 keywords.
# Note the count in the reference corpus ("n_reference") is much smaller than when we did this the first time.
# That's because our reference corpus consists only of biology texts, rather than texts from all other disciplines.
# Interestingly, though, the list is quite similar.
head(textstat_keyness(compare_dfm, target = "ENG", measure = "lr"), 10)

# And for biology.
head(textstat_keyness(compare_dfm, target = "BIO", measure = "lr"), 10)

# If we want to save our keyness data, we can create a data frame.
micusp_keyness <- as.data.frame(textstat_keyness(compare_dfm, target = "ENG", measure = "lr"))

# Check the result.
View(micusp_keyness)

# Note the scientific notation in our p-values column and the many decimal places in the keyness ("G2") column.
# We're going to convert these columns.
# The first to 5 decimal places and the other to 2.
micusp_keyness[,"p"] <- round(micusp_keyness[,"p"],5)
micusp_keyness[,"G2"] <- round(micusp_keyness[,"G2"],2)

# Check the result.
View(micusp_keyness)

# Save the table to the "output" folder.
write.csv(micusp_keyness, file="output/micusp_keyness.csv", row.names = FALSE)
The output should be a table that looks like this one:
feature | G2 | p | n_target | n_reference |
---|---|---|---|---|
his | 396.62 | 0 | 281 | 12 |
he | 317.19 | 0 | 223 | 9 |
she | 163.48 | 0 | 113 | 4 |
her | 154.96 | 0 | 133 | 13 |
him | 115.72 | 0 | 73 | 1 |
pardoner | 108.04 | 0 | 63 | 0 |
chaucer | 102.89 | 0 | 60 | 0 |
as | 97.81 | 0 | 266 | 131 |
god | 89.16 | 0 | 52 | 0 |
jewish | 75.43 | 0 | 44 | 0 |
# Now let's go back to our corpus summary.
summary(micusp_corpus, n=10)

# As we've done with other data, we can coerce the summary into a data frame.
# As we have more than 100 texts, we set "n" (number) greater than what we have (170).
# Otherwise, the data frame will only include the first 100 texts.
micusp_summary <- as.data.frame(summary(micusp_corpus, n=200))

# Check the result.
View(micusp_summary)

# As we did in "text_intro.R", we can return the total word count using "sum".
sum(micusp_summary$Tokens)

# We can also get a token count by discipline.
# There are a number of ways of doing this, but one is simply to specify an additional attribute.
# Here we are specifying values in the "Discipline" column.
sum(micusp_summary$Tokens[micusp_summary$Discipline == "BIO"])

# Or we can use the aggregate function to sum "Tokens" by the "Discipline" variable.
aggregate(Tokens ~ Discipline, micusp_summary, sum)

# Similarly, we could calculate the average number of tokens per sentence.
# Here we divide the sum of the "Tokens" column by the sum of the "Sentences" column.
sum(micusp_summary$Tokens)/sum(micusp_summary$Sentences)

# What if we wanted to calculate other information, like the type-to-token ratios for each document?
# For that, we can just create our own, very simple function, which we name "simple_ratio".
# This is almost identical to the functions we wrote in "text_intro.R".
# However, note that we're requiring 2 arguments -- "x" and "y".
simple_ratio <- function(x,y){
  ratio <- x/y
  ratio <- round(ratio, 2)
  return(ratio)
}

# Now we want to iterate through our data row by row. We want to "apply" the function.
# Again, just as we did in "text_intro.R", we'll use "mapply".
TypeToken <- mapply(simple_ratio, micusp_summary$Types, micusp_summary$Tokens)

# The resulting vector we can easily append to our "micusp_summary" data frame.
micusp_summary$TypeToken <- TypeToken

# Check the result.
View(micusp_summary)

# Let's create a basic boxplot.
# We'll specify the data frame we're using ("micusp_summary").
# We'll also specify the x-axis (the "Discipline" column) and the y-axis (the "TypeToken" column).
# Finally, we'll tell ggplot2 which type of plot to generate (a box plot) and the theme to use (the minimal theme).
# Note the plus signs at the end of the lines.
# This tells R that more functions are to come, even though there is a closing parenthesis.
# The plot will appear in your "Plots" space on the bottom right.
ggplot(micusp_summary, aes(x=Discipline, y=TypeToken)) +
  geom_boxplot() +
  theme_minimal()

# Note that the ordering along the x-axis is alphabetical in our first plot.
# Let's recreate the plot, but this time using "reorder" to arrange the x-axis.
# We're telling ggplot2 to reorder "Discipline" by "TypeToken" using the function ("FUN") median.
ggplot(micusp_summary, aes(x=reorder(Discipline, TypeToken, FUN = median), y=TypeToken)) +
  geom_boxplot() +
  theme_minimal()

# This is better.
# But we can also change the axis labels.
# And we will store this one as an object (called "micusp_boxplot") for output.
micusp_boxplot <- ggplot(micusp_summary, aes(x=reorder(Discipline, TypeToken, FUN = median), y=TypeToken)) +
  geom_boxplot() +
  xlab("Discipline") +
  ylab("Type-to-Token Ratio") +
  theme_minimal()

# Check the plot.
micusp_boxplot

# Save the plot to the "output" folder.
ggsave("output/micusp_boxplot.png", plot = micusp_boxplot, width=8, height=5, dpi=300)
Our first plot should look like this:
This is a box plot showing the distribution of type-to-token ratios in each discipline and arranged by their medians (from the lowest in Nursing to the highest in Natural Resources & Environment).
# Now we can calculate the mean type-to-token ratios for each discipline.
# And we'll save those to an object we'll call "micusp_tt".
micusp_tt <- aggregate(TypeToken ~ Discipline, micusp_summary, mean)

# Check the result.
micusp_tt

# Let's use our ratio function again.
# This time we'll use it to calculate sentence length.
SentenceLength <- mapply(simple_ratio, micusp_summary$Tokens, micusp_summary$Sentences)

# Again, the resulting vector can be appended to our "micusp_summary" data frame.
micusp_summary$SentenceLength <- SentenceLength

# Check the result.
View(micusp_summary)

# Using the "aggregate" function, we can calculate the mean sentence length by discipline.
micusp_sl <- aggregate(SentenceLength ~ Discipline, micusp_summary, mean)

# Another very useful function for combining data is "merge".
# "Merge" takes arguments telling it the objects you want to put together and a "by" argument.
# The result is a new data frame we're calling "micusp_means".
micusp_means <- merge(micusp_sl, micusp_tt, by = "Discipline")

# Check the result.
View(micusp_means)

# To create a barplot, we use the "ggplot" function.
# For aesthetics ("aes"), we specify the x and y axes.
# Here, again, we use "reorder" for the x-axis.
# This time to arrange the "Discipline" categories by "SentenceLength".
# We also specify the colors of the bar outline and fill.
micusp_barplot <- ggplot(micusp_means, aes(x = reorder(Discipline, SentenceLength), y = SentenceLength)) +
  geom_bar(colour="black", fill = "steelblue", stat="identity") +
  xlab("Discipline") +
  ylab("Mean Sentence Length") +
  theme_minimal()

# Check the plot.
micusp_barplot

# Save the plot to the "output" folder.
ggsave("output/micusp_barplot.png", plot = micusp_barplot, width=8, height=4, dpi=300)
Our next plot should look like this:
This is a simple bar plot showing the mean sentence length for each discipline, arranged from the lowest (in Civil & Environmental Engineering) to the highest (in History).
# Let's make one more plot using our "micusp_means" data frame.
# This one will be a scatter plot.
# We will plot mean sentence length along the x-axis and mean type-to-token ratio along the y-axis.
# Each point on the plot will represent a discipline.
# For those points, we will create a label (from the "Discipline" column).
micusp_scatterplot <- ggplot(micusp_means, aes(x = SentenceLength, y = TypeToken, label = Discipline)) +
  geom_point(color="tomato") +
  geom_text(aes(label=Discipline), hjust=.5, vjust=-1) +
  xlab("Mean Sentence Length") +
  ylab("Mean Type-to-Token Ratio") +
  theme_minimal()

# Check the plot.
micusp_scatterplot

# Save the plot to the "output" folder.
ggsave("output/micusp_scatterplot.png", plot = micusp_scatterplot, width=8, height=6, dpi=300)
The third plot should look like this:
This is a scatter plot with a label added for each point. The plot doesn’t suggest any linear relationship between the measures, but it does show some potentially interesting groupings of disciplines.
# For our last plot, we're going to create a dendrogram.
# The plot will show similarities among disciplines based on word frequencies.
# First we need to normalize the counts using the "dfm_weight" function.
# And based on those, we can get distances using "textstat_dist".
micusp_dist <- textstat_dist(dfm_weight(micusp_dfm, "prop"))

# From that distance object, we can use the basic function "hclust" to create a hierarchical cluster.
micusp_cluster <- hclust(micusp_dist)

# Now we can generate the dendrogram.
plot(micusp_cluster, xlab = "", sub = "", main = "Euclidean Distance on Normalized Token Frequency")

# To save the plot to the "output" folder, run these 3 lines together.
png(filename = "output/micusp_dendrogram.png", width = 8, height = 5, units = "in", res = 300)
plot(micusp_cluster, xlab = "", sub = "", main = "Euclidean Distance on Normalized Token Frequency")
dev.off()
The final plot should look like this:
This is a dendrogram (see also here). It groups or clusters the disciplines according to their normalized word frequencies. On the left, for example, we have a cluster of Education, Nursing, and Sociology. In the center, a cluster of English and Philosophy.
We are using a very small set of data (only ten papers per discipline). The results, however, are at least intriguing. For some more applications of clustering see Pamphlet 1 from Franco Moretti’s Literary Lab. In it, they use DocuScope and hierarchical cluster analysis to locate patterns in literary texts.
Before opening the next script, use the broom icon (upper right and illustrated at the end of the text_intro.R section) to clear your data environment. Use a similar broom to clear the plots (bottom right).
sentiment_intro.R
The third script applies sentiment analysis to literary texts. Sentiment analysis measures the emotional content of texts. Often this is done on a simple positive-to-negative scale. Sometimes it measures sentiment according to semantic domains like “joy” and “fear”. We’re going to do a little of both.
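As a tiny preview (with invented sentences, and assuming the syuzhet package introduced below is loaded), scoring sentences on a positive-to-negative scale looks like this:

library(syuzhet)
get_sentiment(c("What a wonderful, happy day.", "This is a dreadful, miserable failure."))
# returns one numeric score per sentence; with the default lexicon the first should be positive, the second negative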
One of the purposes here is to demonstrate some of the possibilities for analysis beyond the familiar counting of words or phrases. (Not that there’s anything wrong with counting words!)
This script makes use of the syuzhet package, which was developed by Matthew Jockers. We will learn how to:
- Segment a text into sentences
- Measure the semantic content of those sentences
- Plot changes in sentiment over narrative time
- Compare changes in the sentiment trajectories of two novels
- Compare the changes in different types of sentiment in the same novel
For this, we will be using Mary Shelley’s Frankenstein, Jane Austen’s Pride and Prejudice, and Emily Brontë‘s Wuthering Heights.
library(syuzhet)
library(ggplot2)

# Run this line if you have a Mac OS to set your working directory to the "wf_workshop" folder.
setwd("~/Documents/wf_workshop")

# Run this line if you have a Windows OS to set your working directory to the "wf_workshop" folder.
setwd("~/wf_workshop")

# First, we need to read in the text.
frankenstein <- get_text_as_string("data/literature/Shelley_Frankenstein.txt")

# Next we need to parse the text into sentences.
frankenstein_v <- get_sentences(frankenstein)

# Then we calculate a sentiment value for each sentence using the "get_sentiment" function.
frankenstein_sentiment <- get_sentiment(frankenstein_v)

# Now run these 3 lines together to write a plot to the "output" folder.
# The plot measures sentiment from 1 (positive) to -1 (negative) along the y-axis.
# And narrative time along the x-axis (by sentence in the first plot and by normalized time in the second).
png(filename = "output/frankenstein_sentiment_simple.png", width = 7, height = 7, units = "in", res = 300)
simple_plot(frankenstein_sentiment)
dev.off()
This first plot is one that is generated by the syuzhet package:
The plot shows the sentiment falling from maximally positive to negative over the course of the novel with a fluctuation close to the midway point.
# The plot we generated above shows a trajectory that has been smoothed to eliminate some of the noise in the raw data.
# To access those values, we need to use the "get_dct_transform" function -- a discrete cosine transformation.
# In the arguments, we've specified that we want 100 time units and to scale our sentiment range (from 1 to -1).
frankenstein_dct <- get_dct_transform(frankenstein_sentiment, low_pass_size = 5, x_reverse_len = 100, scale_vals = FALSE, scale_range = TRUE)

# We can put that into a data frame, and call the column "dct".
frankenstein_df <- data.frame(dct = frankenstein_dct)

# And we can add a column called "narrative_time", which simply contains the numbers 1-100 present in our row names.
frankenstein_df$narrative_time <- as.numeric(row.names(frankenstein_df))

# Check our data frame.
View(frankenstein_df)

# Using that data frame, we can now create a plot just like the bottom one created by "simple_plot".
# This is useful because we now have access to the full functionality of ggplot2.
# We can customize our plot in whatever way we want.
# Moreover, we can more easily compare the sentiment from multiple works, which we will do shortly.
sentimentplot_frankenstein <- ggplot(data=frankenstein_df, aes(x=narrative_time, y=dct, group=1)) +
  geom_line(colour= "tomato") +
  xlab("Normalized Narrative Time") +
  ylab("Scaled Sentiment") +
  theme_minimal()

# Let's look at our plot.
sentimentplot_frankenstein

# Now, let's save it to our "output" folder.
ggsave("output/frankenstein_sentiment_dct.png", plot=sentimentplot_frankenstein, width=8, height=3, dpi=300)
The second plot is a recreation of the bottom part of the first one:
Rather than using the built-in “simple_plot” function, we generate this one with ggplot2. Doing so gives us much greater flexibility over what we want the plot to look like. Moreover, in ggplot2, we can add as many additional lines to the chart as we want. Thus, with this technique we can look at the trends in multiple texts, not just one.
# We can use a similar set of steps to create a plot comparing 2 novels.
# For our comparison, we'll use Pride & Prejudice and Wuthering Heights.
# We'll read in P & P, parse the text by sentence and get our sentiment measurements.
# Just run all 3 lines.
pride_prejudice <- get_text_as_string("data/literature/Austen_PrideAndPrejudice.txt")
pride_prejudice_v <- get_sentences(pride_prejudice)
pride_prejudice_sentiment <- get_sentiment(pride_prejudice_v)

# Do the same for Wuthering Heights.
wuthering_heights <- get_text_as_string("data/literature/Bronte_WutheringHeights.txt")
wuthering_heights_v <- get_sentences(wuthering_heights)
wuthering_heights_sentiment <- get_sentiment(wuthering_heights_v)

# Calculate the transformed values and normalize the time scale.
pride_prejudice_dct <- get_dct_transform(pride_prejudice_sentiment, low_pass_size = 5, x_reverse_len = 100, scale_vals = FALSE, scale_range = TRUE)
wuthering_heights_dct <- get_dct_transform(wuthering_heights_sentiment, low_pass_size = 5, x_reverse_len = 100, scale_vals = FALSE, scale_range = TRUE)

# Create a data frame for P & P.
# Note the 3rd line of code here.
# That simply creates a 3rd column called "novel", which repeats the same value "pp".
# The purpose of that column is to create a variable that we can use as a category differentiating one novel from the other.
pride_prejudice_df <- data.frame(dct = pride_prejudice_dct)
pride_prejudice_df$narrative_time <- as.numeric(row.names(pride_prejudice_df))
pride_prejudice_df$novel <- rep("pp",len=100)

# Make a similar data frame for WH.
wuthering_heights_df <- data.frame(dct = wuthering_heights_dct)
wuthering_heights_df$narrative_time <- as.numeric(row.names(wuthering_heights_df))
wuthering_heights_df$novel <- rep("wh",len=100)

# Now let's put those together in a single data frame.
# The function "rbind" combines the data frames row-wise.
compare_df <- rbind(pride_prejudice_df, wuthering_heights_df)

# Check the result.
View(compare_df)

# Now we can use that data frame to generate our plot.
# Note that in the aesthetic ("aes") arguments, we need to specify our categorical variable or "group".
# We can also specify the categorical variable that we want to assign to "colour".
# In this case, the "group" and "colour" are the same.
sentimentplot_compare <- ggplot(data=compare_df, aes(x=narrative_time, y=dct, group=novel, colour = novel)) +
  geom_line() +
  xlab("Normalized Narrative Time") +
  ylab("Scaled Sentiment") +
  scale_color_manual(values=c("tomato", "steelblue"), name="Novel", labels=c("Pride and Prejudice", "Wuthering Heights")) +
  theme_minimal()

# Check the plot.
sentimentplot_compare

# And save it to our "output" folder.
ggsave("output/compare_sentiment_dct.png", plot=sentimentplot_compare, width=8, height=3, dpi=300)
This plot uses the same technique as the preceding one to compare the sentiment in Pride and Prejudice with that in Wuthering Heights:
Both novels seem to show greater fluctuations in sentiment than what we saw in Frankenstein, and at slightly offset intervals.
# In addition to looking at fluctuations in overall sentiment, we can measure specific emotional categories like anger and joy.
# This requires us to apply the "get_nrc_sentiment" function to our parsed text.
# The function relies on the NRC (National Research Council) Sentiment Lexicon or EmoLex.
# This process takes a couple of minutes.
frankenstein_nrc_full <- get_nrc_sentiment(frankenstein_v)

# From that result, we'll calculate the percentage of each category to the total sentiment.
# For this, we take the sums of the first 8 columns and use the "prop.table" function to calculate their percentages.
# We coerce the result into a data frame, calling the column "percentage".
frankenstein_nrc_prop <- data.frame(percentage=colSums(prop.table(frankenstein_nrc_full[, 1:8])))

# Check the result.
View(frankenstein_nrc_prop)

# Let's make the row names a column called "emotion" for plotting.
frankenstein_nrc_prop$emotion <- row.names(frankenstein_nrc_prop)

# Reorder the data by the "percentage" column.
frankenstein_nrc_prop$emotion <- factor(frankenstein_nrc_prop$emotion, levels = frankenstein_nrc_prop$emotion[order(frankenstein_nrc_prop$percentage)])

# From the resulting data frame, we can generate a bar plot.
# Note that we're using "coord_flip" to rotate the plot sideways.
nrc_plot <- ggplot(data=frankenstein_nrc_prop, aes(x=emotion, y=percentage)) +
  geom_bar(stat= "identity") +
  xlab("") +
  ylab("Percentage") +
  coord_flip() +
  theme_minimal()

# View the plot.
nrc_plot

# And save it to our "output" folder.
ggsave("output/frankenstein_sentiment_nrc.png", plot=nrc_plot, width=8, height=5, dpi=300)
This bar plot is generated using the National Research Council of Canada’s EmoLex:
It shows the percentage of 8 different types of emotion in Frankenstein. In addition to calculating the contribution of each category to the total, we can combine this data with the technique for plotting changes in sentiment over narrative time.
# Let's make one final plot from this data.
# The most common emotions in Frankenstein, according to the NRC Lexicon, are "trust" and "fear".
# So let's extract those two measurements in the 8th and 4th columns.
frankenstein_trust <- frankenstein_nrc_full[, 8:8]
frankenstein_fear <- frankenstein_nrc_full[, 4:4]

# As we did with the basic sentiment plot, we can use the "get_dct_transform" function.
# This will smooth, normalize and scale our 2 measurements.
frankenstein_trust_dct <- get_dct_transform(frankenstein_trust, low_pass_size = 5, x_reverse_len = 100, scale_vals = FALSE, scale_range = TRUE)
frankenstein_fear_dct <- get_dct_transform(frankenstein_fear, low_pass_size = 5, x_reverse_len = 100, scale_vals = FALSE, scale_range = TRUE)

# Just as we've done previously, we can coerce the results into data frames.
# And we add a categorical variable column we label "emotion".
trust_df <- data.frame(dct = frankenstein_trust_dct)
trust_df$narrative_time <- as.numeric(row.names(trust_df))
trust_df$emotion <- rep("trust",len=100)

fear_df <- data.frame(dct = frankenstein_fear_dct)
fear_df$narrative_time <- as.numeric(row.names(fear_df))
fear_df$emotion <- rep("fear",len=100)

# Join the 2 data frames together.
compare_emotions <- rbind(trust_df, fear_df)

# From that data frame we can generate our plot.
emotions_compare <- ggplot(data=compare_emotions, aes(x=narrative_time, y=dct, group=emotion, colour = emotion)) +
  geom_line() +
  xlab("Normalized Narrative Time") +
  ylab("Scaled Sentiment") +
  scale_color_manual(values=c("tomato", "steelblue"), name="Emotion", labels=c("Fear", "Trust")) +
  theme_minimal()

# Check the plot.
emotions_compare

# And save it to our "output" folder.
ggsave("output/frankenstein_compare_nrc.png", plot=emotions_compare, width=8, height=4, dpi=300)
This final plot looks similar to the Pride and Prejudice/Wuthering Heights comparison:
This one, however, shows the relationship between “fear” and “trust” (the two most common emotions in Frankenstein) as the novel progresses. Broadly as “trust” declines, “fear” rises.
Final thoughts
These scripts cover a lot of ground. The purpose is not to have you understand their workings at every step, but rather to give you a foothold. Now that you’ve worked with them a little, you’re free to build from them, take them apart, or use them as a germ for your own projects.
It takes time to develop expertise. But you don’t need to be an expert in writing code to do interesting and rigorous work with R. As part of that process, I hope that you begin to see programming less as a black box, and more as a set of tools with strengths, but also limitations.
Learning the practical, concrete skills of coding in R, in fact, can help to clarify some of those limitations. The “syuzhet” package is, itself, an interesting case study (see here and here).
For further reading:
Stanford Literary Lab Pamphlets
Macroanalysis by Matthew Jockers
Applying corpus methods to written academic texts: Explorations of MICUSP