Building a Tweet Sentiment Corpus

I’ve had quite a few students who are interested in using Twitter data for corpus analysis. I understand why. Twitter is widespread, and it’s a relatively new communicative technology. New technologies have often been drivers of language change. So this is an inviting source of data.

But even if we decide, “Yes! Twitter is for me!” collecting the data and putting it into a format that is friendly to commonly used tools like AntConc present some very real challenges.

Given both the possibilities and the potential hurdles, I thought I’d walk through a coffee break experiment I did this morning. First, I want to share the results I generated and what I think they tell us. Then, I’ll show you how I got there.

One of the areas of corpus research that has really taken off lately is sentiment analysis of social media data (like Twitter). Sentiment analysis attempts to measure attitudes: How do we feel about that movie? Was it good? Lousy? To do so, a positive or negative score is assigned to a text. And it’s easy to understand why this kind of analysis is popular. It provides people with an interest in how things are being talked about in the wider world (like marketers) a number with an air of scientific objectivity that says, “Hey, this is getting great buzz!” or maybe, “We had better rethink this…”

So I thought I would give this a try (again, more about the specifics in a minute). Given all of the news regarding the draft of Michael Sam, the first openly gay player to be drafted by the NFL, I thought that I would build a corpus of tweets containing the hashtag #MichaelSam and apply sentiment analysis. These are my results.


The technique I used assigns a score of 1 to 5 for positive sentiment (5 being the most positive) and -1 to -5 for negative sentiment (-5 being the most negative). What I got was this rather perfect bell curve. I was a bit surprised by the number of neutral tweets, but many of them are like this one:

2014 #NFL Draft: Decision day for Michael Sam

But others that scored a 0 were like these:

Men kiss men every day, all over the world.  Get. Over. It. #MichaelSam

I’m Not Forcing You To Be A Christian Don’t Force Me To Be Okay With Homosexuality Stop Shoving it down People’s Throats #MichaelSam #ESPN

These examples raise some interesting questions about how sentiment analysis works.

The particular technique I used is lexical (as most sentiment analysis is, though some researchers are working out alternatives). It applies lists of positive and negative words to the texts in the corpus and calculates its score. This raises a number of problems.
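To make the mechanics concrete, here is a minimal sketch of that word counting in R. This is my own simplified illustration, not the UNT script itself, and the tiny word lists are stand-ins for the full opinion lexicon:

```r
# Stand-in word lists; the real analysis uses the full Hu & Liu
# opinion lexicon (thousands of entries each).
pos.words <- c("good", "great", "love", "proud")
neg.words <- c("bad", "lousy", "hate", "afraid")

score.sentiment <- function(text, pos.words, neg.words) {
  # Strip punctuation, lower-case, and split the text into words
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
  # The score is simply (positive matches) minus (negative matches)
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.sentiment("What a great, proud day", pos.words, neg.words)  # 2
score.sentiment("That movie was lousy", pos.words, neg.words)     # -1
```

Notice that the function knows nothing about context: it only counts matches, which is exactly why the problems below arise.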

The first is polysemy. Words, of course, have multiple meanings and treating those meanings as a unified class can erase significant ambiguities. This is particularly true with contranyms (words that have opposite meanings). Contranyms are common in slang, where negative words (like bad) take on new, positive associations. This is one of my favorite examples from the Guardian.


Another issue arises from the creative respelling and punctuation that are important in technologically mediated communication. In one of the examples above, the author’s stance is communicated through the phrase “get over it,” and that stance is amplified by the insertion of periods between each word. That emphasis is going to be overlooked if we are only measuring words.

Finally, there is the difficulty of locating the object of sentiment. While a tweet may express a negative sentiment, and that tweet may contain #MichaelSam, that negative sentiment may be directed at something else. Here are a couple of examples:

People used to discriminate blacks and now people want to discriminate gays. Everyone’s welcome in the future. Just let it be. #MichaelSam

Why are parents struggling to explain #MichaelSam to their kids? In 2014 are we that afraid of homosexuality still? #GrowUpAmerica

In both of these it’s easy to see what words are being counted as negative. However, the negativity is being directed at a perceived discriminatory culture rather than at Michael Sam or the events at the draft. In that way, we might be inclined to classify them as positive rather than negative.

Sentiment analysis clearly presents challenges if you’re interested in this kind of work.

Now, how did I build my corpus? It’s not too difficult, but you’ll need to use R — a programming language that is often used in corpus linguistics. The good news is that you DO NOT need to know how to write code. You just need to follow the steps and make use of resources that have already been developed.

I based my analysis on resources provided by the Information Research and Analysis Lab at the University of North Texas. The link has tutorial videos and an R script that you can download.

You will also need RStudio (which is a programming interface), the opinion lexicon from the University of Illinois Chicago, and you will need to sign up with the Twitter API. This last bit may sound technical, but it isn’t really. Twitter doesn’t let unidentified users grab its data the way it used to. You just need to sign in with a Twitter account and click a button to create an App. (You can call it anything; it doesn’t matter.) Then, you are provided keys, a series of letters and numbers that give you access to the Twitter stream. Don’t worry, the video tutorials will walk you through the process.

Finally, a couple of tips if you’ve never seen or worked with code before. First, the videos will have you assemble three corpora about baseball. A snippet of the code looks like this:

Rangers.list <- searchTwitter('#Rangers', n=1000, cainfo="cacert.pem")
Rangers.df = twListToDF(Rangers.list)
write.csv(Rangers.df, file='C:/temp/RangersTweets.csv', row.names=F)

See #Rangers between the single quotes in line 1? That is telling your code to search the Twitter stream for tweets with that hashtag. Once you go through the process and get the hang of it, you can change what’s between the quotes to whatever you want. Be aware, however, that you must have something there. That argument is required.
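For example, to build the #MichaelSam corpus I described above, the only change needed is the search term (the object names and the output file name here are just my choices; this will only run once you have authenticated with the Twitter API):

```r
# Same pattern as the tutorial code, with the hashtag swapped out.
MichaelSam.list <- searchTwitter('#MichaelSam', n=1000, cainfo="cacert.pem")
MichaelSam.df <- twListToDF(MichaelSam.list)
write.csv(MichaelSam.df, file='C:/temp/MichaelSamTweets.csv', row.names=F)
```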

Also, you can’t just run the code without doing anything. You will need to tell R where to find things on your computer and where to put them. This requires defining what is called the “path.” The path just identifies the location of a file. You can see a path in line 3 above between the single quotation marks after “file=”. You can also see it in the example below in lines 2 and 3 between the single quotation marks starting with “C:/”.

#Load sentiment word lists
hu.liu.pos = scan('C:/temp/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('C:/temp/negative-words.txt', what='character', comment.char=';')

You are going to have to edit these paths in order to successfully execute the code. Fortunately, this is easy. To find the location of a file on Windows, right-click on the file and select “Properties.” The path is listed there, and I’ve highlighted it in yellow.
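If you want to check that scan() and your path are working before running the whole script, you can try it on a small stand-in file. The snippet below creates one in R’s temporary directory, since I can’t know where the files live on your machine; in the real script you would point scan() at wherever you saved positive-words.txt:

```r
# Build a small stand-in for positive-words.txt. The real Hu & Liu
# files begin with comment lines marked by ";", which is why the
# script passes comment.char=';' to scan().
path <- file.path(tempdir(), "positive-words.txt")
writeLines(c("; a comment line, skipped by scan()",
             "good", "great", "love"), path)

pos.words <- scan(path, what='character', comment.char=';')
pos.words  # "good" "great" "love"
```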


On a Mac, control+click on the file and select “Get Information.” Again, I’ve highlighted the path in yellow.


Note that with Windows, the path (as long as the file is on your hard drive) will start with “C:”. That is just the letter designation that Windows assigns to your main drive. On a Mac, this is different. There is no letter.

In either case, you can just copy the path and paste it into the code. The copied path, however, will not contain the name of the file itself. So you will need to type in a slash (/) and the name of the file: RangersTweets.csv, positive-words.txt, or whatever.
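If typing slashes by hand makes you nervous, R’s file.path() function will glue the folder and file name together for you. A quick illustration (the folder here is just an example):

```r
# Paste a copied folder path and a file name together with a slash.
folder <- "C:/temp"                          # what you copied from Properties
full   <- file.path(folder, "RangersTweets.csv")
full    # "C:/temp/RangersTweets.csv"
```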

Otherwise, you should be able to just follow along with the tutorial. If something gets messed up, you can re-download the code and start fresh.