Building a Tweet Sentiment Corpus

May 13, 2014

I’ve had quite a few students who are interested in using Twitter data for corpus analysis. I understand why. Twitter is widespread, and it’s a relatively new communicative technology. New technologies have often been drivers of language change. So this is an inviting source of data.

But even if we decide, “Yes! Twitter is for me!” collecting the data and putting it into a format that is friendly for commonly used tools like AntConc presents some very real challenges.

Given both the possibilities and the potential hurdles, I thought I’d walk through a coffee break experiment I did this morning. First, I want to share the results I generated and what I think they tell us. Then, I’ll show you how I got there.

One of the areas of corpus research that has really taken off lately is sentiment analysis of social media data (like Twitter). Sentiment analysis attempts to measure attitudes: How do we feel about that movie? Was it good? Lousy? To do so, a score, positive or negative, is assigned to a text. And it’s easy to understand why this kind of analysis is popular. It provides people interested in how things are being talked about in the wider world (like marketers) with a number that has an air of scientific objectivity and says, “Hey, this is getting great buzz!” or maybe, “We had better rethink this…”

So I thought I would give this a try (again, more about the specifics in a minute). Given all of the news regarding the draft of Michael Sam, the first openly gay player to be drafted by the NFL, I thought I would build a corpus of tweets containing the hashtag #MichaelSam and apply sentiment analysis. These are my results.


The technique I used assigns a score of 1 to 5 for positive sentiment (5 being the most positive) and -1 to -5 for negative sentiment (-5 being the most negative). What I got was this rather perfect bell curve. I was a bit surprised by the number of neutral tweets, but many of them are like this one:

2014 #NFL Draft: Decision day for Michael Sam

But others that scored a 0 were like these:

Men kiss men every day, all over the world.  Get. Over. It. #MichaelSam

I’m Not Forcing You To Be A Christian Don’t Force Me To Be Okay With Homosexuality Stop Shoving it down People’s Throats #MichaelSam #ESPN

These examples raise some interesting questions about how sentiment analysis works.

The particular technique I used is lexical (as most sentiment analysis is, though some researchers are working out alternatives). It applies lists of positive and negative words to the texts in the corpus and calculates its score. This raises a number of problems.
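Before turning to those problems, here is roughly what such a lexical scorer does — a minimal sketch in R. The word lists and example texts here are tiny, made-up stand-ins, not the actual lexicon or script used for the analysis above:

```r
# A toy lexical sentiment scorer: count positive words, subtract negative words.
# These word lists are made-up stand-ins for a real opinion lexicon.
pos.words <- c("great", "good", "proud", "historic")
neg.words <- c("bad", "lousy", "awful", "wrong")

score.sentiment <- function(text, pos.words, neg.words) {
  # lowercase the text, strip punctuation, and split on whitespace
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  # the score is the count of positive words minus the count of negative words
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.sentiment("What a great, historic day", pos.words, neg.words)  # 2
score.sentiment("That was a lousy decision", pos.words, neg.words)   # -1
```

Notice that the scorer only sees individual words, which is exactly where the problems below come from.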

The first is polysemy. Words, of course, have multiple meanings and treating those meanings as a unified class can erase significant ambiguities. This is particularly true with contranyms (words that have opposite meanings). Contranyms are common in slang, where negative words (like bad) take on new, positive associations. This is one of my favorite examples from the Guardian.


Another issue arises from the creative respelling and punctuation that are important in technologically-mediated communication. In one of the examples above the author’s stance is communicated through the phrase “get over it,” and that stance is amplified by the insertion of periods between each word. That emphasis is going to be overlooked if we are only measuring words.

Finally, there is the difficulty of locating the object of sentiment. While a tweet may express a negative sentiment, and that tweet may contain #MichaelSam, that negative sentiment may be directed at something else. Here are a couple of examples:

People used to discriminate blacks and now people want to discriminate gays. Everyone’s welcome in the future. Just let it be. #MichaelSam

Why are parents struggling to explain #MichaelSam to their kids? In 2014 are we that afraid of homosexuality still? #GrowUpAmerica

In both of these it’s easy to see what words are being counted as negative. However, the negativity is being directed at a perceived discriminatory culture rather than at Michael Sam or the events at the draft. In that way, we might be inclined to classify them as positive rather than negative.

Sentiment analysis clearly presents challenges if you’re interested in this kind of work.

Now, how did I build my corpus? It’s not too difficult, but you’ll need to use R — a programming language that is often used in corpus linguistics. The good news is that you DO NOT need to know how to write code. You just need to follow the steps and make use of resources that have already been developed.

I based my analysis on resources provided by the Information Research and Analysis Lab at the University of North Texas. The link has tutorial videos and an R script that you can download.

You will also need RStudio (a programming interface for R), the opinion lexicon from the University of Illinois Chicago, and you will need to sign up with the Twitter API. This last bit may sound technical, but it isn’t really. Twitter doesn’t let unidentified users grab its data the way it used to. You just need to sign in with a Twitter account and click a button to create an App. (You can call it anything; it doesn’t matter.) Then, you are provided keys, a series of letters and numbers that gives you access to the Twitter stream. Don’t worry, the video tutorials will walk you through the process.

Finally, a couple of tips if you’ve never seen or worked with code before. First, the videos will have you assemble three corpora about baseball. A snippet of the code looks like this:

Rangers.list <- searchTwitter('#Rangers', n=1000, cainfo="cacert.pem")  # search Twitter for up to 1,000 tweets with this hashtag
Rangers.df <- twListToDF(Rangers.list)  # convert the list of tweets into a data frame
write.csv(Rangers.df, file='C:/temp/RangersTweets.csv', row.names=F)  # save the data frame as a CSV file

See #Rangers in blue between the single quotes in line 1? That is telling your code to search the Twitter stream for tweets with that hashtag. Once you go through the process and get the hang of it, you can change what’s between the quotation marks to whatever you want. Be aware, however, that you must have something there. That search term is required.

Also, you can’t just run the code without doing anything. You will need to tell R where to find things on your computer and where to put them. This requires defining what is called the “path.” The path just identifies the location of a file. You can see a path in line 3 above between the single quotation marks after “file=”. You can also see it in the example below in lines 2 and 3 between the single quotation marks starting with “C:/”.

#Load sentiment word lists
hu.liu.pos = scan('C:/temp/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('C:/temp/negative-words.txt', what='character', comment.char=';')

You are going to have to edit these paths in order to successfully execute the code. Fortunately, this is easy. To find the location of a file on Windows, right-click on the file and select “Properties.” The path is listed there, and I’ve highlighted it in yellow.


On a Mac, control+click on the file and select “Get Information.” Again, I’ve highlighted the path in yellow.


Note that with Windows, the path (as long as the file is on your hard drive) will start with “C:”. That is just the letter designation that Windows assigns to your main drive. On a Mac, this is different. There is no letter.

In either case, you can just copy the path and paste it into the code. (Note that if a copied Windows path uses backslashes (\), you will need to change them to forward slashes (/) for R.) The copied path, however, will not contain the name of the file itself. So you will need to type in a slash (/) and the name of the file: RangersTweets.csv, positive-words.txt, or whatever.
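For example, if you had saved the word lists in a Documents folder on a Mac (a hypothetical location — yours will differ), the edited lines might look like this:

```r
# Hypothetical Mac paths: no drive letter, and the file name is
# typed in after the copied folder path, separated by a slash
hu.liu.pos = scan('/Users/yourname/Documents/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('/Users/yourname/Documents/negative-words.txt', what='character', comment.char=';')
```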

Otherwise, you should be able to just follow along with the tutorial. If something gets messed up, you can re-download the code and start fresh.


What Counts as “Good” Writing in High School?

May 9, 2014

The question posed for this post is not an easy one to answer. But I want to propose an at least partial answer based on some data that a colleague (Laura Aull at Wake Forest) and I have been working on. The data comes from Advanced Placement Literature and Advanced Placement Language exams. Altogether, we have about 400 essays (~300 Language essays and ~100 Literature essays) totaling about 150,000 words.

We digitized the essays and used a tagger to code them for part-of-speech. After tagging the corpus, we ran some statistical analyses on some of the differences between high scoring exams (those receiving a score of 4 or 5) and low scoring exams (those receiving a 1 or 2). I’ve put some of the results in the data visualization below. If you use the mouse to hover over the noun and verb bubbles, you will see their frequencies and their statistical significance.
AP Bubble Chart

Okay. So what are we looking at? The parts-of-speech I’ve put on the graph (the preposition of, pronouns, adjectives, articles, verbs, and nouns) are all unequally distributed in high and low scoring exams. In other words, some are more frequent in high scoring exams; some more frequent in lower scoring exams.

The x-axis shows the frequency of the part-of-speech in high scoring exams, and the y-axis shows the frequency in low scoring exams. The red bubbles are features that distinguish low scoring exams. The blue bubbles are features that distinguish high scoring exams.

For example, nouns are more common in higher scoring exams than in lower scoring ones. By contrast, verbs are more common in lower scoring exams.

Significance here is measured by a chi-square test. The greater the significance, the larger the bubble. But be aware that all of the features shown on this graph are statistically significant. So even though the noun bubble looks relatively small, nouns still reliably differentiate higher scoring exams from lower scoring ones.

Now that we have a handle on what the graph is showing, what does it mean? What does it tell us about what distinguishes writing that gets rewarded on AP exams from writing that doesn’t?

One really interesting pattern is how these features follow the patterns of involved vs. informational production that Biber has described. Basically, involved language is less formal and more conversation-like. It is about personal interaction and often related to a shared context. In that kind of discourse, we tend to be less precise; clauses tend to be shorter; and we use lots of pronouns: “That’s cool, am I right?”

Informational production is very different. It tends to be more carefully planned and, rather than promoting personal interaction, its purpose is to explain and analyze. This is what we often think of when we imagine academic prose: “This juxtaposition of radically different approaches in one book anticipated the coexistence of Naturalism and Realism in American literature of the latter half of the 19th century and the beginning of the 20th century.”

Here is Biber’s list of features of the two dimensions. The red highlights are those features that are more common in lower scoring exams, and the blue highlights are those that are more common in the higher scoring exams:


The patterns are very clear. The lower scoring exams look much more like involved discourse and the high scoring ones more like informational discourse.

This is not entirely a surprise. Mary Schleppegrell, for example, has argued repeatedly that information management is central to students’ success in academic writing. What she means is that students need to control how they grammatically put together their ideas.

Let’s take a look at a sample sentence from a high scoring exam:


In the example, the head nouns are all in red. Information that is added before the noun (like the adjective simple) is in green. Information that is added after the noun (like the prepositional phrase of the cake) is in blue. The pronoun they and the predicative adjective simple are in orange.

Information is communicated through nouns, and is elaborated using adjectives, prepositional phrases, and other structures. This is why we see these features to a much greater degree in those higher scoring essays. In the example, the student is able to communicate a lot of information! That information connects specific textual features from the short story (capitalization and simple description) to the student’s reading of a character’s emotional state.

We can also begin to see why there might be more verbs in the lower scoring essays. The subjects and objects of their clauses are less elaborated. In other words, their clauses are shorter. So the frequency of verbs is much higher.

Another interesting result from our research is the relationship between essay length and score. This is an issue that has received some coverage, particularly regarding the (soon to be retired) essay portion of the SAT. We found that there is a correlation between essay length and score (0.255), but it is not a particularly strong one. My guess is that this correlation arises not only because the more successful essays have more to say, but also because they say it in a more elaborated way.

If you are interested, you can see all of our data here.


November 13, 2013

Radiolab recently had an interesting interview with Daniel Engber. In it, Engber discusses the rise and fall of quicksand as a device in movies. (You can also read his Slate article on the same subject). His data shows a clear peak in the percentage of movies using quicksand in the 1960s (chart via Radiolab):


Engber suggests that the decline in quicksand in movies coincides with a decline in children’s fear of it. It has lost its allure as an image of terror. I was curious, then, if a similar pattern is evident in language. COCA shows the following distribution for quicksand over time:


The data seem to show an earlier peak (in the 1900s and 1910s), followed by a dip, then a steady decline after 1940. One potentially complicating factor here is that quicksand is a relatively rare word, which could affect some of the fluctuations. So I thought this would be a good opportunity to try out BYU’s version of the Google Books data, which Mark Davies discussed at this year’s Studies in the History of the English Language conference. The results are similar, but with a more clearly defined rise and fall:


As with the COCA distribution, the Google data suggest a peak in language use that precedes the peak in movies. The data also show a move toward metaphorization in recent decades. Early in the twentieth century, examples are typically like this one:

There was a ford directly opposite the cantonment, and another, more dangerous, and known to only a few, three miles farther up stream. Keeping well within the water’s edge, so as to thus completely obscure their trail, yet not daring to venture deep for fear of striking quicksand, the plainsman sent his pony struggling forward, until the dim outline of the bank at his right rendered him confident that they had attained the proper point for crossing.

Later, however, we find examples like this one:

Only their husbands believed the kids were their own. He missed his mother, too, whose quicksand love he’d wanted so badly to escape.

Or this one:

It may well be that in responding to recent Congressional language the N.E.A. has begun to have a chilling effect on art in the United States and it may be entering the quicksand of censorship.

Not that quicksand as a physical entity disappears completely in more recent discourse:

Georgie was not popular in the swamp; in fact, the other members of the colony that inhabited this stretch of muck and quicksand, black water and scum, had banished him to the very fringe of the community.

While such examples exist, they are far fewer than the type in which some physical sensation is compared to moving through quicksand or something (like love or censorship) is compared to the ensnaring effects of quicksand itself. Thus, the use of quicksand is not only declining; it is also undergoing a shift in meaning.

Which brings us back to our earlier question: Why does quicksand peak earlier in language than in film? One possibility might be related to the relative durability of particular genres in different media. Adventure stories (like dime novels) had their greatest popularity in the early twentieth century. This genre appears to be an enthusiastic employer of quicksand as a conventional obstacle and threat, and the decline of the genre coincides with the decline of the word. How the rise of quicksand as a cinematic device relates to the rise of particular kinds of movies, I’m not sure, but the relationships among Engber’s data and the linguistic data pose some compelling questions.

Dropping the “I-Word”

April 4, 2013

The Associated Press announced that they are changing their style guide to drop the phrase “illegal immigrant” while retaining phrases like “illegal immigration” and “entering the country illegally.”

Immigrant advocates have been fighting for this change for a long time. Part of the thinking is that the phrase illegal immigrant is essentializing; it links the concept of illegality to the people themselves.

A quick look at COCA shows that in American discourse generally, immigrants are conventionally represented as illegal.

[Table omitted: columns show collocate, frequency, and mutual information]
Collocates for lemmatized immigrant on COCA using a span of 4 to the left and right

It is telling, not only that the frequency of illegal is so much higher, but also that its Mutual Information value is so high. Mutual Information is a measure of association. In a corpus, two words might collocate frequently simply because both words are themselves frequent. Conversely, less frequent words might collocate rarely in raw terms yet be strongly associated: when they appear, they appear together. Mutual Information accounts for these variations and gives us a measure of association: How likely are these words to appear in the same neighborhood? The usual cut-off for significance is MI>3.

So illegal and immigrant (MI=8.30) have a very high degree of association. This is somewhat surprising given that illegal doesn’t seem like a particularly specialized modifier. Undocumented has a higher degree of association (MI=9.42). Its use, however, is far more restricted. It appears only with nouns related to the movement of people across borders: workers, students, aliens, people, migrants, etc.
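For the curious, the calculation behind an MI score can be sketched in R. This is a common formulation of collocational MI (corpus tools differ in details like span handling, so treat it as a sketch), and the counts at the end are made up purely for illustration:

```r
# Pointwise Mutual Information for a node word and a collocate:
# MI = log2( (joint frequency * corpus size) / (node freq * collocate freq * span) )
# where span is the size of the collocation window (e.g., 4 left + 4 right = 8)
mi.score <- function(joint.freq, node.freq, collocate.freq, corpus.size, span = 8) {
  log2((joint.freq * corpus.size) / (node.freq * collocate.freq * span))
}

# Made-up counts for illustration only: a pair that co-occurs often relative
# to the words' individual frequencies gets a high MI score
mi.score(joint.freq = 500, node.freq = 2000, collocate.freq = 10000, corpus.size = 450e6)
```

The intuition is in the ratio: the observed co-occurrence count is compared against what we would expect if the two words were distributed independently.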

Another interesting feature of this debate is how recent the practice of representing immigrants as illegal is. Despite a long history of contentious discourse around immigration in the US, the results from COHA show that this specific construction is quite new.


For an analysis of representations of immigrants and immigration in Britain, see Gabrielatos and Baker (2008).

Identity and the Synthesized Voice

December 22, 2012

On the BBC’s “Click” podcast last week, they discussed evolving voice technologies, and their increasing ability to incorporate complex prosodic and phonological properties like emotional emphasis and accent. It’s an interesting conversation and worth listening to, but the segment that caught my attention touched on Stephen Hawking and the preservation of legacy technologies related to his voice:

[Audio: Stephen Hawking’s Synthesized Voice]

A 2001 Deborah Cameron article on “designer voices” provides some interesting context for Hawking’s choices regarding his synthesized voice:

Hawking faces an unusual choice here. It may seem obvious that a British accent would be more ‘authentic’ than the American one he has had to make do with up until now; this is a question of what linguists call ‘social indexicality’, the ability of the speaking voice to point to social characteristics like age, gender, class, ethnicity or, most saliently in this case, national origin. Yet voices are also privileged markers of individual identity. It would not be unreasonable for Hawking to take the view that his synthesised American voice essentially is his voice, in the sense that it is instantly recognisable as the voice of Stephen Hawking (indeed, so recognisable that Hawking can earn money doing advertising voice-overs). Some media reports noted that the scientist was having difficulty deciding whether to make the shift: he was quoted as saying that ‘it will bring a real identity crisis’.

There are a couple of issues here that I think are compelling. The first is one that both the podcast and Cameron note: Hawking’s choice speaks to the complex and powerful connection between our voice and our identity (or identities). In particular, it suggests that such a relationship can be an evolving one. In his early encounters with his synthesized voice, Hawking experienced a kind of distance from its American-ness. However, he ultimately chose to retain it in the face of other possibilities. And there is no question how recognizable his voice is, inspiring a raft of amateur and professional parodies.

The other interesting issue is what that choice has meant technically. It has necessitated the preservation of legacy technologies in order to maintain a powerful signifier of self. Which makes me wonder. As emerging technologies connect ever more closely to our biologies, our identities and the signals we use to advertise those identities will change, too. Those inter-related changes, technological and indexical, may not move at the same pace, however. What might this mean? Could a voice, an identity, face a forced obsolescence?

Understanding Keyness

November 30, 2012

In order to identify significant differences between 2 corpora or 2 parts of a corpus, we often use a statistical measure called keyness.

Imagine two highly simplified corpora. Each contains only 3 different words (cat, dog, and cow) and has a total of 100 words. The frequency counts are as follows:

  • Corpus A: cat 52; dog 17; cow 31
  • Corpus B: cat 9; dog 40; cow 31

Cat and dog would be key, as they are distributed differently across the corpora, but cow would not as its distribution is the same. Put another way, cat and dog are distinguishing features of the corpora; cow is not.
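If you want to see this in action, here is a quick sketch in R using the toy frequencies above. Base R’s chisq.test() does the work; cat and dog drive the statistic while cow contributes nothing:

```r
# Toy corpora: frequencies of cat, dog, and cow (each corpus totals 100 words)
freqs <- matrix(c(52, 17, 31,    # Corpus A
                  9, 40, 31),    # Corpus B
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Corpus A", "Corpus B"), c("cat", "dog", "cow")))

# The chi-square statistic is driven entirely by cat and dog;
# cow's identical distribution contributes zero
chisq.test(freqs)$statistic
```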

Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.

There are 2 common methods for calculating distributional differences: a chi-squared test (or χ² test) and log-likelihood. AntConc, for example, gives you the option of selecting one or the other under “Tool Preferences” for Keyword settings:

For this post, I’m going to concentrate on the chi-squared test. Lancaster University has a nice explanation of log-likelihood here.

A chi-squared test measures the significance of an observed versus an expected frequency. To see how this works, I’m going to walk through an example from Rayson et al. (1997) summarized by Baker (2010).

For this example, we’ll consider the distribution by sex of the word lovely. Here is their data:

          lovely    All other words    Total words
Males     414       1714029            1714443
Females   1214      2592238            2593452
Total     1628      4306267            4307895
Table 1: Use of 'lovely' by males and females in the spoken demographic section of the BNC

What we want to determine is the statistical significance of this distribution. To do so, we’ll apply the same formula that they did: a Pearson’s chi-squared test. Here is the formula:

 \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

O is the observed frequency, and E is the expected frequency if the independent variable (in this case sex) had no effect on the distribution of the word. The \sum is the mathematical symbol for ‘sum’. In our example, we need to calculate the sum of (or add together) four values: observed minus expected, squared, divided by expected, for (1) lovely used by males; (2) other words used by males; (3) lovely used by females; and (4) other words used by females.

In other words, our main calculations are for the four cell values: we have a 2×2 contingency table. The totals (the peripheral table values) are what we use to determine our expected frequencies.

To find the expected frequency for a value in our 2×2 table, we use the following formula:

(R * C) / N

That is, the row total (R) times the column total (C) divided by the number of words in the corpus (N).

The expected value of lovely for males is:

(1714443 * 1628) / 4307895 = 647.91

If we then fill out the expected frequencies for the rest of our table, it looks like this:

          lovely    All other words    Total words
Males     647.91    1713795.09         1714443
Females   980.09    2592471.91         2593452
Total     1628      4306267            4307895
Table 2: Expected use of 'lovely' in the spoken demographic section of the BNC

Now we can finish our calculations. For each of our table cells, we need to subtract the expected frequency from the observed frequency; multiply that value by itself; then divide the result by the expected frequency. The calculations for each cell would look like this:

          lovely                                          All other words                                                    Total words
Males     ((414 - 647.91) * (414 - 647.91)) / 647.91      ((1714029 - 1713795.09) * (1714029 - 1713795.09)) / 1713795.09     1714443
Females   ((1214 - 980.09) * (1214 - 980.09)) / 980.09    ((2592238 - 2592471.91) * (2592238 - 2592471.91)) / 2592471.91     2593452
Table 3: Calculating the chi-square values for our frequencies

When we complete those calculations, our contingency table looks like this:

          lovely    All other words
Males     84.45     0.03
Females   55.83     0.02
Table 4: Chi-square values for our frequencies

Now we just add together those four values:

84.45 + 0.03 + 55.83 + 0.02 = 140.3
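If you have R handy, you can verify the whole calculation at once. Base R’s chisq.test(), with Yates’ continuity correction turned off, reproduces the plain Pearson chi-square we just computed by hand:

```r
# Observed frequencies for 'lovely' vs. all other words, by speaker sex
obs <- matrix(c(414, 1714029,     # males: lovely, all other words
                1214, 2592238),   # females: lovely, all other words
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Males", "Females"), c("lovely", "other")))

# correct = FALSE turns off Yates' continuity correction, giving the
# plain Pearson chi-square that matches our hand calculation
chisq.test(obs, correct = FALSE)$statistic  # approximately 140.3
chisq.test(obs, correct = FALSE)$p.value    # vanishingly small
```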

If you want to check our calculations, you can use an online chi-square calculator (one is linked in the resources at the end of this post). The inputs are easy, as such calculators take values that concordancing programs typically generate: word list counts and corpus sizes.

Now the question is: What does this number tell us? Typically, we determine the significance of keyness values in one of two ways. First, sometimes corpus linguists just look at the top key words (maybe the top 20) to explore the most marked differences between two corpora. Second, we can find the p-value.

The p-value is the probability that our observed frequencies are the result of chance. To find the p-value we need to look it up on a published table (or use an online calculator). In either case, we need to know the degrees of freedom. For a contingency table, the degrees of freedom are the number of rows minus one, times the number of columns minus one (not counting totals). For a two-column keyness table like ours, that works out to simply the number of rows minus one.

For us, then, df = 1

We can look up the p-value on the simplified table below, where the degree of freedom is one.

p-value            0.05    0.01    0.001
Critical value     3.84    6.63    10.83
Table 5: Chi-square distribution table showing critical values for df = 1

The critical cutoff point for statistical significance is usually at p<.01 (though it can also be p<.05). So a chi-square value above 6.63 would be considered significant. Our value is 140.3, so the distribution of lovely is highly significant (p<.0001). In other words, the probability that our observed distribution was by chance is approaching zero.

I want to leave you with a few tips, questions, and resources:

  1. Under most circumstances, our concordancing software can calculate keyness values for us; however, if we are interested in multi-word units like phrasal verbs, we can use online chi-square or log-likelihood calculators to determine their keyness.
  2. Typically, our chi-square tests in corpus linguistics will involve a 2×2 contingency table (with a degree of freedom of one); however, this isn’t always the case. We might be interested in, for example, distributions of multiple spellings (e.g., because, cause, cuz, coz) that would involve higher degrees of freedom.
  3. Why would comparing larger corpora tend to produce results with larger chi-square (or keyness) values?
  4. Anatol Stefanowitsch has an excellent entry on chi-square tests here.
  5. There is a nice chi-square calculator here.

Normalizing Word Counts

November 16, 2012

One of the things we often do in corpus linguistics is to compare one corpus (or one part of a corpus) with another. Let’s imagine an example. We have 2 corpora: Corpus A and Corpus B. And we’re interested in the frequency of the word boondoggle. We find 18 occurrences in Corpus A and 47 occurrences in Corpus B. So we make the following chart:

The problem here is that unless Corpus A and Corpus B are exactly the same size, this chart is misleading. It doesn’t accurately reflect the relative frequencies in each corpus. In order to accurately compare corpora (or sub-corpora) of different sizes, we need to normalize their frequencies.

Let’s say Corpus A contains 821,273 words and Corpus B contains 4,337,846 words. Our raw frequencies then are:

Corpus A = 18 per 821,273 words

Corpus B = 47 per 4,337,846 words

To normalize, we want to calculate the frequencies for each per the same number of words. The convention is to calculate per 10,000 words for smaller corpora and per 1,000,000 for larger ones. The Corpus of Contemporary American English, for example, uses per-million calculations in the chart display for comparisons across text-types.

Calculating a normalized frequency is a fairly straightforward process. The equation can be represented in this way:

We have 18 occurrences per 821,273 words, which is the same as x (our normalized frequency) per 1,000,000 words. We can solve for x with simple cross multiplication:

Generalizing then (normalizing per one-million words):
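As a sketch in R, the generalized per-million-words calculation might look like this small helper function (using our hypothetical boondoggle counts):

```r
# Normalized frequency: occurrences per million words by default
normalize <- function(raw.freq, corpus.size, base = 1e6) {
  (raw.freq / corpus.size) * base
}

normalize(18, 821273)    # Corpus A: about 21.9 per million words
normalize(47, 4337846)   # Corpus B: about 10.8 per million words
```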

You can use an online normalizing calculator to check how this works. Just input the relevant numbers without comma separators.

For our example, we can see how this affects the representation of our data:

The raw frequencies seemed to suggest that boondoggle appeared more than 2.5 times as often in Corpus B. The normalized frequencies, however, show that boondoggle is actually about twice as frequent in Corpus A.

A Twitter Post

October 18, 2012

I came across an interesting use of Twitter as a modifier in an article in the New York Times. In describing a proposed development of some very small apartments in San Francisco, the article states:

Opponents of the legislation have even taken to derisively calling the micro-units “Twitter apartments.”

Of course, the process by which brand names become common nouns or verbs is well-known: frisbee, xerox, band-aid, kleenex, hoover, google, etc. It is possible that this is the leading edge of a similar semantic shift whereby Twitter becomes a modifier meaning “very small” or “tiny.” I tried to find similar uses of Twitter but had no success. I first tried checking common nouns that collocate with tiny (apartment, by the way, is the 8th most frequent collocate). Kitchen and house seem likely possibilities, but I have yet to find any constructions in which Twitter kitchen means tiny kitchen, for example. As of yet, this use of Twitter appears to be an isolated neologism.

Verbing Names

September 2, 2012

You’ve probably seen or at least heard about Clint Eastwood’s speech at the Republican convention. It became noteworthy for his address to an imagined Obama represented by an empty chair. What is interesting to me linguistically is what happened afterward: Eastwooding became a meme. This is the newest example (putting aside any potential queasiness about the implications of the mass mocking of an elderly person) of a process of verbing names. Of course, the other recent, notable example is Tebowing.

In these instances, a celebrity’s name stands in for a specific act (talking to an empty chair or kneeling in prayer), the –ing affix is added to the name, and the spread of the linguistic form is often accompanied by a visual representation (either a photo or a video). And the linguistic form (as –ing forms do) can function as a noun as it does in this headline:

‘Tebowing’ makes transition from Internet meme to race horse

But can also be a verb in the progressive aspect as it is here:

I feel like I am Eastwooding on SS [Sweet Shangai] recently

The word formation process that makes verbs out of proper nouns isn’t new. From COHA, here’s an example of Xerox being used as a verb from 1969:

…given their obvious merit and high level of lubricity, to have them xeroxed while they were in my possession,

A well-known example of an individual whose name became verbed is the basketball player Kevin Pittsnogle. His name came to mean being beaten by an unlikely opponent or, more specifically, being beaten by the 3-point shot:

In a dramatic twist of irony, West Virginia was pittsnogled last night.

Unlike Tebowing and Eastwooding, pittsnogled usually appears as a past participle (-ed form) and in the passive. These two patterns (-ing and passive) seem to be the most frequent patterns of formation.

Part of what I think is interesting is that while there are certainly pre-Internet examples, I think the process of verbing the names of individuals is at least accelerating, particularly in computer-mediated spaces.

The names that are verbed seem to be (not surprisingly) celebrities of various stripes (actors, politicians, athletes). So, for example, U.S. Presidents are frequently verbed:

Reagan Coolidged Kennedy

Well, I’ve been Reaganed I suppose.

And then the unemployment benefits will start to run out for people, like me, where [sic] were Obamaed

There’s a reason the audio has been Nixoned.

For fun, I did some quick searches on COHA to see if there is any historical precedent for these kinds of uses, but couldn’t come up with any. Also, as the above example of Coolidged illustrates, the names that are verbed can be historical, not just contemporary. So while you can go Eastwooding, you can also be Marie Antoinetted.


Earning Our Bread

20 Aug
August 20, 2012

A radio show in the US that focuses on economic issues published something last week that they call “Money slang: Marketplace’s urban finance dictionary.” The dictionary raises a few issues that I think are interesting and one that I want to check out using COHA.

The first interesting issue is what they choose to include. They use “urban” in the title and have quite a few words that either emerge from or are popularized in hip hop and rap, like Benjamins (Sean Combs and Notorious B.I.G.) or bling (B.G.). Then we have tuppence, which is not only not from hip hop (obviously) but doesn’t seem to me to be particularly slangy, judging from its entry in the OED.

The hodgepodge nature of the list raises the issue of how the compilers are defining “slang.” Slang is a notoriously messy category. Is it the newness of the words that matters? Or is it the community of speakers that matters–namely, youth? Or is it the words’ use for social grouping rather than for something like work or eluding authorities? None of these questions has an easy answer. If you’re interested in such things, you should check out Adams.

The other issue that I wanted to explore in more detail has to do with the purported origins of some of these words. Determining the etymology of slang is tough. Particularly before computer-mediated communication, slang was primarily coined and circulated in spoken discourse, which of course leaves little historical record. We rely, then, on written records, and by the time a slang word makes it into print, it has likely been circulating elsewhere for a while.

A related problem is that popular treatments like this article sometimes circulate false (or folk) etymologies. And there is an example here that I wanted to check out because the cited word origin seemed unlikely and because I thought it posed a challenging problem for using corpora.

The entry that caught my eye was for bread. It says this:

Bread — May have originated from jazz great Lester Young. He is said to have asked “How does the bread smell?” when asking how much a gig was going to pay.

The authors have taken care to hedge their contention with the modal may, but the story of Lester Young struck me as perhaps apocryphal. So I thought it would be fun to investigate.

Bread as slang for money is a challenging search, since its meaning as food is much more common. So we need to find ways to sort out the different uses. One strategy is to think about how the different uses might have different collocational patterns. One such pattern occurs with verbs. Unfortunately, one common verb, make, isn’t of much help, since you can “make bread” in both senses. As an alternative, I thought I’d try earn, searching for bread collocating with earn within four tokens to the left or right.
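That kind of windowed collocation search is easy to sketch outside a corpus interface as well. The following is a minimal illustration, assuming a plain-text corpus string (the sample sentences and tokenization are placeholders, not COHA itself); matching on the stem earn picks up earn, earns, earned, and earning:

```python
import re

def window_concordance(text, node, collocate_stems, window=4):
    """Return snippets where `node` occurs within `window` tokens
    of any word beginning with one of `collocate_stems`."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        span = tokens[max(0, i - window):i + window + 1]
        if any(w.startswith(s) for w in span for s in collocate_stems):
            hits.append(" ".join(span))
    return hits

# Toy corpus echoing the kinds of lines discussed here
sample = ("able-bodied citizens could earn their bread in the workshops; "
          "he baked fresh bread every morning; "
          "out getting a job, earning some bread for the old lady")

for line in window_concordance(sample, "bread", ("earn",)):
    print(line)
```

The middle sentence, where bread is plain food, is correctly filtered out because no earn-form falls inside the four-token window.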

Most of the concordance lines are of this type from 1856:

workshops at the expense of the state, in which able-bodied citizens could earn their bread. Thus the people were taxed exorbitantly to maintain a costly and cumbersome and corrupting

This use is somewhat different from the slang use. It has a more generalized meaning of sustenance (as opposed to the specific meaning of money), and likely comes from the Biblical metaphor of “daily bread.” However, the pattern of collocation with earn and with work-related terms seems like a short semantic leap to the more restricted meaning of money, a leap we can see in this concordance line from 1959:

I mean I ought to be out getting a job, man. Earning some bread for the old lady. Got to have money, got to have a job

My guess is that the slang is the result of semantic narrowing from an already existing metaphorical use of bread. But like I said, documenting the precise origin isn’t easy.