Quicksand

November 13, 2013

Radiolab recently had an interesting interview with Daniel Engber. In it, Engber discusses the rise and fall of quicksand as a device in movies. (You can also read his Slate article on the same subject.) His data show a clear peak in the percentage of movies using quicksand in the 1960s (chart via Radiolab):

[Chart: percentage of movies using quicksand over time]

Engber suggests that the decline in quicksand in movies coincides with a decline in children’s fear of it. It has lost its allure as an image of terror. I was curious, then, if a similar pattern is evident in language. COHA shows the following distribution for quicksand over time:

[Chart: frequency of quicksand over time in COHA]

The data seem to show an earlier peak (in the 1900s and 1910s), followed by a dip, then a steady decline after 1940. One potentially complicating factor here is that quicksand is a relatively rare word, which could affect some of the fluctuations. So I thought this would be a good opportunity to try out BYU’s version of the Google Books data, which Mark Davies discussed at this year’s Studies in the History of the English Language conference. The results are similar, but with a more clearly defined rise and fall:

[Chart: frequency of quicksand over time in the Google Books data]

As with the COHA distribution, the Google data suggest a peak in language use that precedes the peak in movies. The data also show a move toward metaphorization in recent decades. Early in the twentieth century, examples are typically like this one:

There was a ford directly opposite the cantonment, and another, more dangerous, and known to only a few, three miles farther up stream. Keeping well within the water’s edge, so as to thus completely obscure their trail, yet not daring to venture deep for fear of striking quicksand, the plainsman sent his pony struggling forward, until the dim outline of the bank at his right rendered him confident that they had attained the proper point for crossing.

Later, however, we find examples like this one:

Only their husbands believed the kids were their own. He missed his mother, too, whose quicksand love he’d wanted so badly to escape.

Or this one:

It may well be that in responding to recent Congressional language the N.E.A. has begun to have a chilling effect on art in the United States and it may be entering the quicksand of censorship.

Not that quicksand as a physical entity disappears completely in more recent discourse:

Georgie was not popular in the swamp; in fact, the other members of the colony that inhabited this stretch of muck and quicksand, black water and scum, had banished him to the very fringe of the community.

While such examples exist, they are far fewer than the type in which some physical sensation is compared to moving through quicksand or something (like love or censorship) is compared to the ensnaring effects of quicksand itself. Thus, the use of quicksand is not only declining; it is also undergoing a shift in meaning.

Which brings us back to our earlier question: Why does quicksand peak earlier in language than in film? One possibility might be related to the relative durability of particular genres in different media. Adventure stories (like dime novels) had their greatest popularity in the early twentieth century. This genre appears to be an enthusiastic employer of quicksand as a conventional obstacle and threat, and the decline of the genre coincides with the decline of the word. How the rise of quicksand as a cinematic device relates to the rise of particular kinds of movies, I’m not sure, but the relationships among Engber’s data and the linguistic data pose some compelling questions.

Dropping the “I-Word”

April 4, 2013

The Associated Press announced that they are changing their style guide to drop the phrase “illegal immigrant” while retaining phrases like “illegal immigration” and “entering the country illegally.”

Immigrant advocates have been fighting for this change for a long time. Part of the thinking is that the phrase illegal immigrant is essentializing; it links the concept of illegality to the people themselves.

A quick look at COCA shows that in American discourse generally, immigrants are conventionally represented as illegal.

collocate      frequency   mutual information
illegal        2887        8.30
new            1032        1.67
legal          649         4.81
other          562         0.80
undocumented   537         9.42
Collocates for lemmatized immigrant on COCA using a span of 4 to the left and right

It is telling, not only that the frequency of illegal is so much higher, but also that its Mutual Information value is so high. Mutual Information is a measure of association. In a corpus, words might have a high frequency of collocation because both words are themselves frequent. Less frequent words would have lower frequency but have a high degree of association: when they appear, they appear together. Mutual Information accounts for these variations and gives us a measure of association: How likely are these words to appear in the same neighborhood? The usual cut-off for significance is MI>3.
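To make the idea of association concrete, here is a minimal sketch of one common way to compute a span-adjusted Mutual Information score: the base-2 log of the observed co-occurrences divided by the co-occurrences we would expect by chance. Whether COCA computes its values in exactly this way is an assumption on my part, and the counts in the example are hypothetical, not taken from COCA.

```python
import math

def mutual_information(cooccurrences, node_freq, collocate_freq, corpus_size, span):
    """Span-adjusted MI: log2 of observed co-occurrences over those expected by chance."""
    # Expected co-occurrences if the two words were distributed independently
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(cooccurrences / expected)

# Hypothetical counts for illustration only (not taken from COCA)
score = mutual_information(
    cooccurrences=64,
    node_freq=100,
    collocate_freq=100,
    corpus_size=10_000,
    span=8,  # 4 words to the left + 4 to the right
)
print(f"MI = {score:.2f}")
```

With these made-up numbers, the pair co-occurs eight times more often than chance would predict, which puts the score right at the usual cut-off of 3.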

So illegal and immigrant (MI=8.30) have a very high degree of association. This is somewhat surprising given that illegal doesn’t seem like a particularly specialized modifier. Undocumented has a higher degree of association (MI=9.42). Its use, however, is far more restricted. It appears only with nouns related to the movement of people across borders: workers, students, aliens, people, migrants, etc.

Another interesting feature of this debate is how recent the practice of representing immigrants as illegal is. Despite a long history of contentious discourse around immigration in the US, the results from COHA show that this specific construction is quite new.

[Chart: COHA results for illegal immigrant over time]

For an analysis of representations of immigrants and immigration in Britain, see Gabrielatos and Baker (2008).

Identity and the Synthesized Voice

December 22, 2012

On the BBC’s “Click” podcast last week, they discussed evolving voice technologies, and their increasing ability to incorporate complex prosodic and phonological properties like emotional emphasis and accent. It’s an interesting conversation and worth listening to, but the segment that caught my attention touched on Stephen Hawking and the preservation of legacy technologies related to his voice:

Stephen Hawking's Synthesized Voice 

A 2001 Deborah Cameron article on “designer voices” provides some interesting context for Hawking’s choices regarding his synthesized voice:

Hawking faces an unusual choice here. It may seem obvious that a British accent would be more ‘authentic’ than the American one he has had to make do with up until now; this is a question of what linguists call ‘social indexicality’, the ability of the speaking voice to point to social characteristics like age, gender, class, ethnicity or, most saliently in this case, national origin. Yet voices are also privileged markers of individual identity. It would not be unreasonable for Hawking to take the view that his synthesised American voice essentially is his voice, in the sense that it is instantly recognisable as the voice of Stephen Hawking (indeed, so recognisable that Hawking can earn money doing advertising voice-overs). Some media reports noted that the scientist was having difficulty deciding whether to make the shift: he was quoted as saying that ‘it will bring a real identity crisis’.

There are a couple of issues here that I think are compelling. The first is one that both the podcast and Cameron note: Hawking’s choice speaks to the complex and powerful connection between our voice and our identity(s). In particular, it suggests that such a relationship can be an evolving one. In his early encounters with his synthesized voice, Hawking experiences a kind of distance from its American-ness. However, he ultimately chooses to retain it in the face of other possibilities. And there is no question how recognizable his voice is, inspiring a raft of amateur and professional parodies.

The other interesting issue is what that choice has meant technically. It has necessitated the preservation of legacy technologies in order to maintain a powerful signifier of self. Which makes me wonder. As technologies continue to emerge that connect to our biologies, our identities and the signals we use to advertise those identities will change, too. Those inter-related changes, technological and indexical, may not move at the same pace, however. What might this mean? Could a voice, an identity, face a forced obsolescence?

Understanding Keyness

November 30, 2012

In order to identify significant differences between 2 corpora or 2 parts of a corpus, we often use a statistical measure called keyness.

Imagine two highly simplified corpora. Each contains only 3 different words (cat, dog, and cow) and has a total of 100 words. The frequency counts are as follows:

  • Corpus A: cat 52; dog 17; cow 31
  • Corpus B: cat 9; dog 40; cow 31

Cat and dog would be key, as they are distributed differently across the corpora, but cow would not as its distribution is the same. Put another way, cat and dog are distinguishing features of the corpora; cow is not.

Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.

There are 2 common methods for calculating distributional differences: a chi-squared test (or χ² test) and log-likelihood. AntConc, for example, gives you the option of selecting one or the other under “Tool Preferences” for Keyword settings.

For this post, I’m going to concentrate on the chi-squared test. Lancaster University has a nice explanation of log-likelihood here.

A chi-squared test measures the significance of the difference between an observed and an expected frequency. To see how this works, I’m going to walk through an example from Rayson et al. (1997), summarized by Baker (2010).

For this example, we’ll consider the distribution by sex of the word lovely. Here is their data:

 
           lovely   All other words   Total words
Males      414      1714029           1714443
Females    1214     2592238           2593452
Total      1628     4306267           4307895
Table 1: Use of 'lovely' by males and females in the spoken demographic section of the BNC

What we want to determine is the statistical significance of this distribution. To do so, we’ll apply the same formula that they did: a Pearson’s chi-squared test. Here is the formula:

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

O is the observed frequency, and E is the expected frequency if the independent variable (in this case sex) had no effect on the distribution of the word. The \sum is the mathematical symbol for ‘sum’. In our example, we need to calculate the sum of (or add together) four values: observed minus expected squared, divided by expected for (1) lovely used by males; (2) other words used by males; (3) lovely used by females; and (4) other words used by females.

In other words, our main calculations are for the four inner cells of the table; we have a 2×2 contingency table. The totals, the peripheral values in the last row and last column, are values that we use to determine our expected frequencies.

To find the expected frequency for a value in our 2×2 table, we use the following formula:

(R * C) / N

That is, the row total (R) times the column total (C) divided by the number of words in the corpus (N).

The expected value of lovely for males is:

(1714443 * 1628) / 4307895 = 647.91

If we then fill out the expected frequencies for the rest of our table, it looks like this:

 
           lovely   All other words   Total words
Males      647.91   1713795.09        1714443
Females    980.09   2592471.91        2593452
Total      1628     4306267           4307895
Table 2: Expected use of 'lovely' in the spoken demographic section of the BNC
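The expected frequencies in Table 2 can be reproduced with a few lines of Python. This is only a sketch of the (R * C) / N step; the function name `expected_frequencies` is my own and not part of any concordancing program.

```python
def expected_frequencies(observed):
    """Return expected counts (R * C / N) for each cell of a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)  # total number of words in the corpus
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts from Table 1 (totals are left out; they are recomputed above)
observed = [
    [414, 1714029],   # males: lovely, all other words
    [1214, 2592238],  # females: lovely, all other words
]

for row in expected_frequencies(observed):
    print([round(cell, 2) for cell in row])
```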

Now we can finish our calculations. For each of our table cells, we need to subtract the expected frequency from the observed frequency; multiply that value by itself; then divide the result by the expected frequency. The calculations for each cell would look like this:

 
           lovely                                        All other words                                                    Total words
Males      ((414 - 647.91) * (414 - 647.91)) / 647.91    ((1714029 - 1713795.09) * (1714029 - 1713795.09)) / 1713795.09    1714443
Females    ((1214 - 980.09) * (1214 - 980.09)) / 980.09  ((2592238 - 2592471.91) * (2592238 - 2592471.91)) / 2592471.91    2593452
Total      1628                                          4306267                                                            4307895
Table 3: Calculating the chi-square values for our frequencies

When we complete those calculations, our contingency table looks like this:

 
           lovely   All other words   Total words
Males      84.45    0.03              1714443
Females    55.83    0.02              2593452
Total      1628     4306267           4307895
Table 4: Chi-square values for our frequencies

Now we just add together those four values:

84.45 + 0.03 + 55.83 + 0.02 = 140.3

If you want to check our calculations, you can use the calculator below. The input fields are designed to make it easy as they take values that concordancing programs typically generate: word list counts and corpus sizes.

  • Pearson Chi-Square Calculator
  • Chi-Square
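For readers who would rather script the check, the whole calculation can be sketched in Python. Like the calculator, it takes a word's count and the corpus size for each of the two corpora. The function name `pearson_chi_square` is my own, and the sketch applies no continuity correction, to match the walkthrough above.

```python
def pearson_chi_square(count_a, size_a, count_b, size_b):
    """Pearson chi-square for one word's counts in two corpora (2x2 table)."""
    observed = [
        [count_a, size_a - count_a],  # corpus A: the word, all other words
        [count_b, size_b - count_b],  # corpus B: the word, all other words
    ]
    n = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, n - (count_a + count_b)]

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency: (R * C) / N
            chi_square += (o - e) ** 2 / e
    return chi_square

# 'lovely' in male and female speech, from Table 1
print(round(pearson_chi_square(414, 1714443, 1214, 2593452), 2))
```

Because the script does not round the intermediate values, its result can differ in the second decimal place from a sum of the rounded cells in Table 4, but it agrees with the total of roughly 140.3.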

Now the question is: What does this number tell us? Typically, we determine the significance of keyness values in one of two ways. First, sometimes corpus linguists just look at the top key words (maybe the top 20) to explore the most marked differences between two corpora. Second, we can find the p-value.

The p-value is the probability that our observed frequencies are the result of chance. To find the p-value we need to look it up on a published table (or use an online calculator). In either case, we need to know the degrees of freedom. In simple terms, for a contingency table the degrees of freedom are the number of rows minus one multiplied by the number of columns minus one (not including totals).

For us, then, df = 1

We can look up the p-value on the simplified table below, where the degree of freedom is one.

p     .1     .05    .01    .005   .001    .0001
χ²    2.71   3.84   6.63   7.88   10.83   15.14
Table 5: Chi-square distribution table showing critical p-values for df = 1

The critical cutoff point for statistical significance is usually at p<.01 (though it can also be p<.05). So a chi-square value above 6.63 would be considered significant. Our value is 140.3, so the distribution of lovely is highly significant (p<.0001). In other words, the probability that our observed distribution arose by chance approaches zero.
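If you would rather compute the p-value than look it up, the case of one degree of freedom has a convenient closed form: a chi-square variable with df = 1 is the square of a standard normal variable, so the upper-tail probability can be written with the complementary error function in Python's standard library. The sketch below covers df = 1 only; higher degrees of freedom need a statistics library. The function name `p_value_df1` is my own.

```python
import math

def p_value_df1(chi_square):
    """Upper-tail probability of a chi-square value with one degree of freedom."""
    # For df = 1, P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi_square / 2))

# Two critical values from the table, then our value for 'lovely'
for value in (3.84, 6.63, 140.3):
    print(value, p_value_df1(value))
```

The first two values come out very close to .05 and .01, matching the table, and the value for lovely is many orders of magnitude below .0001.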

I want to leave you with a couple of tips, questions, and resources:

  1. Under most circumstances, our concordancing software can calculate keyness values for us; however, if we are interested in multi-word units like phrasal verbs, we can use online chi-square or log-likelihood calculators to determine their keyness.
  2. Typically, our chi-square tests in corpus linguistics will involve a 2×2 contingency table (with a degree of freedom of one); however, this isn’t always the case. We might be interested in, for example, distributions of multiple spellings (e.g., because, cause, cuz, coz) that would involve higher degrees of freedom.
  3. Why would comparing larger corpora tend to produce results with larger chi-square (or keyness) values?
  4. Anatol Stefanowitsch has an excellent entry on chi-square tests here.
  5. There is a nice chi-square calculator here.

Normalizing Word Counts

November 16, 2012

One of the things we often do in corpus linguistics is to compare one corpus (or one part of a corpus) with another. Let’s imagine an example. We have 2 corpora: Corpus A and Corpus B. And we’re interested in the frequency of the word boondoggle. We find 18 occurrences in Corpus A and 47 occurrences in Corpus B. So we make the following chart:

The problem here is that unless Corpus A and Corpus B are exactly the same size, this chart is misleading. It doesn’t accurately reflect the relative frequencies in each corpus. In order to accurately compare corpora (or sub-corpora) of different sizes, we need to normalize their frequencies.

Let’s say Corpus A contains 821,273 words and Corpus B contains 4,337,846 words. Our raw frequencies then are:

Corpus A = 18 per 821,273 words

Corpus B = 47 per 4,337,846 words

To normalize, we want to calculate the frequencies for each per the same number of words. The convention is to calculate per 10,000 words for smaller corpora and per 1,000,000 for larger ones. The Corpus of Contemporary American English, for example, uses per-million calculations in the chart display for comparisons across text-types.

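Calculating a normalized frequency is a fairly straightforward process. The equation can be represented in this way:

\frac{18}{821273} = \frac{x}{1000000}

We have 18 occurrences per 821,273 words, which is the same as x (our normalized frequency) per 1,000,000 words. We can solve for x with simple cross multiplication:

x = \frac{18 \times 1000000}{821273} \approx 21.92

Generalizing then (normalizing per one-million words):

\text{normalized frequency} = \frac{\text{raw frequency} \times 1000000}{\text{corpus size}}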

You can use the calculator below to see how this works. Just input the relevant numbers without commas.

  • Normalizing Calculator (do not use comma separators)
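The same calculation is easy to script. Here is a minimal sketch in Python; the function name `normalize` and its `per` parameter are my own choices, with per-million as the default base.

```python
def normalize(raw_count, corpus_size, per=1_000_000):
    """Frequency per `per` words (per million by default)."""
    return raw_count / corpus_size * per

# boondoggle in our two example corpora
corpus_a = normalize(18, 821_273)
corpus_b = normalize(47, 4_337_846)

print(f"Corpus A: {corpus_a:.2f} per million")
print(f"Corpus B: {corpus_b:.2f} per million")
print(f"A is {corpus_a / corpus_b:.2f} times as frequent as B")
```

For a smaller corpus, passing `per=10_000` gives the per-ten-thousand figure instead.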

For our example, we can see how this affects the representation of our data:

The raw frequencies seemed to suggest that boondoggle appeared more than 2.5 times as often in Corpus B. The normalized frequencies, however, show that boondoggle is actually about twice as frequent in Corpus A.

A Twitter Post

October 18, 2012

I came across an interesting use of Twitter as a modifier in an article in the New York Times. In describing a proposed development of some very small apartments in San Francisco, the article states:

Opponents of the legislation have even taken to derisively calling the micro-units “Twitter apartments.”

Of course, the process by which brand names become common nouns or verbs is well known: frisbee, xerox, band-aid, kleenex, hoover, google, etc. It is possible that this is the leading edge of a similar semantic shift whereby Twitter becomes a modifier meaning “very small” or “tiny.” I tried to find similar uses of Twitter but had no success. I first tried checking common nouns that collocate with tiny (apartment, by the way, is the 8th most frequent collocate). Kitchen and house seem likely possibilities, but I have yet to find any constructions in which Twitter kitchen means tiny kitchen, for example. As of yet, this use of Twitter appears to be an isolated neologism.

Verbing Names

September 2, 2012

You’ve probably seen or at least heard about Clint Eastwood’s speech at the Republican convention. It became noteworthy for his address to an imagined Obama represented by an empty chair. What is interesting to me linguistically is what happened afterward: Eastwooding became a meme. This is the newest example (putting aside any potential queasiness about the implications of the mass mocking of an elderly person) of a process of verbing names. Of course, the other recent, notable example is Tebowing.

In these instances, a celebrity’s name stands in for a specific act (talking to an empty chair or kneeling in prayer), the -ing affix is added to the name, and the spread of the linguistic form is often accompanied by a visual representation (either a photo or a video). And the linguistic form (as -ing forms do) can function as a noun as it does in this headline:

‘Tebowing’ makes transition from Internet meme to race horse

But can also be a verb in the progressive aspect as it is here:

I feel like I am Eastwooding on SS [Sweet Shangai] recently

The word formation process that makes verbs out of proper nouns isn’t new. From COHA, here’s an example of Xerox being used as a verb from 1969:

…given their obvious merit and high level of lubricity, to have them xeroxed while they were in my possession,

A well-known example of an individual whose name became verbed is the basketball player Kevin Pittsnogle. His name came to mean being beaten by an unlikely opponent or specifically by the 3-point shot:

In a dramatic twist of irony, West Virginia was pittsnogled last night.

Unlike Tebowing and Eastwooding, pittsnogled usually appears in the past participle (-ed form) and in the passive. These two patterns (-ing and passive) seem to be the most frequent patterns of formation.

Part of what I think is interesting is that while there are certainly pre-Internet examples, I think the process of verbing the names of individuals is at least accelerating, particularly in computer-mediated spaces.

The names that are verbed seem to be (not surprisingly) celebrities of various stripes (actors, politicians, athletes). So, for example, U.S. Presidents are frequently verbed:

Reagan Coolidged Kennedy

Well, I’ve been Reaganed I suppose.

And then the unemployment benefits will start to run out for people, like me, where [sic] were Obamaed

There’s a reason the audio has been Nixoned.

For fun, I did some quick searches on COHA to see if there is any historical precedent for these kinds of uses, but couldn’t come up with any. Also, as the above example of Coolidged illustrates, the names that are verbed can be historical not just contemporary. So while you can go Eastwooding, you can also be Marie Antoinetted:

 

Earning Our Bread

20 Aug
August 20, 2012

A radio show in the US that focuses on economic issues published something last week that they call “Money slang: Marketplace’s urban finance dictionary.” The dictionary raises a few issues that I think are interesting and one that I want to check out using COHA.

The first interesting issue is what they choose to include. They use “urban” in the title and have quite a few words that either emerge from or are popularized in hip hop and rap, like Benjamins (Sean Combs and Notorious B.I.G.) or bling (B.G.). Then we have tuppence, which is not only not from hip hop (obviously) but doesn’t seem to me to be particularly slangy (from the OED):

The hodgepodge nature of the list raises the issue of how the compilers are defining “slang.” Slang is a notoriously messy category. Is it the newness of the words that matters? Or is it the community of speakers that matters, namely youth? Or is it the words’ use for social grouping rather than something like work or eluding authorities? None of these questions necessarily has an easy answer. If you’re interested in such things, you should check out Adams.

The other issue that I wanted to explore in more detail has to do with the purported origins of some of these words. Determining the etymology of slang is tough. Particularly before computer-mediated communication, slang was primarily coined and circulated in spoken discourse, which of course leaves little historical record. We rely, then, on written records, and by the time a slang word makes it into print, it has likely been circulating elsewhere for a while.

A related problem is that popular treatments like this article sometimes circulate false (or folk) etymologies. And there is an example here that I wanted to check out because the cited word origin seemed unlikely and because I thought it posed a challenging problem for using corpora.

The entry that caught my eye was for bread. It says this:

Bread — May have originated from jazz great Lester Young. He is said to have asked “How does the bread smell?” when asking how much a gig was going to pay.

The authors have taken care to hedge their contention with the modal may, but the story of Lester Young struck me as perhaps apocryphal. So I thought it would be fun to investigate.

Bread as slang for money is a challenging search as its meaning as food is much more common. So we need to find ways to sort out the different uses. One strategy is to think about how the different uses might have different collocational patterns. One such pattern occurs with verbs. Unfortunately, one common verb, make, isn’t of much help, since you can “make bread” in both senses. As an alternative, I thought I’d try earn. A search for bread collocating with earn within four tokens to the left or right yields this result:

Most of the concordance lines are of this type from 1856:

workshops at the expense of the state, in which able-bodied citizens could earn their bread. Thus the people were taxed exorbitantly to maintain a costly and cumbersome and corrupting

This use is somewhat different from the slang use. It has a more generalized meaning of sustenance (as opposed to the specific meaning of money), and likely comes from the Biblical metaphor of “daily bread.” However, given this pattern of collocation with earn and with work-related terms, it seems like a short semantic leap to the more restricted meaning of money, which is the meaning bread has in this concordance line from 1959:

I mean I ought to be out getting a job, man. Earning some bread for the old lady. Got to have money, got to have a job

My guess is that the slang is the result of semantic narrowing from an already existing metaphorical use of bread. But like I said, documenting the precise origin isn’t easy.
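For anyone who wants to try this kind of search on their own texts, here is a minimal sketch of the collocation strategy described above: find every occurrence of bread and check whether a form of earn falls within four tokens on either side. This is not how the COHA interface works internally; the function names, the simple regex tokenizer, and the prefix match standing in for a lemma search are all my own assumptions. The sample text is the 1959 concordance line quoted above.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def collocations(tokens, node, collocate_prefix, window=4):
    """Return (position, context) pairs for each occurrence of `node`
    that has a token starting with `collocate_prefix` within `window`
    tokens to its left or right."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Prefix matching is a crude stand-in for a lemma search
        # (earn, earns, earned, earning); it is an assumption of this sketch.
        neighbours = [tokens[j] for j in range(lo, hi) if j != i]
        if any(t.startswith(collocate_prefix) for t in neighbours):
            hits.append((i, " ".join(tokens[lo:hi])))
    return hits

sample = ("I mean I ought to be out getting a job, man. "
          "Earning some bread for the old lady.")
hits = collocations(tokenize(sample), "bread", "earn")
for position, context in hits:
    print(position, context)
```

A sentence like "we bake bread every day" returns no hits, which is the point of the strategy: the collocate filters out most of the food sense, though the hits still need to be read by hand.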

Bookstores, Dictionaries, & Linguistic Authority

14 Aug
August 14, 2012

Jimmy Santiago Baca is an American poet of Indio-Mexican descent and the winner of multiple literary prizes. In an interview, he tells the story of his first exposure to English poetry. He was serving time in juvenile detention and got a hold of a copy of Wordsworth. He was captivated, but didn’t understand all of the 18th/19th century lexicon. This is the part of the story that I found interesting:

Bookstores 

There is a body of scholarship that examines dictionaries as both tools for promoting emergent national identities and, alternatively, as tools of exclusion and expressions of linguistic (and, by implication social and political) authority.

However, his framing of bookstores as sites of colonial power and as inaccessible to those outside of mainstream (or dominant) culture is something that I don’t think has received a lot of attention (or at least a quick search on Google Scholar doesn’t turn up much).

For the rest of the interview, listen here.

 

 

Gender Pronouns in the News

12 Aug
August 12, 2012

An article in the news yesterday reported on a study that is analyzing changes in the frequencies of the masculine (he/him/his) and feminine (she/her/hers) pronouns using Google books as a corpus. I’m sure most of you are familiar with Google books. In December of 2010, they introduced an interface that allows a limited set of searches of their digitized collection. The introduction of the interface was accompanied by an article in Science that weirdly didn’t acknowledge the existence of corpus linguistics and seemed to reinvent the field under the new (and clumsy) moniker of “culturomics”. The shortcomings of the article and the corpus were quickly addressed by linguists like Mark Liberman and Geoff Nunberg.

What these responses point out are the limitations of Google books as a corpus: it is not principled; it is not balanced; because its construction is automated, it contains numerous errors; etc. And my own experiences using Google books to find texts to include in my own corpora reinforce these kinds of concerns. I have often found texts with publication dates that are incorrect, as well as texts (particularly older texts) that have numerous errors from OCR scanning.

Given some of the potential problems with Google books as a corpus resource, I was interested in putting some of the findings of the pronoun study to the test. For my Coffee Break Experiment, I’m using COHA rather than Google books, and Mark Davies has written a nice comparison of the two. I was curious whether the findings of the study would be confirmed or not.

For the purposes of this experiment, I’m only going to look at the subject pronouns he and she. First, using the BYU interface (rather than the n-gram viewer), here are the results from Google books:

Now, here are the results from COHA:

According to the AP article, the research finds:

The ratio of male to female pronouns was roughly 3.5:1 until 1950, when the gap began to widen as more women stayed home after World War II, and peaked at around 4.5:1 in the mid-1960s. The ratio had shrunk to 3:1 by 1975, and less than 2:1 by 2005.

The data from COHA shows a pattern that is similar, though slightly different. Here is a chart showing the change in the ratio of frequencies over time:

The ratios in COHA are all lower than what is reported in the study. Also, COHA shows more movement early, then a steadier ratio of about 2.1:1 from 1860-1940. Then, however, we find the same inflection point–albeit at a lower ratio (3:1 vs. 4.5:1) in 1960.
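The ratios discussed here are just the frequency of he divided by the frequency of she in each decade. Here is a minimal sketch of that arithmetic. The frequencies in it are invented placeholders for illustration only, not figures from COHA or from the study, and the function and variable names are my own.

```python
def pronoun_ratio(he_freq, she_freq):
    """Ratio of masculine to feminine subject pronouns (he : she)."""
    if she_freq == 0:
        raise ValueError("she_freq must be non-zero")
    return he_freq / she_freq

# Invented per-million frequencies, for illustration only (NOT COHA data).
hypothetical_freqs = {
    1950: {"he": 6000.0, "she": 2000.0},
    2000: {"he": 5000.0, "she": 2500.0},
}

ratios = {
    decade: round(pronoun_ratio(f["he"], f["she"]), 2)
    for decade, f in hypothetical_freqs.items()
}
for decade, ratio in sorted(ratios.items()):
    print(f"{decade}s: {ratio}:1")
```

Because both pronouns in a decade are normalized by the same corpus size, the ratio comes out the same whether you use raw counts or per-million frequencies. That makes it a convenient measure for comparing corpora of very different sizes, like Google books and COHA.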

Of course, what is most interesting about all of this–and the reason it gets attention in the press–is speculation as to why. What is driving these changes in frequency? The article quotes James Pennebaker. He suggests:

“Pronouns are a sign of people paying attention and as women become more present in the workforce, in the media and life in general, people are referring to them more.”

We should be careful about attributing psychological motivations to the production of specific linguistic features like pronouns. That said, there is clearly something going on that is driving these changes.