An article in the news yesterday reported on a study that analyzes changes in the frequencies of the masculine (he/him/his) and feminine (she/her/hers) pronouns using Google Books as a corpus. I’m sure most of you are familiar with Google Books. In December of 2010, Google introduced an interface that allows a limited set of searches of its digitized collection. The introduction of the interface was accompanied by an article in Science that weirdly didn’t acknowledge the existence of corpus linguistics and seemed to reinvent the field under the new (and clumsy) moniker of “culturomics”. The shortcomings of the article and the corpus were quickly addressed by linguists like Mark Liberman and Geoff Nunberg.
What these responses point out are the limitations of Google Books as a corpus: it is not principled; it is not balanced; because its construction is automated, it contains numerous errors; etc. My own experiences using Google Books to find texts to include in my own corpora reinforce these kinds of concerns. I have often found texts with incorrect publication dates, as well as texts (particularly older texts) with numerous OCR scanning errors.
Given some of the potential problems with Google Books as a corpus resource, I was interested in putting some of the findings of the pronoun study to the test. For my Coffee Break Experiment, I’m using COHA rather than Google Books, and Mark Davies has written a nice comparison of the two. I was curious whether the findings of the study would be confirmed.
For the purposes of this experiment, I’m only going to look at the subject pronouns he and she. First, using the BYU interface (rather than the n-gram viewer), here are the results from Google Books:
Now, here are the results from COHA:
According to the AP article, the researchers found:
The ratio of male to female pronouns was roughly 3.5:1 until 1950, when the gap began to widen as more women stayed home after World War II, and peaked at around 4.5:1 in the mid-1960s. The ratio had shrunk to 3:1 by 1975, and less than 2:1 by 2005.
The data from COHA show a pattern that is broadly similar, though not identical. Here is a chart showing the change in the ratio of frequencies over time:
The ratios in COHA are all lower than what is reported in the study. COHA also shows more movement early on, then a steadier ratio of about 2.1:1 from 1860 to 1940. After that, however, we find the same inflection point in 1960, albeit at a lower ratio (3:1 vs. 4.5:1).
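For anyone who wants to repeat the comparison, the ratio is nothing more than the frequency of he divided by the frequency of she in each decade. Here is a minimal sketch in Python, assuming the per-decade frequencies have been copied out of the corpus interface by hand. The function names and the numbers in `example` are my own placeholders for illustration; they are not values from COHA or from the study.

```python
from typing import Dict, Tuple


def pronoun_ratios(freqs: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    """Return the he:she ratio for each decade.

    `freqs` maps a decade (e.g. 1960) to a (he, she) pair of frequencies.
    Raw counts and per-million figures give the same ratio, as long as
    both numbers in a pair come from the same decade.
    """
    ratios = {}
    for decade, (he, she) in sorted(freqs.items()):
        if she <= 0:
            continue  # no ratio is defined without any occurrences of "she"
        ratios[decade] = he / she
    return ratios


def peak_decade(ratios: Dict[int, float]) -> int:
    """Return the decade in which the he:she ratio is highest."""
    return max(ratios, key=ratios.get)


# Placeholder frequencies for illustration only (NOT real corpus data).
example = {
    1950: (3000.0, 1000.0),
    1960: (4000.0, 1000.0),
    1970: (3500.0, 1400.0),
}

if __name__ == "__main__":
    ratios = pronoun_ratios(example)
    for decade, ratio in ratios.items():
        print(f"{decade}s: {ratio:.1f}:1")
    print("peak decade:", peak_decade(ratios))
```

Working from ratios rather than raw frequencies is what makes the two corpora comparable at all, since Google Books and COHA differ enormously in size from decade to decade.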
Of course, what is most interesting about all of this–and the reason it gets attention in the press–is speculation as to why. What is driving these changes in frequency? The article quotes James Pennebaker. He suggests:
“Pronouns are a sign of people paying attention and as women become more present in the workforce, in the media and life in general, people are referring to them more.”
We should be careful about attributing psychological motivations to the production of specific linguistic features like pronouns. That said, there is clearly something going on that is driving these changes.