Song Lyrics Data Tables
(Download links are below.)
Recently, a student of mine who is interested in analyzing song lyrics pointed me to the Million Song Dataset (see here), compiled by LabROSA at Columbia University. It’s a great and robust resource. However, it doesn’t have any dates in its metadata, because its source, musiXmatch, does not store that information. My student, like many linguistics researchers, was interested in change over time. So I started poking around and discovered MusicBrainz, which backs up its entire database of song and band metadata twice a week and allows anyone to download all of that data (see here).
So I set out to link the metadata from MusicBrainz to the songs in the Million Song Dataset, using R, in the hope that we could generate normalized word counts and track diachronic changes in word frequencies. The linking process did not produce dates for all songs, so those without dates were dropped from the counts. Even so, the linking had a success rate of approximately 80%, producing a dataset of approximately 44.5 million words.
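The post does not spell out how songs were matched, so the following is only a minimal sketch of one way such a link could work, written in Python rather than the R actually used. It assumes an exact join on lowercased, whitespace-normalized artist and title strings. The field names (`artist`, `title`, `year`), the helper functions, and the toy records are all hypothetical.

```python
def normalize_key(artist, title):
    """Build a case- and whitespace-insensitive join key (an assumed matching rule)."""
    return (" ".join(artist.lower().split()), " ".join(title.lower().split()))


def link_years(songs, releases):
    """Attach a year to each song; songs with no matching release are dropped."""
    year_by_key = {}
    for rel in releases:
        key = normalize_key(rel["artist"], rel["title"])
        # When a recording has several releases, keep the earliest year.
        if key not in year_by_key or rel["year"] < year_by_key[key]:
            year_by_key[key] = rel["year"]

    linked = []
    for song in songs:
        key = normalize_key(song["artist"], song["title"])
        if key in year_by_key:
            linked.append({**song, "year": year_by_key[key]})
    return linked


# Hypothetical toy records, not taken from either dataset.
songs = [
    {"artist": "Example Band", "title": "First Song"},
    {"artist": "Example Band", "title": "Unmatched Song"},
]
releases = [
    {"artist": "example band", "title": "First  Song", "year": 1999},
    {"artist": "Example Band", "title": "First Song", "year": 1997},
]

linked = link_years(songs, releases)
match_rate = len(linked) / len(songs)  # share of songs that received a date
```

In this toy case only one of the two songs finds a release, so half of the records would be dropped. A real link would likely need fuzzier matching than exact string equality.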
A breakdown of the word counts by decade is in the table below.
decade | total words |
---|---|
1890 | 282 |
1900 | 290 |
1910 | 2310 |
1920 | 2167 |
1930 | 4044 |
1940 | 5037 |
1950 | 56968 |
1960 | 332107 |
1970 | 993649 |
1980 | 3645975 |
1990 | 12454060 |
2000 | 24770781 |
2010 | 2164230 |
Unsurprisingly, the data skews newer. As a result, normalized frequencies for the earlier decades give too much weight to the small number of tokens that are present. With that in mind, I’m providing data tables in four versions:
- Full versions of the data with word totals by year or by decade.
- Truncated versions with word totals by year or by decade from 1950 on.
The data is provided in four *.csv tables. Each table has five columns:
- The word token
- The year or decade in which it occurs
- A count of that word
- A total count of all tokens occurring in that year or decade
- A normalized (per million word) frequency of the word
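The normalized column is simply the word count divided by the period total, scaled to one million. As a minimal sketch, the Python function below applies that formula to two decade totals from the table above. The word count of 10 is an invented value, used only to show why sparse early decades are overweighted.

```python
def per_million(count, total):
    """Normalized frequency: occurrences per million tokens in the period."""
    return count / total * 1_000_000


# Decade totals copied from the table above.
total_1950s = 56968
total_2000s = 24770781

# Hypothetical count: the same 10 occurrences in each decade.
freq_1950s = per_million(10, total_1950s)
freq_2000s = per_million(10, total_2000s)
```

The same ten occurrences come out at roughly 176 per million in the 1950s but only about 0.4 per million in the 2000s, which is why a handful of tokens can dominate the normalized figures for the sparse early decades.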
With the data set up this way, it is easy to subset the data in either R or Excel and to create trend graphs like this one for the first-person pronoun I:
That particular result is potentially provocative, as it runs counter to some studies (here) and popular publications (here) that have sought to document increasing first-person pronoun usage and connect it to supposed rises in narcissism.
Or we can generate a less surprising one for gangsta:
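The series behind trend graphs like these can be pulled out with a few lines of code. The sketch below uses Python's standard library rather than R or Excel. The header names (`word`, `decade`, `count`, `total`, `per_million`) are assumptions and the sample rows are invented, so check the real files before relying on them.

```python
import csv
import io

# Invented sample rows; the header names in the real files may differ.
SAMPLE = """word,decade,count,total,per_million
i,1990,50,1000,50000.0
love,1990,20,1000,20000.0
i,2000,80,2000,40000.0
love,2000,30,2000,15000.0
"""


def trend_for(word, csv_file):
    """Return (decade, per-million frequency) pairs for one word, sorted by decade."""
    rows = csv.DictReader(csv_file)
    points = [
        (int(row["decade"]), float(row["per_million"]))
        for row in rows
        if row["word"] == word
    ]
    return sorted(points)


trend = trend_for("i", io.StringIO(SAMPLE))
```

To run this on one of the downloaded tables, pass an open file handle in place of the `io.StringIO` object. Because the data is stemmed, the word has to be given in its stemmed form.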
A couple of last notes about the data:
- It is in the bag-of-words format chosen for the Million Song Dataset in order to protect copyright.
- With that in mind, if you want to examine context, you will have to do so using supplemental sources.
- You also cannot process this data in a concordancer like AntConc. The tables are better suited to use in R or as a spreadsheet.
- The data is stemmed. You can see the explanation at the Million Song Dataset home (here).
- If you use the data, please cite both the data here and the Million Song Dataset.
To cite:
Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
Brown, D. W. (2017). Song Lyrics Data Tables. Retrieved from http://www.thegrammarlab.com