1,000,000 Word Sample Corpora
Below are links to some corpora that I’ve put together using random sampling techniques. The idea is to take a much larger data set and reduce its size while retaining its representativeness. The advantage is that you end up with something more manageable, particularly if you’re using an off-the-shelf concordancer like AntConc.
There is a fiction sample, an academic sample, and a MICUSP sample. These are intended to provide data for beginners and should be particularly useful for:
- Illustrating patterns in register/text-type variation.
- Use as reference corpora (say you want to generate a keyword list comparing Jane Austen’s collected works against a generic fiction corpus in order to analyze her style).
- Lexical and morphological analysis.
These files were put together with R code built on Matthew Jockers’ “chunking function,” which he created for topic modeling. His code takes a text file and divides it into chunks of a given size. My adaptation takes advantage of R’s random sampling capabilities to return any number of randomly selected chunks. The code then concatenates the chunks into a single file.
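The original R code isn’t reproduced here, but the chunk-and-sample procedure can be sketched in Python. Everything below (function names, the word-splitting rule, sampling one or more chunks per text) is my own illustrative assumption, not the actual implementation:

```python
import random

def chunk_text(text, chunk_size=1000):
    """Split a text into consecutive chunks of chunk_size words each."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def sample_chunks(texts, n_chunks=1, chunk_size=1000, seed=None):
    """Draw n_chunks random chunks from each text, then concatenate
    everything into one newline-separated string (one chunk per line)."""
    rng = random.Random(seed)
    sampled = []
    for text in texts:
        chunks = chunk_text(text, chunk_size)
        # Guard against texts shorter than n_chunks * chunk_size.
        sampled.extend(rng.sample(chunks, min(n_chunks, len(chunks))))
    return "\n".join(sampled)
```

With a seed set, the sampling is reproducible, which is handy if you want to document exactly how a sample corpus was built.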
The fiction sample takes a 1,000-word chunk from each of 1,000 works of fiction published between 1800 and 2000. It also controls for decade, taking 50 chunks for each 10-year period, so the fiction sample is balanced over time.
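Controlling for decade amounts to stratified sampling: group the works by decade, then draw a fixed number from each group. A hedged Python sketch of that step (the pairing of each work with its publication year is an assumed data format, not how the corpus is actually stored):

```python
import random
from collections import defaultdict

def sample_by_decade(works, per_decade=50, seed=None):
    """works: list of (year, text) pairs. Returns a dict mapping each
    decade (e.g. 1810) to per_decade randomly chosen texts from it,
    so the resulting sample is balanced over time."""
    rng = random.Random(seed)
    by_decade = defaultdict(list)
    for year, text in works:
        by_decade[(year // 10) * 10].append(text)
    sample = {}
    for decade, texts in sorted(by_decade.items()):
        sample[decade] = rng.sample(texts, min(per_decade, len(texts)))
    return sample
```

Each decade's texts would then be chunked and concatenated as described above.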
The academic sample is a little different in that the corpus it comes from is a continuous text of articles in 40 different disciplines published between 2000 and 2013. Because it is drawn from one continuous file, each 1000 word chunk (or line in the text file) may span the end of one article and the beginning of the next.
The MICUSP sample was put together much like the fiction sample. However, some of the papers in MICUSP have fewer than 1000 words (and there are only 828 papers total), so the chunks are larger (about 1300 words) and not precisely the same length for each text.
Be aware of a few things about these files:
- All numbers and punctuation have been stripped out and all words are in lower case.
- Some of the original files came from OCRed scans, so there are a few errors (e.g., fused words like consideredherself). I’ve tried to reduce these as much as possible, but if you want to eliminate them entirely, you’ll have to comb through the files yourself.
- If you do a keyword comparison of the academic and fiction samples, you’ll notice a number of single letters or letter combinations. These come from variables and formulas in the academic papers. As I noted above, the numbers have been stripped out, so many of these items are just the skeletal remains of those textual features.
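The normalization described in the first bullet can be approximated like this. The exact stripping rule used for these files isn’t shown, so the regex below is an assumption; note how it also produces the single-letter residue from formulas mentioned in the last bullet:

```python
import re

def normalize(text):
    """Lowercase the text and strip digits and punctuation, keeping
    only letters and whitespace (an assumed version of the rule)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]+", " ", text)   # digits/punctuation -> space
    return re.sub(r"\s+", " ", text).strip() # collapse runs of whitespace

normalize("She had 3 sisters, perhaps 4.")  # -> "she had sisters perhaps"
normalize("p < 0.05")                       # -> "p"
```

The second example shows why a stripped formula survives only as a stray letter in the keyword lists.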
I hope you find them useful. Enjoy.