Here are some sources of linguistic data you can access for your project:

(You might also consult the University of Helsinki’s extensive and filterable corpus inventory here.)

Corpora with web-based interfaces:

  • The BYU family of corpora (link). A wide variety of monitor and specialized corpora, including some non-English ones. (POS tagged and allows for collocational analysis.)
  • The Corpus Eye (link) from the University of Southern Denmark. A variety of corpora, including emails and television news transcriptions. (POS tagged.)
  • IntelliText (link) from the University of Leeds. A variety of English and non-English corpora. (POS tagged and allows for keyword comparisons.)
  • MICASE (link) from the University of Michigan. A corpus of spoken academic English. (Allows for speaker attribute searches.)
  • MICUSP (link) from the University of Michigan. A corpus of student academic writing. (Allows for text attribute searches.)
  • Old Bailey Online (link). A corpus of proceedings from London's central criminal court from 1674 to 1913. (Allows for a variety of metadata searches.)
  • RCPCE (link) from the Hong Kong Polytechnic University. A variety of profession-specific corpora including academic writing. (Allows for collocational analysis.)
  • SCOTS (link) from the University of Glasgow. A corpus of written and spoken Scottish. (Allows for speaker and text attribute searches.)

Downloadable corpora:

  • ACL (link) from the National University of Singapore. A corpus of publications in computational linguistics.
  • AusNC (link). A number of spoken, written, and historical corpora of Australian English.
  • BASE (link) from the University of Warwick. A corpus of spoken academic British English.
  • CLMET (link) from the University of Leuven. A corpus of Late Modern English.
  • FRED (link) from the University of Freiburg. An English dialect corpus. (A million-word sample is available by permission.)
  • ICE GB (link) and ICE Nigeria (link). Corpora from the International Corpus of English project. (A description of how these were put together is here.)
  • SBCSAE (link) from the University of California, Santa Barbara. A corpus of spoken English.
  • VOICE (link) from the University of Vienna. A corpus of spoken ELF. (Download link at the bottom of the page.)