TOOLBOX: Useful corpora and analytical tools

Spoken English

UK                 US                  Academic
BNC-spoken (10 M)  SBCSAE (250 k)      MICASE (US, 1.7 M)
BNC32 (1 M)        COCA-spoken (90 M)  BASE (UK, 1.6 M)
                                       ADVICe (NZ, 160 k)

Written English

UK                  US                    Academic
BNC-written (90 M)  COCA-written (348 M)  BAWE (UK, 6.5 M)
OEC (2 billion)                           COCA-academic (86 M)
                                          Google Scholar (billions)

International English

ICE (10 available, 1M each)

Historical English

UK                       US            International
CEEC (5 M)               COHA (400 M)  Google Books (155 billion)
Helsinki corpus (1.6 M)

Other languages

Language  Corpus             Characteristics
Czech     SYN (1.3 billion)  synchronic, written corpus, growing
Czech     ORAL2008 (1 M)     informal speech
Slovak    prim-5 (719 M)     writing
Māori     LMC (5 M)          19th-century legal writing

VoiceWalker (simple but effective transcription tool for creation of spoken corpora)
AntConc (free concordance software)
LL calculator (corpus comparison spreadsheet)
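
For readers who prefer a script to a spreadsheet, the following is a rough sketch of the kind of calculation a corpus comparison tool performs. It assumes that LL stands for the log-likelihood statistic commonly used to compare the frequency of one word in two corpora; the function name, parameter names and the counts in the example are hypothetical and are not taken from the LL calculator itself.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood for one word: freq1 hits in corpus 1, freq2 in corpus 2.

    size1 and size2 are the total token counts of the two corpora.
    """
    total_freq = freq1 + freq2
    total_size = size1 + size2

    # Expected frequencies if the word were equally common in both corpora.
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size

    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        # A zero count contributes nothing (0 * ln 0 is treated as 0).
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: a word found 120 times in a 1 M corpus
# and 45 times in a 1.6 M corpus.
score = log_likelihood(120, 1_000_000, 45, 1_600_000)
print(round(score, 2))
```

A higher value indicates a larger difference in relative frequency between the two corpora; a value of 0 means the word is equally frequent in both.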

  (c) Vaclav Brezina 2013