Tuesday, June 4, 2013

Corpora

This 280KB text file at Leeds University lists the 15,000 most common words from a corpus of Japanese text, for a vague value of 'words' (some odd symbols and punctuation marks sit quite high up the list).  Might be useful.  It'd be mildly interesting to run some statistics on the range of kanji used in the list.  Many more corpora for various languages here.
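
For the kanji crunching, something like the following rough sketch might be a starting point. It makes a few assumptions that are mine, not anything stated about the file: that the list is plain text with one entry per line, that "kanji" can be approximated by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and that the helper names `is_kanji` and `kanji_counts` and the sample lines are purely illustrative.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: treat only the basic CJK Unified Ideographs block as kanji.
    # Rarer characters in the extension blocks are ignored by this sketch.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_counts(lines):
    # Count how many times each kanji appears anywhere in the given lines.
    # Rank numbers, frequencies, kana and punctuation are simply skipped.
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if is_kanji(ch))
    return counts


# Made-up sample lines standing in for the real word list.
sample = ["1 日本 500", "2 日 450", "3 本当 300", "4 、 200", "5 する 100"]
counts = kanji_counts(sample)

print("distinct kanji:", len(counts))
print("most frequent:", counts.most_common(3))
```

To try it on the real list, the sample could be swapped for an open file handle, e.g. `open(path, encoding="utf-8")` with `path` pointing at a downloaded copy; the encoding is a guess and may need changing. Note that this counts kanji occurrences across list entries, not weighted by each word's frequency in the corpus.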