[SRILM User List] How to model unseen words without N_1

Joris Pelemans Joris.Pelemans at esat.kuleuven.be
Mon Jun 17 01:03:44 PDT 2013


Hello,

I am trying to build a unigram model over only the 400k most frequent 
words (this is essential) out of a training set of 4M tokens. The 
language model has to be open, i.e. include the <unk> tag, because I 
want to assign probabilities to unseen words. However, I don't want the 
probability of <unk> to be based on the leftover tokens (the 4M tokens 
minus those covered by the 400k-word vocabulary), because then <unk> 
would get far too much probability mass (there is a lot of data that I 
deliberately leave out of my LM). I simply want to ignore those other 
words and build a <unk> model based on the Good-Turing intuition of 
count-of-counts. However, since I limit the vocabulary to the 400k most 
frequent words, my training data contains no words with a frequency of 
1 (i.e. N_1 = 0).

How should I go about building this language model?

Thanks in advance,

Joris Pelemans

