[SRILM User List] How to model unseen words without N_1
Andreas Stolcke
stolcke at icsi.berkeley.edu
Mon Jun 17 10:16:42 PDT 2013
On 6/17/2013 1:03 AM, Joris Pelemans wrote:
> Hello,
>
> I am trying to build a unigram model with only the 400k most frequent
> words (this is essential) out of a training set of 4M tokens. The
> language model has to be open, i.e. include the <unk> tag, because I
> want to assign probabilities to unseen words. However, I don't want it
> to base the probability of <unk> on the part of the 4M tokens that
> falls outside the 400k-word vocabulary, because then <unk> would get
> way too much probability mass (since there is a lot of data that I do
> not include in my LM). I simply want to ignore the other words and
> build an <unk> model based on the Good-Turing intuition of
> count-of-counts. However, since I restrict the vocabulary to the 400k
> most frequent words, my training data does not contain any words with
> a frequency of 1 (i.e. N_1 = 0).
>
> How should I go about building this language model?
To work around the missing N_1 count when estimating the GT parameters,
run ngram-count twice. First, run it without the vocabulary restriction
and save the GT parameters to a file (with -gt1 FILE and no -lm option).
Second, run ngram-count again with the -vocab option, -lm, and -gt1
FILE; this time the smoothing parameters are read from FILE. (The
make-big-lm wrapper script automates this two-step process.)
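For concreteness, here is a minimal sketch of the two runs for a unigram
model; the file names (train.txt, top400k.vocab, gt1.params, unigram.lm)
are just placeholders:

    # Run 1: no vocabulary restriction and no -lm, so the Good-Turing
    # parameters are estimated on the full data and saved to gt1.params
    ngram-count -text train.txt -order 1 -gt1 gt1.params

    # Run 2: build the LM with the restricted vocabulary; -gt1 now
    # reads the saved smoothing parameters instead of re-estimating them
    ngram-count -text train.txt -order 1 -vocab top400k.vocab \
        -gt1 gt1.params -lm unigram.lm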
I don't have a good solution for setting the <unk> unigram probability
directly based on GT smoothing. I would recommend one of two
practical solutions:
1) Replace rare words in your training data with <unk> before running
ngram-count (this also gives you ngrams that predict unseen words); a
sketch follows after this list.
2) Interpolate your LM with an LM containing only <unk> and optimize
the interpolation weight on a held-out set; see the second sketch below.
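A rough sketch of option 1, where the cutoff (here: words seen only
once become <unk>) and the file names are choices you would make
yourself:

    # Get unigram counts, then rewrite the training text, mapping
    # every word that occurs only once to <unk>
    ngram-count -text train.txt -order 1 -write counts.1grams
    awk 'NR==FNR { if ($2 > 1) keep[$1] = 1; next }
         { for (i = 1; i <= NF; i++) $i = ($i in keep) ? $i : "<unk>"; print }' \
        counts.1grams train.txt > train.unk.txt

train.unk.txt then takes the place of train.txt in the ngram-count runs
above.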
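And a rough sketch of option 2, assuming main.lm is your 400k-word LM
(containing <unk>), unk.lm is a tiny hand-written unigram LM containing
only <unk> plus the sentence tags, and heldout.txt is your tuning data:

    # Try a few interpolation weights on held-out text and keep the one
    # with the lowest perplexity (-lambda weights the first LM, -lm)
    for l in 0.999 0.99 0.95 0.9; do
        echo "lambda=$l"
        ngram -lm main.lm -mix-lm unk.lm -lambda $l -unk -ppl heldout.txt
    done

    # Write out the interpolated model with the chosen weight
    ngram -lm main.lm -mix-lm unk.lm -lambda 0.99 -unk -write-lm mixed.lm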
Of course you can always edit the LM file to insert <unk> with whatever
probability you want (and possibly use ngram -renorm to renormalize the
model).
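For that manual route, the renormalization step would look like this
(edited.lm is a placeholder name, and -4.5 is just an example log10
probability):

    # After hand-editing a line such as "-4.5  <unk>" into the
    # \1-grams: section of the ARPA file, renormalize and rewrite it
    ngram -lm edited.lm -renorm -write-lm edited.renorm.lm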
Andreas