[SRILM User List] How to model unseen words without N_1
Andreas Stolcke
stolcke at icsi.berkeley.edu
Mon Jun 17 10:16:42 PDT 2013
On 6/17/2013 1:03 AM, Joris Pelemans wrote:
> Hello,
>
> I am trying to build a unigram model with only the 400k most frequent
> words (this is essential) out of a training set of 4M tokens. The
> language model has to be open, i.e. include the <unk> tag, because I
> want to assign probabilities to unseen words. However, I don't want it
> to base the probability of <unk> on the part of the 4M tokens that
> falls outside the 400k-word vocabulary, because then <unk> would get
> way too much probability mass (since there is a lot of data that I do
> not include in my LM). I simply want to ignore the other words and
> build an <unk> model based on the Good-Turing intuition of
> count-of-counts. However, since I restrict the vocabulary to the 400k
> most frequent words, my training data does not contain any words with
> a frequency of 1 (i.e. N_1 = 0).
>
> How should I go about building this language model?
To work around the missing N_1 count when estimating the GT parameters,
run ngram-count twice. First, run it without the vocabulary restriction
and save the GT parameters to a file (with -gt1 FILE and no -lm option).
Second, run ngram-count again with the -vocab option, -lm, and -gt1
FILE; this time the smoothing parameters are read from FILE. (The
make-big-lm wrapper script automates this two-step process.)
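For concreteness, here is a minimal sketch of the two runs for a unigram
model; the file names (train.txt, top400k.vocab, gt1.params, unigram.lm)
are just placeholders:

    # Run 1: no vocabulary restriction and no -lm, so the Good-Turing
    # parameters are estimated on the full data and saved to gt1.params
    ngram-count -text train.txt -order 1 -gt1 gt1.params

    # Run 2: build the LM with the restricted vocabulary; -gt1 now
    # reads the saved smoothing parameters instead of re-estimating them
    ngram-count -text train.txt -order 1 -vocab top400k.vocab \
        -gt1 gt1.params -lm unigram.lm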
I don't have a good solution for setting the <unk> unigram probability
directly based on GT smoothing. I would recommend one of two
practical solutions:
1) Replace rare words in your training data with <unk> before running
ngram-count (this also gives you ngrams that predict unseen words); a
sketch follows after this list.
2) Interpolate your LM with an LM containing only <unk> and optimize
the interpolation weight on a held-out set; see the second sketch below.
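A rough sketch of option 1, where the cutoff (here: words seen only
once become <unk>) and the file names are choices you would make
yourself:

    # Get unigram counts, then rewrite the training text, mapping
    # every word that occurs only once to <unk>
    ngram-count -text train.txt -order 1 -write counts.1grams
    awk 'NR==FNR { if ($2 > 1) keep[$1] = 1; next }
         { for (i = 1; i <= NF; i++) $i = ($i in keep) ? $i : "<unk>"; print }' \
        counts.1grams train.txt > train.unk.txt

train.unk.txt then takes the place of train.txt in the ngram-count runs
above.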
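And a rough sketch of option 2, assuming main.lm is your 400k-word LM
(containing <unk>), unk.lm is a tiny hand-written unigram LM containing
only <unk> plus the sentence tags, and heldout.txt is your tuning data:

    # Try a few interpolation weights on held-out text and keep the one
    # with the lowest perplexity (-lambda weights the first LM, -lm)
    for l in 0.999 0.99 0.95 0.9; do
        echo "lambda=$l"
        ngram -lm main.lm -mix-lm unk.lm -lambda $l -unk -ppl heldout.txt
    done

    # Write out the interpolated model with the chosen weight
    ngram -lm main.lm -mix-lm unk.lm -lambda 0.99 -unk -write-lm mixed.lm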
Of course you can always edit the LM file to insert <unk> with whatever
probability you want (and possibly use ngram -renorm to renormalize the
model).
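For that manual route, the renormalization step would look like this
(edited.lm is a placeholder name, and -4.5 is just an example log10
probability):

    # After hand-editing a line such as "-4.5  <unk>" into the
    # \1-grams: section of the ARPA file, renormalize and rewrite it
    ngram -lm edited.lm -renorm -write-lm edited.renorm.lm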
Andreas