[SRILM User List] How to model unseen words without N_1
Joris Pelemans
Joris.Pelemans at esat.kuleuven.be
Mon Jun 17 01:03:44 PDT 2013
Hello,
I am trying to build a unigram model over only the 400k most frequent
words (this is essential) from a training set of 4M tokens. The language
model has to be open, i.e. include the <unk> tag, because I want to
assign probabilities to unseen words. However, I don't want the <unk>
probability to be estimated from the leftover tokens (the 4M-token
corpus minus the occurrences of the 400k in-vocabulary words), because
<unk> would then receive far too much probability mass (there is a lot
of data that I am not including in my LM).

I simply want to ignore the other words and build a <unk> model based
on the Good-Turing intuition of count-of-counts. However, since I limit
the vocabulary to the 400k most frequent words, the resulting counts
contain no words with a frequency of 1 (i.e. N_1 = 0), so the usual
Good-Turing estimate of the unseen mass, P(unseen) = N_1 / N, collapses
to zero.
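To make the problem concrete, here is a toy sketch in plain Python (not
SRILM; "corpus.txt" and the cutoff are just placeholders for my data):

    from collections import Counter

    counts = Counter(open("corpus.txt").read().split())

    N = sum(counts.values())                 # total tokens, ~4M in my case
    vocab = dict(counts.most_common(400000)) # the 400k most frequent words

    # Count-of-counts over the kept words:
    # N_r = number of words seen exactly r times.
    count_of_counts = Counter(vocab.values())
    N1 = count_of_counts[1]   # 0 here: every kept word occurs more than once

    # Good-Turing mass reserved for unseen events:
    print(N1 / N)             # 0.0, which is exactly my problem

So the frequency cutoff that defines my vocabulary is the very thing
that removes the singletons Good-Turing needs.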
How should I go about building this language model?
Thanks in advance,
Joris Pelemans