[SRILM User List] How to model unseen words without N_1
    Joris Pelemans
    Joris.Pelemans at esat.kuleuven.be
    Mon Jun 17 01:03:44 PDT 2013

Hello,
I am trying to build a unigram model over only the 400k most frequent 
words (this is essential) from a training set of 4M tokens. The 
language model has to be open, i.e. include the <unk> tag, because I 
want to assign probabilities to unseen words. However, I don't want 
the probability of <unk> to be estimated from the leftover tokens (the 
4M minus those covered by the 400k-word vocabulary), because then 
<unk> would get far too much probability mass (there is a lot of data 
that I do not include in my LM). I simply want to ignore the other 
words and model <unk> based on the Good-Turing intuition of 
count-of-counts. The problem is that every word frequent enough to 
make the 400k cutoff occurs more than once, so the retained training 
data contains no words with a frequency of 1 (i.e. N_1 = 0), and the 
usual Good-Turing estimate of the unseen mass, N_1 / N, is zero. The 
toy sketch below illustrates this.
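To make that concrete, here is a toy Python sketch (the corpus and the 
cutoff K are made up, scaled-down stand-ins for my 4M tokens and 
400k-word vocabulary):

from collections import Counter

# Toy stand-in for the 4M-token training set.
tokens = "a a a b b b c c d d e f".split()

counts = Counter(tokens)
total = sum(counts.values())

# Good-Turing intuition: the total probability mass reserved for
# unseen words is N_1 / N, where N_1 is the number of word types
# seen exactly once and N is the total token count.
n1 = sum(1 for c in counts.values() if c == 1)
print("full vocab:  N_1 =", n1, " P(unseen) ~", n1 / total)

# Keep only the K most frequent types (K = 4 here, 400k in my case).
K = 4
kept = dict(counts.most_common(K))
kept_total = sum(kept.values())
n1_kept = sum(1 for c in kept.values() if c == 1)
# All surviving types occur more than once, so N_1 = 0 and the
# Good-Turing estimate of the unseen mass collapses to zero.
print("top-K vocab: N_1 =", n1_kept, " P(unseen) ~", n1_kept / kept_total)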
How should I go about building this language model?
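For reference, the straightforward recipe I know of would be something 
like the following (file names made up):

ngram-count -order 1 -text train.txt -vocab top400k.txt -unk -lm uni.lm

but, as far as I understand, with -vocab and -unk every 
out-of-vocabulary training token is counted as an instance of <unk>, 
which gives <unk> exactly the inflated probability mass I described 
above.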
Thanks in advance,
Joris Pelemans
    
    