Open-vocabulary LM

Andreas Stolcke stolcke at
Tue Feb 25 09:02:59 PST 2003


Yes, this is possible if there are no unknown words in your data, or
if you didn't specify a vocabulary file (because then all words are
added to the vocabulary implicitly, so nothing gets mapped to <unk>).
It is also possible that you set ngram cutoffs such that all ngrams
involving <unk> fall below the cutoffs and are therefore excluded
from the LM.
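
As a rough illustration of both effects, here is a toy Python sketch. It
is not SRILM code: the function names, the sample sentences and the cutoff
values are all made up for the example, and the cutoff rule (keep an ngram
if its count reaches a per-order minimum) is a simplification.

```python
from collections import Counter

UNK = "<unk>"

def count_ngrams(sentences, order, vocab=None):
    """Count ngrams of length 1..order, mapping out-of-vocabulary words to <unk>.

    With vocab=None every word counts as in-vocabulary, which mimics the
    case where no vocabulary file is given.
    """
    counts = Counter()
    for sent in sentences:
        words = [w if vocab is None or w in vocab else UNK for w in sent.split()]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def apply_cutoffs(counts, min_counts):
    """Keep ngrams whose count is at least the minimum for their order (toy rule)."""
    return {ng: c for ng, c in counts.items() if c >= min_counts.get(len(ng), 1)}

def unk_by_order(counts):
    """Number of distinct ngrams containing <unk>, keyed by ngram order."""
    result = Counter()
    for ng in counts:
        if UNK in ng:
            result[len(ng)] += 1
    return dict(result)

# Hypothetical training data; "emu" is the only word outside the vocabulary.
data = ["the cat sat", "the dog sat", "the emu sat"]
vocab = {"the", "cat", "dog", "sat"}

no_vocab = count_ngrams(data, 3)            # no vocabulary: nothing maps to <unk>
with_vocab = count_ngrams(data, 3, vocab)   # "emu" becomes <unk>
pruned = apply_cutoffs(with_vocab, {1: 1, 2: 2, 3: 2})

print(unk_by_order(no_vocab))    # {}
print(unk_by_order(with_vocab))  # {1: 1, 2: 2, 3: 1}
print(unk_by_order(pruned))      # {1: 1}
```

In the last case <unk> survives only as a unigram, because every bigram
and trigram containing it occurs just once and falls below the toy cutoff
of 2.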

To understand what's going on run ngram-count with


(in addition to the other options you use) and check what ngrams are
generated containing <unk>.
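
If the counts end up as text lines, a small script can do the checking.
The sketch below assumes each line holds the ngram's words followed by an
integer count, all separated by whitespace; that layout, the function name
and the sample lines are assumptions for illustration, so adjust them to
what your output actually looks like.

```python
from collections import defaultdict

def unk_ngrams(lines, unk="<unk>"):
    """Group ngrams containing `unk` by order.

    Assumes each line is "w1 w2 ... wn count" separated by whitespace,
    with an integer count as the last field.
    """
    found = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        *words, count = fields
        if unk in words:
            found[len(words)].append((" ".join(words), int(count)))
    return dict(found)

# Hypothetical sample lines, not real ngram-count output.
sample = [
    "the\t3",
    "<unk>\t1",
    "the <unk>\t1",
    "the cat\t1",
    "the <unk> sat\t1",
]

for order, ngrams in sorted(unk_ngrams(sample).items()):
    print(order, ngrams)
```

If an order is missing from the result, no ngram of that length containing
<unk> was counted at all, which points to the vocabulary rather than the
cutoffs.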


In message <3E5B960C.6010704 at> you wrote:
> Hi,
> Is it normal that in an open-vocabulary LM (built with the "-unk" 
> option) the <unk> token is present as unigram, but not in bigrams and 
> trigrams?
> (Sorry if this is a silly question, but I am not so familiar with 
> language models, and I was told that it would not be the case with other 
> toolkits).
> Thanks again,
> Amélie
> -- 
> --------------------------------------------------------------------
> Amélie DELTOUR
> ENSIMAG / Universität Karlsruhe
> E-mail : amelie.deltour at
> --------------------------------------------------------------------
