[SRILM User List] 1-count Higher order ngrams not excluded by gtmin

Andreas Stolcke stolcke at icsi.berkeley.edu
Tue Oct 1 21:20:15 PDT 2013


On 9/28/2013 12:21 AM, Mohammed Mediani wrote:
> Dear Andreas,
> I noticed that when I train a 6-gram KN LM, I get some 1-count n-grams 
> in the 4-gram and 3-gram models that are not prefixes of any 
> higher-order n-grams. Are those another exception besides the one stated 
> in Warning4 
> (http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html)? 
>
SRILM always includes unigrams for all words in the LM vocabulary. This 
compensates for two limitations of the ARPA format. First, the format has 
no separate way to declare the LM vocabulary, so the vocabulary is 
implicitly defined by the unigram list. Second, there is no way to 
specify a backoff to "zero-grams" (a uniform distribution), so unigram 
probabilities for all words (whether observed in the training set or 
not) are given explicitly.
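
To illustrate the point about the vocabulary being implied by the unigram
list, here is a minimal Python sketch. It is not part of SRILM: the function
name arpa_vocabulary and the tiny sample model are made up for illustration,
and it assumes the usual ARPA layout (a "\1-grams:" section whose lines hold
a log10 probability, the word, and an optional backoff weight).

```python
def arpa_vocabulary(lines):
    """Return the words listed in the unigram section of an ARPA-format LM."""
    vocab = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams or not line:
            continue
        if line.startswith("\\"):
            # Reached the next section (e.g. \2-grams:) or \end\.
            break
        fields = line.split()
        if len(fields) >= 2:
            # fields[0] is the log10 probability, fields[1] is the word,
            # fields[2] (if present) is the backoff weight.
            vocab.append(fields[1])
    return vocab


# Hypothetical toy model: "unseen" has a unigram entry even though it
# appears in no higher-order n-gram.
SAMPLE = """\
\\data\\
ngram 1=4
ngram 2=1

\\1-grams:
-0.60206\t</s>
-99\t<s>\t-0.30103
-0.60206\thello\t-0.30103
-0.90309\tunseen

\\2-grams:
-0.30103\t<s> hello

\\end\\
"""

if __name__ == "__main__":
    print(arpa_vocabulary(SAMPLE.splitlines()))
```

In this sketch the vocabulary is recovered purely from the unigram entries,
which is why a word must have a unigram line to count as part of the model
at all.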

Andreas


