[SRILM User List] 1-count Higher order ngrams not excluded by gtmin
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Oct 1 21:20:15 PDT 2013
On 9/28/2013 12:21 AM, Mohammed Mediani wrote:
> Dear Andreas,
> I noticed that when I train a 6-gram KN LM, I get some 1-count ngrams
> among the 4-grams and 3-grams which are not prefixes of any
> higher-order ngrams. Are those another exception besides the one stated in Warning4
> (http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html)?
>
SRILM always includes unigrams for all words in the LM vocabulary. This
compensates for two limitations of the ARPA format. First, the format
has no separate way to define the LM vocabulary, so the vocabulary is
implicitly defined by the unigram list. Second, there is no way to
specify a backoff to "zero-grams" (a uniform distribution), so unigram
probabilities for all words (whether observed in the training set or
not) are given explicitly.
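
To illustrate the point about the vocabulary being implied by the unigram
list, here is a minimal sketch in Python. It is not part of SRILM: the
function name arpa_vocabulary and the toy model (with made-up
probabilities) are assumptions for illustration only. It simply collects
the word column of the \1-grams: section, which is the only place an ARPA
file says which words the LM knows.

```python
def arpa_vocabulary(lines):
    """Return the words listed in the \\1-grams: section of an ARPA LM.

    Hypothetical helper, not an SRILM API. Each unigram line has the form
    log10_prob <whitespace> word [<whitespace> backoff_weight].
    """
    vocab = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams or not line:
            continue
        if line.startswith("\\"):
            # Reached the next section (\2-grams: or \end\).
            break
        fields = line.split()
        vocab.append(fields[1])  # second column is the word itself
    return vocab


# Toy ARPA file; the probabilities are invented for illustration.
toy_arpa = (
    "\\data\\\n"
    "ngram 1=4\n"
    "ngram 2=1\n"
    "\n"
    "\\1-grams:\n"
    "-0.60206\t</s>\n"
    "-99\t<s>\t-0.30103\n"
    "-0.60206\thello\t-0.30103\n"
    "-0.60206\tworld\n"
    "\n"
    "\\2-grams:\n"
    "-0.30103\thello world\n"
    "\n"
    "\\end\\\n"
)

print(arpa_vocabulary(toy_arpa.splitlines()))
```

In this toy model "world" has no backoff weight and is not the prefix of
any bigram, yet it still has to appear as a unigram, otherwise it would
not be in the vocabulary at all.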
Andreas