-gt1min

Andreas Stolcke stolcke at speech.sri.com
Wed Nov 1 09:27:32 PST 2006


In message <45484C95.4030401 at web.de> you wrote:
> Andreas Stolcke wrote:
> > In message <45475E03.4040105 at web.de> you wrote:
> >> 	Hi Andreas,
> >>
> >> ngram-count effectively ignores the -gt1min option, i.e. the cutoff
> >> value for unigrams. Is that the desired behavior?
> > 
> > How do you reach this conclusion?
> > 
> > Andreas 
> > 
> > 
> e.g.,
> ngram-count -order 1 -gt1min 1 -text <text> -lm lm1
> ngram-count -order 1 -gt1min 5 -text <text> -lm lm5
> both produce the same list of unigrams (same length); only the logprobs
> change. I would have expected unigrams below gt1min to be pruned (as
> are ngrams of higher order) and hence the list in lm5 to be shorter...
> 
> Ronny
> 
> -- 
> ------------------------------------
> Ronny Melz
> IfI, NLP Dept, University of Leipzig
> Augustusplatz 10/11
> 04109 Leipzig, Germany
> ------------------------------------
> 

Ronny,

The fact that all words appear in the unigrams does not mean that -gt1min
doesn't work.  For historical reasons the unigram list also serves the
purpose of listing the vocabulary of the LM.  Therefore SRILM always
includes all words in the unigrams.  However, those words that are excluded
by -gt1min get a probability that corresponds to the zero-order backoff
probability.  Zero-order backoff probabilities are obtained by distributing
the probability mass left over from unigram discounting over all
words.  That is why lm1 and lm5 list the same unigrams and differ only in
their logprobs.
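
As a rough illustration of that redistribution, here is a minimal Python
sketch.  It is not SRILM's code: the function name is made up, the only
"discounting" it models is the count cutoff itself (no Good-Turing or
other smoothing), and it assumes the leftover mass is spread uniformly
over the whole vocabulary, as described above.  The exact numbers that
ngram-count produces will differ.

```python
from collections import Counter

def unigram_probs_with_cutoff(tokens, min_count):
    """Toy unigram estimate with a count cutoff (hypothetical helper).

    Words seen fewer than min_count times stay in the vocabulary but
    lose their own probability mass; the leftover mass is then shared
    equally among all vocabulary words (an assumed simplification).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = sorted(counts)

    # Relative frequencies of the words that survive the cutoff.
    kept = {w: c / total for w, c in counts.items() if c >= min_count}

    # Mass freed up by the words that fell below the cutoff.
    leftover = 1.0 - sum(kept.values())
    share = leftover / len(vocab)

    # Every word is listed; excluded words receive only the shared part.
    return {w: kept.get(w, 0.0) + share for w in vocab}

if __name__ == "__main__":
    text = "a a a a b b c".split()
    for cutoff in (1, 2):
        print(cutoff, unigram_probs_with_cutoff(text, cutoff))
```

In both runs the same three words are listed; only their probabilities
change with the cutoff, which mirrors the lm1/lm5 observation.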

If you want to exclude certain words from the LM altogether, use the
-vocab option.
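
One way to get the pruning effect Ronny expected is to build the
vocabulary file yourself from the word counts and pass it with -vocab.
The sketch below is only an assumption about how one might do that: the
helper name and the file name in the comment are hypothetical, and it
relies on a vocabulary file listing one word per line.

```python
from collections import Counter

def vocab_lines(tokens, min_count):
    """Return vocabulary file contents: one word per line, keeping
    only words seen at least min_count times (hypothetical helper)."""
    counts = Counter(tokens)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return "".join(w + "\n" for w in words)

if __name__ == "__main__":
    tokens = "a a a a b b c".split()
    # Write the result to a file (e.g. words.vocab, a made-up name) and
    # pass that file to ngram-count via -vocab.
    print(vocab_lines(tokens, 2), end="")
```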

Andreas 




