[SRILM User List] class based model

Mon Jan 6 11:31:42 PST 2014

On 1/6/2014 7:45 AM, DUGAST Loic wrote:
> Hi
>
> In the FAQ 
> (http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html)
>
> You advise to ...
>
> c)
>     Lower the minimum counts for N-grams included in the LM, i.e., the
>     values of the options *-gt2min*, *-gt3min*, *-gt4min*, etc. The
>     higher order N-grams typically get higher minimum counts. 
>
>
> Do you not mean: *raise* the minimum counts (...) instead?
>

You are correct.  It should say "raise the minimum counts".  We'll fix the 
documentation ASAP.

> Also, I am not sure I understand why gt2min should be set higher than 
> gt1min, etc.
> Higher-order ngrams are naturally less frequent. Therefore the same 
> cutoff value (gt2min equal to gt1min) will be harsher on bigrams than 
> on unigrams... Can you explain?

The minimum counts are a crude way to trade off model performance for 
space, and since there are a lot more long ngrams than short ngrams you 
get more space savings with higher-order ngrams.  It is typically not 
worth it to eliminate unigrams and bigrams, but it is a decent tradeoff 
to remove singleton trigrams and fourgrams.  The default values were 
chosen based on historical practice (I think they might have even been 
inherited from the CMU LM toolkit).
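
To make the space argument concrete, here is a minimal sketch in plain 
Python (not SRILM code).  It assumes a tiny invented corpus and treats a 
minimum count of 2 as "drop every ngram seen fewer than 2 times"; the 
function names are made up for illustration.  It only shows that the 
number of distinct ngrams, and the number removed by the same cutoff, 
grows with the ngram order.

```python
from collections import Counter


def ngram_counts(tokens, order):
    """Count all ngrams of the given order in a list of tokens."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


def surviving(counts, min_count):
    """Number of distinct ngrams whose count is at least min_count."""
    return sum(1 for c in counts.values() if c >= min_count)


# Toy corpus, invented purely for illustration.
tokens = "the cat sat on the mat and the cat ate the rat on the mat".split()

for order in (1, 2, 3):
    counts = ngram_counts(tokens, order)
    total = len(counts)
    kept = surviving(counts, 2)  # same cutoff applied at every order
    print(f"order {order}: {total} distinct, {kept} kept, {total - kept} dropped")
```

On real training data the effect is far stronger than on this toy 
example, because most distinct trigrams and fourgrams are singletons.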

The better and more principled way to remove ngrams is entropy-based 
pruning (the -prune option of ngram and ngram-count).  So the best 
strategy given limited memory is to make the gtmin values as low as you 
can afford to fit into memory, then use -prune (you can do this in the 
same invocation of ngram-count or make-big-lm).
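
As a rough sketch of what such a single invocation could look like, the 
Python helper below only assembles an ngram-count command line and 
prints it.  The file names, the helper name, and the threshold values 
are placeholders assumed for illustration, not recommendations; pick 
the gtmin and -prune values that suit your data and memory budget.

```python
import shlex


def ngram_count_command(text, lm, order=3, gt3min=1, prune=1e-8):
    """Build an ngram-count argument list combining a count cutoff and -prune.

    All defaults here are illustrative placeholders, not SRILM defaults.
    """
    return [
        "ngram-count",
        "-order", str(order),
        "-text", text,            # training text, one sentence per line
        "-gt3min", str(gt3min),   # minimum count for trigrams to be kept
        "-prune", str(prune),     # entropy-based pruning threshold
        "-lm", lm,                # output language model file
    ]


# Hypothetical file names.
cmd = ngram_count_command("train.txt", "pruned.lm")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a 
machine where SRILM is installed and on the PATH.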

Andreas
