[SRILM User List] How does the option "-gtmin" work in ngram-count?

Tue Apr 10 16:41:15 PDT 2012

On 4/10/2012 12:21 AM, bulusheva wrote:
> Hi, I have two questions:
>
> 1. If I generate the language model with Kneser-Ney smoothing (or 
> Modified Kneser-Ney), why does the parameter "-gtnmin" apply to the 
> already modified counts?
>
>     For example, if in the training data the 2-gram "markov model" occurs
>     only in the context "hidden markov model" and gt2min = 2, then the
>     modified count for "markov model" = n(* markov model) = 1 < gt2min,
>     and
>     prob("markov model") = bow("markov") * prob("model")
>     instead of
>     prob("markov model") = (n(* markov model) - D) / n(* markov *).
>
That's how it is currently implemented.   It is debatable how the 
minimum count should be applied in the case of the lower-order 
distributions in KN models.
The way it currently works is natural from an implementation 
perspective,  because the lower-order counts are physically modified 
before applying the discounting (you can examine them by adding -write 
COUNTS).

But you are raising a good point.  It might make more sense to have the 
-gtXmin values be interpreted independently of the discounting method.
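
To make the example from the question concrete, here is a minimal Python 
sketch of the effect. It is not SRILM code: the function names and the toy 
trigram data are made up for illustration, and it only mimics the order of 
operations described above (replace the lower-order counts with KN 
continuation counts first, then apply the minimum count).

```python
from collections import Counter, defaultdict

def raw_bigram_counts(trigrams):
    """Count how often each (w2, w3) bigram occurs inside the trigrams."""
    return Counter((w2, w3) for _, w2, w3 in trigrams)

def continuation_counts(trigrams):
    """KN-style modified count n(* w2 w3): number of distinct left contexts."""
    left_contexts = defaultdict(set)
    for w1, w2, w3 in trigrams:
        left_contexts[(w2, w3)].add(w1)
    return {bigram: len(ctx) for bigram, ctx in left_contexts.items()}

def apply_min_count(counts, min_count):
    """Drop n-grams whose count is below the threshold (stand-in for gt2min)."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

# Hypothetical toy data: "markov model" only ever follows "hidden".
trigrams = (
    [("hidden", "markov", "model")] * 3
    + [("a", "language", "model"), ("the", "language", "model")]
)

raw = raw_bigram_counts(trigrams)
modified = continuation_counts(trigrams)

kept_if_raw = apply_min_count(raw, 2)            # threshold on raw counts
kept_if_modified = apply_min_count(modified, 2)  # threshold on KN counts

print(("markov", "model") in kept_if_raw)        # raw count is 3
print(("markov", "model") in kept_if_modified)   # modified count is 1
```

With the threshold applied to the modified counts, "markov model" is 
dropped even though it was seen three times, which is the behavior the 
question describes.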

>
>     2. Let's say I use ngram-count to generate the language model as
>     follows:
>     ngram-count -text text.txt -vocab vocab.txt -gt1min 5 -lm sri.lm
>     Suppose the word "hello" exists in "vocab.txt" and occurs 4 times in
>     "text.txt". Then the probability of "hello" is calculated as the
>     probability of a zeroton. Is that correct?
>
That is correct, but the ARPA format doesn't allow you to prune 
unigrams, so the unigrams will always appear explicitly listed in the 
LM, even if their probabilities might be obtained by backing off to a 
uniform distribution.

Andreas
