[SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker

Sun Jun 16 12:13:00 PDT 2013

On 6/15/2013 12:39 PM, Sander Maijers wrote:
> In the case of an LM created with '-skip', what is the meaning of the 
> values past "\end\"?
>
> They are of the form:
>
> a-team 0.5
> a-teens 0.5
> a-test 0.5

These are the skip probabilities estimated by the model. 0.5 is the 
default initial value, but after EM estimation each word has its own 
probability of being skipped in the computation of conditional 
probabilities. With the above values you would get

P(w | a b "a-team") = 0.5 P'(w | a b) + 0.5 P'(w | a b "a-team")

and so on for all words.  Here P' is the probability as determined by a 
standard n-gram LM.
Note:  "a-team" is the word right before the word being predicted (w).
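
As a rough illustration of the mixture above, here is a minimal Python 
sketch. It is not SRILM code. The function name and the two P' values are 
made up for the example, and it assumes the skip probability weights the 
estimate in which the preceding word is dropped (with 0.5 the two 
weights are the same either way).

```python
def skip_interpolate(p_skipped, p_full, skip_prob):
    """Mix two standard n-gram estimates.

    p_skipped: P'(w | context with the preceding word dropped)
    p_full:    P'(w | full context)
    skip_prob: skip probability of the word right before w
    """
    if not 0.0 <= skip_prob <= 1.0:
        raise ValueError("skip probability must lie in [0, 1]")
    return skip_prob * p_skipped + (1.0 - skip_prob) * p_full


# Hypothetical values for P'(w | a b) and P'(w | a b "a-team").
p_without_prev = 0.02
p_with_prev = 0.10

# Skip probability as listed after the "\end\" marker in the question.
skip = {"a-team": 0.5}

p = skip_interpolate(p_without_prev, p_with_prev, skip["a-team"])
print(p)
```

A skip probability of 0 reduces this to the standard n-gram estimate, 
and a value of 1 always ignores the preceding word.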

>
>
> I do not understand their relation to these 'ngram-count' parameters:
>
> -init-lm lmfile
>     Load an LM to initialize the parameters of the skip-N-gram.
As it says, you can start the estimation process with a preexisting set 
of parameters, read from a model file "lmfile".

> -skip-init value
>     The initial skip probability for all words.
Alternatively, you can initialize all skip probabilities to the same 
fixed value.
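
The two ways of initializing can be pictured with a small Python sketch. 
This is only an illustration, not what ngram-count does internally: the 
helper names are hypothetical, the skip section is assumed to be the 
"word probability" lines after "\end\" as in the question, and falling 
back to the fixed value for words missing from the file is my own 
assumption.

```python
def read_skip_probs(lines):
    """Collect "word probability" pairs that follow the "\\end\\" marker."""
    probs = {}
    past_end = False
    for line in lines:
        fields = line.split()
        if not past_end:
            # Everything up to and including the marker is the n-gram part.
            past_end = fields == ["\\end\\"]
            continue
        if len(fields) == 2:
            probs[fields[0]] = float(fields[1])
    return probs


def init_skip_probs(vocab, skip_init=0.5, init_lines=None):
    """Start from values in an existing model if given, else from skip_init."""
    loaded = read_skip_probs(init_lines) if init_lines is not None else {}
    return {word: loaded.get(word, skip_init) for word in vocab}


# Hypothetical tail of a model file, with one re-estimated value.
example = ["\\data\\", "ngram 1=3", "\\end\\", "a-team 0.3", "a-teens 0.5"]

vocab = ["a-team", "a-teens", "a-test"]
print(init_skip_probs(vocab, skip_init=0.5, init_lines=example))
print(init_skip_probs(vocab, skip_init=0.25))
```
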
> -em-iters n
>     The maximum number of EM iterations.
> -em-delta d
>     The convergence criterion for EM: if the relative change in log 
> likelihood falls below the given value, iteration stops.
These are just the standard stopping controls for an EM-type algorithm: 
a cap on the number of iterations and a convergence threshold.
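
For illustration, the stopping rule described by those two options could 
look roughly like the following Python sketch. The function run_em and 
the log-likelihood values are hypothetical. The real estimation step is 
replaced by a callable that just returns the next log likelihood.

```python
def run_em(step, em_iters, em_delta):
    """Run step() until the relative change in log likelihood falls below
    em_delta, or em_iters iterations have been done.

    Returns the number of iterations actually performed.
    """
    previous = None
    for iteration in range(1, em_iters + 1):
        log_likelihood = step()
        if previous is not None and previous != 0.0:
            relative_change = abs((log_likelihood - previous) / previous)
            if relative_change < em_delta:
                return iteration
        previous = log_likelihood
    return em_iters


# Hypothetical log likelihoods from successive EM passes.
trace = iter([-1000.0, -900.0, -890.0, -889.9, -889.89])

iterations_used = run_em(lambda: next(trace), em_iters=100, em_delta=0.001)
print(iterations_used)
```

With a small iteration cap the loop stops at the cap even if the 
likelihood is still changing, which is why both options exist.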

Andreas