[SRILM User List] Adding n-grams to an existing LM

Andreas Stolcke stolcke at icsi.berkeley.edu
Fri Nov 1 18:07:00 PDT 2013


On 11/2/2013 8:00 AM, Joris Pelemans wrote:
> Hello,
>
> I have an existing 5-gram LM with KN discounting and I would like to 
> add new words to it. To estimate reasonable n-gram probabilities for a 
> new word, I am now using (a fraction of) the probabilities of a 
> synonym of the word. I am simply replacing every occurrence of the 
> synonym with the new word, copying the logprob (or slightly altering 
> it in case of a fraction) and alpha and adding the new line to the LM. 
> Obviously the resulting n-gram is no longer normalized. I thought I 
> would be able to fix this relatively easily with:
>
> ngram -lm src.arpa -order 5 -renorm -write-lm dest.arpa
>
> but I get a lot of errors of the type "BOW numerator for context is 
> ... < 0" and "BOW denominator for context is ... <= 0".

The BOW for a given context is computed as (1 - sum of all 
higher-order probabilities in that context), divided by (1 - sum of 
all backoff probabilities for those same ngrams).  So, if you're adding 
ngrams to a context without lowering the existing probabilities, those 
sums can exceed 1, and you end up with negative numerators and/or 
denominators.
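
As a rough illustration of the computation described above, here is a small Python sketch. The toy model is invented for the example (the words, the probabilities, and the helper name backoff_weight are all hypothetical, and this is not SRILM's code); it only shows how copying a probability onto a new word pushes the numerator below zero.

```python
# Toy illustration of the backoff-weight (BOW) formula described above.
# All words and probabilities are invented for this example.

def backoff_weight(higher, lower):
    """Return (numerator, denominator, bow) for one context.

    higher: explicit higher-order probabilities p(w | context)
    lower:  backoff probabilities p(w | shortened context); must cover
            at least the words that appear in `higher`
    """
    numerator = 1.0 - sum(higher.values())
    denominator = 1.0 - sum(lower[w] for w in higher)
    bow = numerator / denominator if denominator > 0 else float("nan")
    return numerator, denominator, bow

# Explicit bigrams p(w | "a") and the unigrams they back off to.
bigrams = {"b": 0.5, "c": 0.4}
unigrams = {"b": 0.3, "c": 0.2, "d": 0.2, "e": 0.3}

# Well-formed model: both sums stay below 1, so the BOW is positive.
num, den, bow = backoff_weight(bigrams, unigrams)

# Copy c's probability onto a new word d without taking any mass from c:
# the higher-order sum becomes 1.3, so the numerator goes negative.
bigrams_bad = dict(bigrams, d=0.4)
num_bad, den_bad, _ = backoff_weight(bigrams_bad, unigrams)

print(num, den, bow)
print(num_bad, den_bad)
```

In this toy case only the numerator goes negative; with enough added mass on the backoff side the denominator can do the same.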

The ngram -renorm option only recomputes the backoff weights to achieve 
normalization; it does not modify the explicitly given ngram 
probabilities.

>
> What do these errors mean, can I ignore them or is there a better way 
> to renormalize my new LMs?

I think you should split the existing ngram probabilities among all the 
synonyms, when the synonym occurs in the final position of the ngram.  
That would not add anything to the sums of probabilities involved in the 
BOW computation.

For example, if you have p(c | a b) = x, and d and c are synonyms, you set

p(c | a b) = x/2
p(d | a b) = x/2
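
Since ARPA files store log10 probabilities, the split above becomes a subtraction in the log domain. The following is a minimal Python sketch of that arithmetic; the function name split_logprob and its share parameter are hypothetical, and the unequal split is only an assumption covering the "fraction" case from the question (share=0.5 gives the x/2 split).

```python
import math

def split_logprob(logprob, share=0.5):
    """Split a log10 probability between an existing word and a new one.

    Returns (logprob kept by the existing word, logprob given to the new
    word). `share` is the fraction of the mass handed to the new word.
    """
    if not 0.0 < share < 1.0:
        raise ValueError("share must be strictly between 0 and 1")
    kept = logprob + math.log10(1.0 - share)
    given = logprob + math.log10(share)
    return kept, given

x = -1.0  # hypothetical log10 p(c | a b), i.e. p = 0.1
log_c, log_d = split_logprob(x)

# The two new probabilities add back up to the original one, so the
# sum of explicit probabilities in the context "a b" is unchanged.
total = 10 ** log_c + 10 ** log_d
print(log_c, log_d, total)
```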

If, however, the synonyms occur in the context portion of the ngram, you 
can just copy the parameter (as you have been doing).

p(e | a c) = p(e | a d)

Then, use -renorm to recompute the BOWs.
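
Putting the two rules together, the procedure might be sketched in Python as follows. This is only an illustration under stated assumptions: the in-memory table (word tuple mapped to a log10 probability and an optional log10 backoff weight) and the name add_synonym are hypothetical, and reading or writing the ARPA file and updating its ngram counts are left out. Backoff weights are copied as placeholders, since -renorm recomputes them.

```python
import itertools
import math

def add_synonym(ngrams, old, new):
    """Return a new n-gram table in which `new` shares the role of `old`.

    ngrams maps a word tuple to (log10 prob, log10 backoff weight or None).
    """
    out = {}
    log_two = math.log10(2.0)
    for words, (logprob, bow) in ngrams.items():
        context, last = words[:-1], words[-1]
        # Context positions: keep `old` and also allow `new` in its place,
        # copying the parameters unchanged.
        options = [(w, new) if w == old else (w,) for w in context]
        # Final position: split the probability evenly between the two words.
        if last == old:
            finals, lp = (old, new), logprob - log_two
        else:
            finals, lp = (last,), logprob
        for ctx in itertools.product(*options):
            for final in finals:
                out[ctx + (final,)] = (lp, bow)
    return out

# Hypothetical toy entries; the values are invented for the example.
lm = {
    ("c",): (-1.0, -0.3),
    ("a", "c"): (-0.5, -0.2),
    ("c", "e"): (-0.7, None),
}
new_lm = add_synonym(lm, "c", "d")
for words in sorted(new_lm):
    print(" ".join(words), new_lm[words])
```

After writing the resulting entries back out in ARPA format, the command from the original message (ngram -lm src.arpa -order 5 -renorm -write-lm dest.arpa) would then recompute the backoff weights.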

Andreas


