[SRILM User List] Adding n-grams to an existing LM
Andreas Stolcke
stolcke at icsi.berkeley.edu
Fri Nov 1 18:07:00 PDT 2013
On 11/2/2013 8:00 AM, Joris Pelemans wrote:
> Hello,
>
> I have an existing 5-gram LM with KN discounting and I would like to
> add new words to it. To estimate reasonable n-gram probabilities for a
> new word, I am now using (a fraction of) the probabilities of a
> synonym of the word. I am simply replacing every occurrence of the
> synonym with the new word, copying the logprob (or slightly altering
> it in case of a fraction) and alpha and adding the new line to the LM.
> Obviously the resulting n-gram is no longer normalized. I thought I
> would be able to fix this relatively easily with:
>
> ngram -lm src.arpa -order 5 -renorm -write-lm dest.arpa
>
> but I get a lot of errors of the type "BOW numerator for context is
> ... < 0" and "BOW denominator for context is ... <= 0".
The BOW for a given context is computed as 1 minus the sum of all
higher-order probabilities (in that context), divided by 1 minus the sum
of the lower-order (backoff) probabilities for those same ngrams. So, if
you're adding ngrams to a context, those sums can exceed 1, and you end
up with negative numerators and/or denominators.
The ngram -renorm option only recomputes the backoff weights to achieve
normalization; it does not modify the explicitly given ngram
probabilities.
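To make the arithmetic concrete, here is a minimal sketch of the two sums that go into the BOW for one context. It is not SRILM code; the function name bow_terms and the probability values are made up for illustration, and the inputs are plain probabilities rather than the log10 values stored in an ARPA file.

```python
def bow_terms(higher_probs, lower_probs):
    """Return (numerator, denominator) of the backoff weight for one context.

    higher_probs: p(w | context) for every ngram listed explicitly in the context
    lower_probs:  p(w | shortened context) for those same words
    (Hypothetical helper for illustration only, not part of SRILM.)
    """
    numerator = 1.0 - sum(higher_probs)
    denominator = 1.0 - sum(lower_probs)
    return numerator, denominator


# A context with two explicit ngrams: both terms are positive.
num, den = bow_terms([0.5, 0.3], [0.2, 0.1])

# The same context after copying the 0.5 entry for a new synonym:
# the higher-order sum is now 1.3, so the numerator drops below zero.
num_bad, den_bad = bow_terms([0.5, 0.3, 0.5], [0.2, 0.1, 0.2])

print(num, den)
print(num_bad, den_bad)
```

Since -renorm leaves the explicit probabilities alone, it has no way to repair a context whose probabilities already sum to more than 1.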
>
> What do these errors mean, can I ignore them or is there a better way
> to renormalize my new LMs?
I think you should split the existing ngram probabilities among all the
synonyms, when the synonym occurs in the final position of the ngram.
That would not add anything to the sums of probabilities involved in the
BOW computation.
For example, if you have p(c | a b) = x, and c and d are synonyms, you set
p(c | a b) = x/2
p(d | a b) = x/2
If, however, the synonyms occur in the context portion of the ngram, you
can just copy the parameter (as you have been doing).
p(e | a d) = p(e | a c)
Then, use -renorm to recompute the BOWs.
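The two rules above can be sketched as follows. This is only an illustration under some assumptions: the ngrams of one order are held in a dict mapping word tuples to log10 probabilities (as in an ARPA file), the function name add_synonym is made up, and ARPA reading/writing and backoff weights are left out since -renorm recomputes the latter. If the old word occurs more than once in a context, only the fully substituted context is generated.

```python
import math


def add_synonym(ngrams, old, new):
    """Return a copy of `ngrams` extended with `new` as a synonym of `old`.

    ngrams: dict mapping word tuples to log10 probabilities.
    (Hypothetical helper for illustration only, not part of SRILM.)
    """
    out = {}
    log_two = math.log10(2.0)
    for words, logprob in ngrams.items():
        context, last = words[:-1], words[-1]
        if last == old:
            # Final position: split the probability, x -> x/2 for each synonym.
            targets = [(old, logprob - log_two), (new, logprob - log_two)]
        else:
            targets = [(last, logprob)]
        # Context position: copy the parameter into the substituted context.
        new_context = tuple(new if w == old else w for w in context)
        for word, lp in targets:
            out[context + (word,)] = lp
            if new_context != context:
                out[new_context + (word,)] = lp
    return out


lm = {
    ("a", "b", "c"): math.log10(0.4),  # c in final position: gets split
    ("a", "c", "e"): math.log10(0.1),  # c in the context: gets copied
}
out = add_synonym(lm, "c", "d")
for words in sorted(out):
    print(" ".join(words), round(out[words], 4))
```

Because the split keeps the total probability in each context unchanged, the sums in the BOW computation stay below 1 and -renorm should then run without the numerator/denominator errors.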
Andreas