[SRILM User List] Adding n-grams to an existing LM

Sat Nov 2 07:46:31 PDT 2013

On 11/02/13 02:07, Andreas Stolcke wrote:
> On 11/2/2013 8:00 AM, Joris Pelemans wrote:
>>
>> What do these errors mean, can I ignore them or is there a better way 
>> to renormalize my new LMs?
>
> I think you should split the existing ngram probabilities among all 
> the synonyms, when the synonym occurs in the final position of the 
> ngram.  That would not add anything to the sums of probabilities 
> involved in the BOW computation.
>
> For example, if you have p(c | a b) = x and d and c are synonyms, you set
>
> p(c | a b) = x/2
> p(d | a b) = x/2
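
To make the arithmetic concrete, here is a minimal Python sketch of that
even split. It assumes the probabilities are stored as log10 values, as in
ARPA-format LM files; the function name and the example value are
hypothetical and not part of SRILM.

```python
import math

def split_among_synonyms(logprob10, n_synonyms):
    """Return the log10 probability each word gets when the original
    probability is divided evenly among n_synonyms words."""
    return logprob10 - math.log10(n_synonyms)

# Hypothetical entry: log10 p(c | a b) = -1.0, i.e. x = 0.1
orig_logprob = -1.0
share = split_among_synonyms(orig_logprob, 2)  # c and d each get x/2

# The two shares add back up to x, so the probability mass seen by the
# backoff weight computation for context "a b" is unchanged.
total = 2 * 10 ** share
```

Since the shares sum to the original x, nothing is added to the sums used
for the BOWs, which is the point of the suggestion above.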

Another question regarding this problem. Say I don't know a good 
synonym for d, but I still want to include it by mapping it onto <unk> 
(what else, right?), giving it only a very small fraction of the <unk> 
probability, since <unk> is a class. The above technique would lead to 
gigantic LMs, since <unk> is all over the place and every ngram ending 
in <unk> would need a copy ending in d. Is there a smart way in the 
SRILM toolkit to specify that some words should be modeled as <unk>?

Regards,

Joris