[SRILM User List] Adding n-grams to an existing LM

Sat Nov 2 18:35:07 PDT 2013

On 11/2/2013 7:46 AM, Joris Pelemans wrote:
> On 11/02/13 02:07, Andreas Stolcke wrote:
>>
>> For example, if you have p(c | a b) = x and d and c are synonyms, you set
>>
>> p(c | a b) = x/2
>> p(d | a b) = x/2
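
A minimal sketch of that split in Python, assuming the probabilities are held as log10 values the way ARPA-format LM files store them. The function name and the share_d parameter are made up for illustration and are not part of SRILM; share_d = 0.5 gives the x/2 split above.

```python
import math

def split_between_synonyms(logprob_c, share_d=0.5):
    """Split log10 p(c | a b) between c and a new synonym d.

    Hypothetical helper: share_d is the fraction of the original
    probability mass handed to d; the remainder stays with c.
    """
    if not 0.0 < share_d < 1.0:
        raise ValueError("share_d must be strictly between 0 and 1")
    p = 10.0 ** logprob_c                       # back to a linear probability
    new_logprob_c = math.log10(p * (1.0 - share_d))
    new_logprob_d = math.log10(p * share_d)
    return new_logprob_c, new_logprob_d

# Example: p(c | a b) = 0.2, split evenly between c and d.
log_c, log_d = split_between_synonyms(math.log10(0.2))
```

The two new entries sum to the original x, so the distribution for the context "a b" stays normalized.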
>
> Another question with regard to this problem. Say, I don't know a 
> good synonym for d, but I still want to include it by mapping it onto 
> <unk> (what else, right?), obviously by a very small fraction of the 
> <unk> probability, since it's a class. The above technique would lead 
> to gigantic LMs, since <unk> is all over the place. Is there a smart 
> way in the SRILM toolkit that lets you specify that some words should 
> be modeled as <unk>?

I'm not sure I understand what you mean. <unk> is a special word that
all words not in the vocabulary are mapped to at test time. So the way
you 'model' a word by <unk> is to not include it in the vocabulary of
your LM.
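
The mapping itself can be sketched in a few lines of Python. This is only an illustration of the idea, not SRILM code; the function name and the toy vocabulary are assumptions.

```python
UNK = "<unk>"

def map_oov_to_unk(tokens, vocab):
    """Replace every token that is not in the vocabulary with <unk>."""
    return [tok if tok in vocab else UNK for tok in tokens]

# Toy vocabulary that deliberately leaves out the word "d".
vocab = {"a", "b", "c"}
mapped = map_oov_to_unk("a b d".split(), vocab)
```

Here "d" is absent from the vocabulary, so it is scored with whatever probability the LM assigns to <unk> in the context "a b".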

Andreas