[SRILM User List] Adding n-grams to an existing LM

Joris Pelemans Joris.Pelemans at esat.kuleuven.be
Sun Nov 3 01:43:55 PST 2013


On 11/03/13 02:35, Andreas Stolcke wrote:
> On 11/2/2013 7:46 AM, Joris Pelemans wrote:
>> On 11/02/13 02:07, Andreas Stolcke wrote:
>>>
>>> For example, if you have p(c | a b) = x and d and c are synonyms, you set
>>>
>>> p(c | a b) = x/2
>>> p(d | a b) = x/2
>>
>> Another question with regard to this problem. Say I don't know a
>> good synonym for d, but I still want to include it by mapping it onto
>> <unk> (what else, right?), giving it only a very small fraction of the
>> <unk> probability, since <unk> is a class. The above technique would
>> lead to gigantic LMs, since <unk> is all over the place. Is there a
>> smart way in the SRILM toolkit to specify that some words should
>> be modeled as <unk>?
>
> I'm not sure I understand what you mean.  <unk>  is a special word 
> that all words not in the vocabulary are mapped to at test time.  So 
> the way you 'model'  a word by <unk> is to not include it in the 
> vocabulary of your LM.
I am investigating different techniques for introducing new words into the
vocabulary. Say I have a vocabulary of 100,000 words and I want to
introduce one new word X (for the sake of simplicity). I could choose one
of three options:

 1. estimate the probability of X from the contexts in which X appears
    in some training data (but X may not appear at all, or not often
    enough)
 2. estimate the probability of X by taking a fraction of the prob mass
    of a synonym of X (as I described earlier)
 3. estimate the probability of X by taking a fraction of the prob mass
    of the <unk> class (e.g. if no good synonym is at hand)
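
To make options 2 and 3 concrete, here is a minimal Python sketch of the
mass-splitting idea. It is not an SRILM feature: the function name
split_mass, the dictionary layout and the fraction parameter are all
made up for illustration. It assumes the n-grams are held as a mapping
from word tuples to log10 probabilities (the scale ARPA files use), it
only touches n-grams in which the source word is the predicted (last)
word, and it ignores backoff weights.

```python
import math


def split_mass(ngram_logprobs, source, new_word, fraction=0.5):
    """Give new_word a share of source's probability mass.

    ngram_logprobs: dict mapping word tuples to log10 probabilities
                    (hypothetical in-memory layout, not an SRILM structure).
    source:         word that donates mass (a synonym, or "<unk>").
    new_word:       word being added to the vocabulary.
    fraction:       share of the mass moved to new_word, strictly in (0, 1).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")

    result = {}
    for ngram, logp in ngram_logprobs.items():
        if ngram[-1] == source:
            # The source word keeps (1 - fraction) of its probability ...
            result[ngram] = logp + math.log10(1.0 - fraction)
            # ... and the new word receives the rest in the same context.
            result[ngram[:-1] + (new_word,)] = logp + math.log10(fraction)
        else:
            result[ngram] = logp
    return result


# p(c | a b) = 0.2 becomes p(c | a b) = 0.1 and p(d | a b) = 0.1
lm = {("a", "b", "c"): math.log10(0.2), ("a", "b", "e"): math.log10(0.1)}
new_lm = split_mass(lm, source="c", new_word="d", fraction=0.5)
```

One entry is added for every n-gram that ends in the source word, which
is why the same procedure applied to <unk> makes the model so much
larger.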

I could then compare the perplexities of these three LMs, each with a
vocabulary of 100,001 words, to see which technique works best for a
given word or situation.
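
The comparison itself could look roughly like the Python sketch below.
It assumes I already have the per-token log10 probabilities that each LM
assigns to the same test text; how those are obtained, and whether
end-of-sentence tokens are counted, is left open here. The names
perplexity and best_lm are hypothetical helpers, not SRILM functions.

```python
def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities: 10 ** (-mean)."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    return 10 ** (-sum(log10_probs) / len(log10_probs))


def best_lm(scores):
    """Return the name of the LM with the lowest perplexity.

    scores: dict mapping an LM name to its list of per-token log10
            probabilities on the same test text (hypothetical layout).
    """
    return min(scores, key=lambda name: perplexity(scores[name]))
```

The comparison is only fair if all three LMs share the same vocabulary
and are scored on the same test text, so that every model is charged for
exactly the same tokens.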

Joris
