[SRILM User List] Adding n-grams to an existing LM

Joris Pelemans Joris.Pelemans at esat.kuleuven.be
Mon Nov 4 01:01:26 PST 2013


On 11/04/13 01:01, Andreas Stolcke wrote:
> On 11/3/2013 1:43 AM, Joris Pelemans wrote:
>> I am investigating different techniques to introduce new words to the 
>> vocabulary. Say I have a vocabulary of 100,000 words and I want to 
>> introduce 1 new word X (for the sake of simplicity). I could do one 
>> of 3 options:
>>
>>  1. use the contexts in which X appears in some training data (but
>>     X may not appear in the data, or not often enough)
>>  2. estimate the probability of X by taking a fraction of the prob
>>     mass of a synonym of X (which I described earlier)
>>  3. estimate the probability of X by taking a fraction of the prob
>>     mass of the <unk> class (if e.g. no good synonym is at hand)
>>
>> I could then compare the perplexities of these 3 LMs with a 
>> vocabulary of size 100,001 words to see which technique is best for a 
>> given word/situation.
>>
> And option 3 is effectively already implemented by the way unseen 
> words are mapped to <unk>.  If you want to compute perplexity in a 
> fair way you would take the LM containing <unk> and for every 
> occurrence of X you add log p(X | <unk>)  (the share of 
> unk-probability mass you want to give to X).  That way you don't need 
> to add any ngrams to the LM.  What this effectively does is simulate a 
> class-based Ngram model where <unk> is a class and X one of its members.
Yes, this is exactly what I meant when I asked for a "smart way in the 
SRILM toolkit", so I assume this functionality is included. I looked up 
how to use class-based models, and I think I found what I need. Is the 
following the correct way to calculate perplexity for these models?

ngram -lm class_lm.arpa -ppl test.txt -order n -classes expansions.class

where expansions.class contains lines like this:

<unk> p(X | <unk>) X
<unk> p(Y | <unk>) Y
<unk> 1-p(X | <unk>)-p(Y | <unk>) not_mapped

I assume the last line is necessary since the man page for 
"classes-format" says "All expansion probabilities for a given class 
should sum to one, although this is not necessarily enforced by the 
software and would lead to improper models."
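Made concrete, the classes file above might look like this (the probability values 0.10 and 0.05 are assumed purely for illustration, as are the file names; the sanity check mirrors the sum-to-one requirement from the man page):

```shell
# Hypothetical concrete classes file: give X 10% and Y 5% of the
# <unk> probability mass, and leave the remainder unmapped.
cat > expansions.class <<'EOF'
<unk> 0.10 X
<unk> 0.05 Y
<unk> 0.85 not_mapped
EOF

# Sanity check: the expansion probabilities for <unk> should sum to one
# (compared with a small tolerance for floating-point rounding).
awk '{s += $2} END {exit (s > 0.999 && s < 1.001 ? 0 : 1)}' expansions.class
```

This file can then be passed via -classes in the ngram command quoted earlier.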

Joris
