[SRILM User List] Adding n-grams to an existing LM
Andreas Stolcke
stolcke at icsi.berkeley.edu
Sun Nov 3 16:01:40 PST 2013
On 11/3/2013 1:43 AM, Joris Pelemans wrote:
> On 11/03/13 02:35, Andreas Stolcke wrote:
>> On 11/2/2013 7:46 AM, Joris Pelemans wrote:
>>> On 11/02/13 02:07, Andreas Stolcke wrote:
>>>>
>>>> For example, if you have p(c | a b) = x and d and c are synonyms,
>>>> you set
>>>>
>>>> p(c | a b) = x/2
>>>> p(d | a b) = x/2
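(To make that splitting concrete, here is an untested Python sketch. It only touches the highest-order trigram section of an ARPA-format file, the file names are placeholders, and a real script would also have to fix the "ngram N=" counts in the header and add lower-order entries for the new word so the result is still a valid model.)

import math

LOG10_HALF = math.log10(0.5)   # adding this to a log10 prob halves it

def split_mass(arpa_in, arpa_out, old_word, new_word):
    """Halve every trigram ending in old_word and add a copy for new_word."""
    with open(arpa_in) as fin, open(arpa_out, "w") as fout:
        in_trigrams = False
        for line in fin:
            stripped = line.strip()
            if stripped.startswith("\\"):        # section marker, e.g. \3-grams:
                in_trigrams = (stripped == "\\3-grams:")
                fout.write(line)
                continue
            fields = stripped.split()
            # highest-order entries have no backoff weight: logprob w1 w2 w3
            if in_trigrams and len(fields) == 4 and fields[-1] == old_word:
                logprob = float(fields[0]) + LOG10_HALF
                context = " ".join(fields[1:3])
                fout.write("%.6f\t%s %s\n" % (logprob, context, old_word))
                fout.write("%.6f\t%s %s\n" % (logprob, context, new_word))
            else:
                fout.write(line)

# placeholder file names:
# split_mass("lm.arpa", "lm_plus_d.arpa", "c", "d")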
>>>
>>> Another question with regard to this problem. Say I don't know a
>>> good synonym for d, but I still want to include it by mapping it
>>> onto <unk> (what else, right?), obviously with only a very small
>>> fraction of the <unk> probability, since <unk> is a class. The
>>> above technique would lead to gigantic LMs, since <unk> is all over
>>> the place. Is there a smart way in the SRILM toolkit to specify
>>> that some words should be modeled as <unk>?
>>
>> I'm not sure I understand what you mean. <unk> is a special word
>> that all words not in the vocabulary are mapped to at test time. So
>> the way you 'model' a word by <unk> is to not include it in the
>> vocabulary of your LM.
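(In other words, at evaluation time something like the following toy sketch happens; this is only an illustration, not SRILM internals.)

vocab = {"<s>", "</s>", "the", "cat", "sat", "<unk>"}   # toy vocabulary
test_tokens = ["the", "flurble", "sat"]

# any token outside the LM vocabulary is scored as <unk>
mapped = [w if w in vocab else "<unk>" for w in test_tokens]
print(mapped)   # ['the', '<unk>', 'sat']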
> I am investigating different techniques to introduce new words to the
> vocabulary. Say I have a vocabulary of 100,000 words and I want to
> introduce 1 new word X (for the sake of simplicity). I could take one
> of 3 options:
>
> 1. use the contexts in which X appears in some training data (but X
> may not appear often enough, or at all)
> 2. estimate the probability of X by taking a fraction of the prob
> mass of a synonym of X (which I described earlier)
> 3. estimate the probability of X by taking a fraction of the prob
> mass of the <unk> class (if e.g. no good synonym is at hand)
>
> I could then compare the perplexities of these 3 LMs with a vocabulary
> of size 100,001 words to see which technique is best for a given
> word/situation.
>
And option 3 is effectively already implemented by the way unseen words
are mapped to <unk>. If you want to compute perplexity in a fair way,
you would take the LM containing <unk> and, for every occurrence of X,
add log p(X | <unk>) (the share of the <unk> probability mass you want
to give to X). That way you don't need to add any n-grams to the LM.
What this effectively does is simulate a class-based N-gram model where
<unk> is a class and X is one of its members.
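If it helps, here is an untested sketch of that adjustment with made-up numbers. The total log probability and token count would come from the ppl output of the <unk>-LM (e.g. the logprob= figure printed by ngram -ppl), and log p(X | <unk>) is whatever share of the <unk> mass you decide to give to X.

import math

def adjusted_ppl(total_logprob, num_tokens, count_x, log_p_x_given_unk):
    """Perplexity after adding log p(X | <unk>) once per occurrence of X."""
    adjusted = total_logprob + count_x * log_p_x_given_unk
    return 10.0 ** (-adjusted / num_tokens)

# made-up numbers: 10,000 scored tokens, X occurs 37 times,
# X gets 1/1000 of the <unk> mass, i.e. log10(0.001) = -3
print(adjusted_ppl(-21000.0, 10000, 37, math.log10(1e-3)))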
Andreas