[SRILM User List] Adding n-grams to an existing LM
Andreas Stolcke
stolcke at icsi.berkeley.edu
Mon Nov 4 09:16:25 PST 2013
On 11/4/2013 1:01 AM, Joris Pelemans wrote:
> On 11/04/13 01:01, Andreas Stolcke wrote:
>> On 11/3/2013 1:43 AM, Joris Pelemans wrote:
>>> I am investigating different techniques to introduce new words to
>>> the vocabulary. Say I have a vocabulary of 100,000 words and I want
>>> to introduce 1 new word X (for the sake of simplicity). I could do
>>> one of 3 options:
>>>
>>> 1. use the contexts in which X appears in some training data (though
>>> X may not appear often enough, or at all)
>>> 2. estimate the probability of X by taking a fraction of the prob
>>> mass of a synonym of X (which I described earlier)
>>> 3. estimate the probability of X by taking a fraction of the prob
>>> mass of the <unk> class (if e.g. no good synonym is at hand)
>>>
>>> I could then compare the perplexities of these 3 LMs with a
>>> vocabulary of size 100,001 words to see which technique is best for
>>> a given word/situation.
>>>
>> And option 3 is effectively already implemented by the way unseen
>> words are mapped to <unk>. If you want to compute perplexity in a
>> fair way you would take the LM containing <unk> and for every
>> occurrence of X you add log p(X | <unk>) (the share of
>> unk-probability mass you want to give to X). That way you don't need
>> to add any ngrams to the LM. What this effectively does is simulate
>> a class-based Ngram model where <unk> is a class and X one of its
>> members.
> Yes, this is exactly what I meant when I asked for a "smart way in the
> SRILM toolkit", so I assume this is included. I looked up how to use
> class-based models and I think I found what I need to do. Is the
> following the correct way to calculate perplexity for these models?
>
> ngram -lm class_lm.arpa -ppl test.txt -order n -classes expansions.class
>
> where expansions.class contains lines like this:
>
> <unk> p(X | <unk>) X
> <unk> p(Y | <unk>) Y
> <unk> 1-p(X | <unk>)-p(Y | <unk>) not_mapped
Yes, except that you have to use a new class symbol, such as UNKWORD,
as the class name instead of <unk>, and replace "not_mapped" with the
standard <unk> word.
Andreas
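
A minimal sketch of the two ideas in this thread, for illustration only:
writing the class-definition lines with a separate class symbol, and
adding log p(X | <unk>) to an existing log probability for each
occurrence of X. The function names, the share values 0.01 and 0.02, and
the way the token count is passed in are all assumptions made for this
example; they are not part of SRILM. The LM given to ngram would
presumably also need to contain UNKWORD where it previously had <unk>,
which the thread does not spell out.

```python
import math


def class_file_lines(shares, class_name="UNKWORD", unk="<unk>"):
    """Build class-definition lines of the form 'CLASS prob word'.

    shares maps each new word to its assumed share p(word | <unk>).
    The remaining probability mass stays with the standard <unk> word.
    """
    total = sum(shares.values())
    if any(p <= 0 for p in shares.values()) or total >= 1.0:
        raise ValueError("shares must be positive and sum to less than 1")
    lines = [f"{class_name} {p:.6g} {w}" for w, p in shares.items()]
    lines.append(f"{class_name} {1.0 - total:.6g} {unk}")
    return lines


def adjusted_perplexity(base_logprob10, scored_tokens, new_word_counts, shares):
    """Perplexity after charging log10 p(word | <unk>) per new-word token.

    base_logprob10 : total log10 probability from the LM that maps the
                     new words to <unk> (assumed to be known already)
    scored_tokens  : number of tokens the total is normalised by
                     (assumption: the caller supplies the right count)
    new_word_counts: how often each new word occurs in the test data
    """
    extra = sum(n * math.log10(shares[w]) for w, n in new_word_counts.items())
    return 10 ** (-(base_logprob10 + extra) / scored_tokens)


# Hypothetical shares, chosen only to have something to print.
shares = {"X": 0.01, "Y": 0.02}
for line in class_file_lines(shares):
    print(line)
```

Each occurrence of a new word lowers the total log probability, so the
adjusted perplexity can only be equal to or higher than the one computed
with the words left as <unk>.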