[SRILM User List] Adding n-grams to an existing LM
Joris Pelemans
Joris.Pelemans at esat.kuleuven.be
Mon Nov 4 01:01:26 PST 2013
On 11/04/13 01:01, Andreas Stolcke wrote:
> On 11/3/2013 1:43 AM, Joris Pelemans wrote:
>> I am investigating different techniques to introduce new words to the
>> vocabulary. Say I have a vocabulary of 100,000 words and I want to
>> introduce 1 new word X (for the sake of simplicity). I could do one
>> of 3 options:
>>
>> 1. use the contexts in which X appears in some training data (though
>> X may not appear at all, or not often enough)
>> 2. estimate the probability of X by taking a fraction of the prob
>> mass of a synonym of X (which I described earlier)
>> 3. estimate the probability of X by taking a fraction of the prob
>> mass of the <unk> class (if e.g. no good synonym is at hand)
>>
>> I could then compare the perplexities of these 3 LMs with a
>> vocabulary of size 100,001 words to see which technique is best for a
>> given word/situation.
>>
> And option 3 is effectively already implemented by the way unseen
> words are mapped to <unk>. If you want to compute perplexity in a
> fair way you would take the LM containing <unk> and for every
> occurrence of X you add log p(X | <unk>) (the share of
> unk-probability mass you want to give to X). That way you don't need
> to add any ngrams to the LM. What this effectively does is simulate a
> class-based Ngram model where <unk> is a class and X one of its members.
Yes, this is exactly what I meant when I asked for a "smart way in the
SRILM toolkit", so I assume this is included. I looked up how to use
class-based models and I think I found what I need to do. Is the
following the correct way to calculate perplexity for these models?
ngram -lm class_lm.arpa -ppl test.txt -order n -classes expansions.class
where expansions.class contains lines like this:
<unk> p(X | <unk>) X
<unk> p(Y | <unk>) Y
<unk> 1-p(X | <unk>)-p(Y | <unk>) not_mapped
I assume the last line is necessary since the man page for
"classes-format" says "All expansion probabilities for a given class
should sum to one, although this is not necessarily enforced by the
software and would lead to improper models."
Joris