[SRILM User List] Adding n-grams to an existing LM
Andreas Stolcke
stolcke at icsi.berkeley.edu
Sun Nov 3 16:01:40 PST 2013
On 11/3/2013 1:43 AM, Joris Pelemans wrote:
> On 11/03/13 02:35, Andreas Stolcke wrote:
>> On 11/2/2013 7:46 AM, Joris Pelemans wrote:
>>> On 11/02/13 02:07, Andreas Stolcke wrote:
>>>>
>>>> For example, if you have p(c | a b) = x and d and c are synonyms,
>>>> you set
>>>>
>>>> p(c | a b) = x/2
>>>> p(d | a b) = x/2
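(To make that splitting concrete, here is an untested Python sketch. It only touches the highest-order trigram section of an ARPA-format file, the file names are placeholders, and a real script would also have to fix the "ngram N=" counts in the header and add lower-order entries for the new word so the result is still a valid model.)

import math

LOG10_HALF = math.log10(0.5)   # adding this to a log10 prob halves it

def split_mass(arpa_in, arpa_out, old_word, new_word):
    """Halve every trigram ending in old_word and add a copy for new_word."""
    with open(arpa_in) as fin, open(arpa_out, "w") as fout:
        in_trigrams = False
        for line in fin:
            stripped = line.strip()
            if stripped.startswith("\\"):        # section marker, e.g. \3-grams:
                in_trigrams = (stripped == "\\3-grams:")
                fout.write(line)
                continue
            fields = stripped.split()
            # highest-order entries have no backoff weight: logprob w1 w2 w3
            if in_trigrams and len(fields) == 4 and fields[-1] == old_word:
                logprob = float(fields[0]) + LOG10_HALF
                context = " ".join(fields[1:3])
                fout.write("%.6f\t%s %s\n" % (logprob, context, old_word))
                fout.write("%.6f\t%s %s\n" % (logprob, context, new_word))
            else:
                fout.write(line)

# placeholder file names:
# split_mass("lm.arpa", "lm_plus_d.arpa", "c", "d")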
>>>
>>> Another question with regard to this problem. Say I don't know a
>>> good synonym for d, but I still want to include it by mapping it
>>> onto <unk> (what else, right?), obviously with only a very small
>>> fraction of the <unk> probability, since <unk> is a class. The
>>> above technique would lead to gigantic LMs, since <unk> is all over
>>> the place. Is there a smart way in the SRILM toolkit to specify
>>> that some words should be modeled as <unk>?
>>
>> I'm not sure I understand what you mean. <unk> is a special word
>> that all words not in the vocabulary are mapped to at test time. So
>> the way you 'model' a word by <unk> is to not include it in the
>> vocabulary of your LM.
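(In other words, at evaluation time something like the following toy sketch happens; this is only an illustration, not SRILM internals.)

vocab = {"<s>", "</s>", "the", "cat", "sat", "<unk>"}   # toy vocabulary
test_tokens = ["the", "flurble", "sat"]

# any token outside the LM vocabulary is scored as <unk>
mapped = [w if w in vocab else "<unk>" for w in test_tokens]
print(mapped)   # ['the', '<unk>', 'sat']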
> I am investigating different techniques to introduce new words to the
> vocabulary. Say I have a vocabulary of 100,000 words and I want to
> introduce 1 new word X (for the sake of simplicity). I could take one
> of 3 options:
>
> 1. use the contexts in which X appears in some training data (but X
> may not appear often enough, or at all)
> 2. estimate the probability of X by taking a fraction of the prob
> mass of a synonym of X (which I described earlier)
> 3. estimate the probability of X by taking a fraction of the prob
> mass of the <unk> class (if e.g. no good synonym is at hand)
>
> I could then compare the perplexities of these 3 LMs with a vocabulary
> of size 100,001 words to see which technique is best for a given
> word/situation.
>
And option 3 is effectively already implemented by the way unseen words
are mapped to <unk>. If you want to compute perplexity in a fair way,
you would take the LM containing <unk> and, for every occurrence of X,
add log p(X | <unk>) (the share of the <unk> probability mass you want
to give to X). That way you don't need to add any n-grams to the LM.
What this effectively does is simulate a class-based N-gram model where
<unk> is a class and X is one of its members.
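If it helps, here is an untested sketch of that adjustment with made-up numbers. The total log probability and token count would come from the ppl output of the <unk>-LM (e.g. the logprob= figure printed by ngram -ppl), and log p(X | <unk>) is whatever share of the <unk> mass you decide to give to X.

import math

def adjusted_ppl(total_logprob, num_tokens, count_x, log_p_x_given_unk):
    """Perplexity after adding log p(X | <unk>) once per occurrence of X."""
    adjusted = total_logprob + count_x * log_p_x_given_unk
    return 10.0 ** (-adjusted / num_tokens)

# made-up numbers: 10,000 scored tokens, X occurs 37 times,
# X gets 1/1000 of the <unk> mass, i.e. log10(0.001) = -3
print(adjusted_ppl(-21000.0, 10000, 37, math.log10(1e-3)))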
Andreas