[SRILM User List] Adding n-grams to an existing LM
Joris Pelemans
Joris.Pelemans at esat.kuleuven.be
Sun Nov 3 01:43:55 PST 2013
On 11/03/13 02:35, Andreas Stolcke wrote:
> On 11/2/2013 7:46 AM, Joris Pelemans wrote:
>> On 11/02/13 02:07, Andreas Stolcke wrote:
>>>
>>> For example, if you have p(c | a b) = x and c and d are synonyms, you set
>>>
>>> p(c | a b) = x/2
>>> p(d | a b) = x/2
>>
>> Another question regarding this problem. Say I don't know a good
>> synonym for d, but I still want to include it by mapping it onto
>> <unk> (what else, right?), obviously giving it only a very small
>> fraction of the <unk> probability, since <unk> stands for a whole
>> class of words. The above technique would lead to gigantic LMs,
>> since <unk> appears all over the place. Is there a smart way in the
>> SRILM toolkit to specify that some words should be modeled as <unk>?
>
> I'm not sure I understand what you mean. <unk> is a special word
> that all words not in the vocabulary are mapped to at test time. So
> the way you 'model' a word by <unk> is to not include it in the
> vocabulary of your LM.
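(For reference, the open-vocabulary setup Andreas describes is the one
you get by training with an explicit vocabulary plus the -unk option,
e.g.

  ngram-count -order 3 -text train.txt -vocab vocab.txt -unk -lm lm.arpa
  ngram -order 3 -lm lm.arpa -ppl test.txt -unk

where train.txt, vocab.txt, lm.arpa and test.txt are placeholder names:
any word not listed in vocab.txt is mapped to <unk>, both in training
and at test time.)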
I am investigating different techniques to introduce new words into the
vocabulary. Say I have a vocabulary of 100,000 words and I want to
introduce 1 new word X (for the sake of simplicity). I see 3 options:
1. use the contexts in which X appears in some training data (but X may
not appear in that data, or not often enough)
2. estimate the probability of X by taking a fraction of the prob mass
of a synonym of X (the technique I described earlier; see the sketch
below)
3. estimate the probability of X by taking a fraction of the prob mass
of the <unk> class (if e.g. no good synonym is at hand)
I could then compare the perplexities of these 3 LMs, each with a
vocabulary of 100,001 words, to see which technique is best for a given
word/situation.
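Since options 2 and 3 call for the same edit to an ARPA-format LM, here
is a minimal, untested Python sketch of it, under these assumptions: the
LM is a well-formed plain-text ARPA file, and the split is done per
entry, i.e. every n-gram that predicts the donor word (a synonym, or
<unk>) keeps (1 - frac) of its probability while a parallel entry
predicting the new word gets frac of it. The script name, split_mass()
and all file/word names are made up for illustration. ARPA files store
log10 probabilities, so taking a fraction is an addition of log10(frac).

#!/usr/bin/env python3
"""Sketch: add a word to an ARPA LM by splitting off part of a donor
word's probability mass (frac=0.5 for a synonym, small frac for <unk>)."""
import math
import sys

def split_mass(lines, donor, new_word, frac):
    out, counts, order = [], {}, 0
    for line in lines:
        line = line.rstrip("\n")
        stripped = line.strip()
        if stripped.startswith("ngram ") and "=" in stripped:
            continue                     # header counts: rebuilt below
        if stripped.startswith("\\") and stripped.endswith("-grams:"):
            order = int(stripped[1:stripped.index("-")])  # "\2-grams:" -> 2
            out.append(line)
            continue
        fields = line.split()
        # An n-gram line: log10 prob, n words, optional backoff weight.
        if order and len(fields) in (order + 1, order + 2):
            counts[order] = counts.get(order, 0) + 1
            words = fields[1:order + 1]
            if words[-1] == donor:
                logp = float(fields[0])
                bow = "\t" + fields[-1] if len(fields) == order + 2 else ""
                keep = logp + math.log10(1.0 - frac)   # donor keeps 1-frac
                give = logp + math.log10(frac)         # new word gets frac
                out.append("%.6f\t%s%s" % (keep, " ".join(words), bow))
                out.append("%.6f\t%s%s"
                           % (give, " ".join(words[:-1] + [new_word]), bow))
                counts[order] += 1
                continue
        out.append(line)
    # Rebuild the "ngram N=..." lines in the \data\ section.
    final = []
    for line in out:
        final.append(line)
        if line.strip() == "\\data\\":
            for n in sorted(counts):
                final.append("ngram %d=%d" % (n, counts[n]))
    return final

if __name__ == "__main__":
    # e.g.  python split_mass.py lm.arpa poodle X 0.5     (option 2)
    #       python split_mass.py lm.arpa '<unk>' X 0.01   (option 3)
    path, donor, new_word, frac = sys.argv[1:5]
    with open(path) as f:
        print("\n".join(split_mass(f, donor, new_word, float(frac))))

Two properties worth noting: splitting within a context leaves that
context's total explicit mass unchanged, so the existing backoff weights
should stay consistent, and the backoff weights on the copied entries
are simply reused, assuming X behaves like the donor in following
contexts. The rewrite adds one entry per donor-final n-gram, which for
a single word X is harmless, but doing this for many words with <unk>
as donor is exactly the size blow-up I worried about above. A round
trip through ngram -lm new.arpa -write-lm checked.arpa would be a
sensible sanity check.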
Joris