[SRILM User List] How to model unseen words without N_1

Anand Venkataraman venkataraman.anand at gmail.com
Mon Jun 17 11:12:50 PDT 2013


What Andreas suggests is probably the best approach. But depending on the exact
application you have in mind, one other option to consider is to simply
pre-process your input corpus and either delete all non-vocabulary words, or
replace them (or runs of them) with a special meta-word of your choice,
e.g. @reject. It may be that there's an option in ngram* to do this
in-process; I must check the docs. Otherwise, a simple pre-processing
filter in awk, perl or python should do the trick.
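
For instance, a quick awk sketch of such a filter (file names here are
placeholders, and the vocabulary file is assumed to have one word per line):

    awk 'NR == FNR { vocab[$1]; next }      # first file: the 400k vocabulary, one word per line
         {                                  # remaining input: the corpus, one sentence per line
             for (i = 1; i <= NF; i++)
                 if (!($i in vocab)) $i = "@reject"   # map out-of-vocabulary tokens to @reject
             print
         }' top400k.vocab corpus.txt > corpus.oov-mapped.txt

Collapsing runs of @reject into a single token, if desired, would take one
more small pass over the output.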

&


On Mon, Jun 17, 2013 at 10:16 AM, Andreas Stolcke <stolcke at icsi.berkeley.edu> wrote:

> On 6/17/2013 1:03 AM, Joris Pelemans wrote:
>
>> Hello,
>>
>> I am trying to build a unigram model with only the 400k most frequent
>> words (this is essential) out of a training set of 4M tokens. The language
>> model has to be open, i.e. include the <unk> tag, because I want to assign
>> probabilities to unseen words. However, I don't want it to base the
>> probability for <unk> on the part of the 4M tokens not covered by the 400k
>> words, because then <unk> would get way too much probability mass (since
>> there is a lot of data that I do not include in my LM). I simply want to
>> ignore the other words and build a <unk> model based on the Good-Turing
>> intuition of count-of-counts. However, since I limit the vocabulary to the
>> 400k most frequent words, my training data does not contain any words with
>> a frequency of 1 (i.e. N_1 = 0).
>>
>> How should I go about building this language model?
>>
>
> To work around the problem of missing N_1 for estimating the GT parameters,
> you should run ngram-count twice. First, run it without any vocabulary
> restriction, saving the GT parameters to a file (with -gt1 FILE and no -lm
> option). Second, run ngram-count again with the -vocab option, -lm, and
> -gt1 FILE; this will read the smoothing parameters from FILE. (The
> make-big-lm wrapper script automates this two-step process.)
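
As a concrete illustration of the two-step recipe above, with placeholder
file names, and with -order 1 and -unk added here because the original
question asks for an open-vocabulary unigram model:

    # Pass 1: no vocabulary restriction, no -lm; just estimate the Good-Turing
    # discounting parameters and save them to a file.
    ngram-count -order 1 -text train.txt -gt1 gt1.params

    # Pass 2: restrict to the 400k vocabulary and build the open-vocabulary LM,
    # reusing the discounting parameters saved in pass 1.
    ngram-count -order 1 -text train.txt -vocab top400k.vocab -unk \
        -gt1 gt1.params -lm unigram.lm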
>
> I don't have a good solution for setting the <unk> unigram probability
> directly based on GT smoothing. I would recommend one of two practical
> solutions:
> 1) Replace rare words in your training data with <unk> ahead of running
> ngram-count (this also gives you ngrams that predict unseen words).
> 2) Interpolate your LM with an LM containing only <unk> and optimize the
> interpolation weight on a held-out set.
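
A rough sketch of option 2 using ngram's static interpolation (file names and
the weight are placeholders, and unk-only.lm stands for a small LM you have
built containing just <unk>):

    # Interpolate the main LM with the <unk>-only LM;
    # -lambda is the weight given to the main (first) model.
    ngram -order 1 -lm main.lm -mix-lm unk-only.lm -lambda 0.95 -write-lm interpolated.lm

    # Evaluate a candidate weight on held-out data.
    ngram -order 1 -lm interpolated.lm -ppl heldout.txt

The weight would then be tuned by repeating this with different -lambda
values and comparing the held-out perplexities.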
>
> Of course you can always edit the LM file to insert <unk> with whatever
> probability you want (and possibly use ngram -renorm to renormalize the
> model).
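
For example, after hand-editing the ARPA file to add or change the <unk>
entry (placeholder names again):

    # Renormalize the model after the manual edit, as suggested above.
    ngram -lm edited.lm -renorm -write-lm renormalized.lm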
>
> Andreas
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
>