[SRILM User List] threshold on maximal counts for LM estimation

Tue Mar 22 10:37:46 PDT 2011

zeeshan khan wrote:
> Thanks a lot, Andreas, for your answers!
> I have another question.
>
> Using the ngram-count tool, is there a way to generate a count file 
> that contains only counts lower than a certain limit?
> For example, if I want to generate a count file that contains only 
> those N-grams that occurred fewer than 50 times in a corpus, how can I 
> do it with ngram-count? Maybe it is very simple to do, but I 
> couldn't find it. Currently, I do it manually, but it is cumbersome and 
> time-consuming.
>
> There is a way to set the maximal count of N-grams of order n 
> that are discounted under Good-Turing, but I couldn't find a way to set 
> a maximal count limit for all N-grams to be considered at all.
There is no way to do it using existing functions in ngram-count.

Even if there were a way to do it with a built-in function you would not 
really gain any efficiency because to know if something occurs more than 
N times you need to keep track of all counts to begin with.  So you're 
not going to be able to do much better than

ngram-count -text .... -write - | gawk '$NF < 50' | gzip > counts-less-than-50.gz
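
The filtering step can be tried on its own, without ngram-count, as a rough sketch. It assumes the count file has one N-gram per line with the count as the last whitespace-separated field, so the test is on the last field (NF). The sample data and the file names sample.counts and sample-less-than-50.counts are made up for illustration, and plain awk stands in for gawk.

```shell
# Hypothetical sample in the assumed count-file layout: N-gram words, a tab, then the count.
printf 'the\t120\nthe cat\t49\ncat sat\t50\nsat down\t3\n' > sample.counts

# Keep only lines whose last field (the count) is below 50.
awk '$NF < 50' sample.counts > sample-less-than-50.counts

cat sample-less-than-50.counts
```

Only "the cat" (49) and "sat down" (3) pass the filter; an N-gram with a count of exactly 50 is dropped because the comparison is strict.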

Andreas

>
> Best Regards,
> Zeeshan.
>
> On Tue, Feb 8, 2011 at 8:26 AM, Andreas Stolcke 
> <stolcke at speech.sri.com> wrote:
>
>     zeeshan khan wrote:
>
>         Hi all,
>         I wanted to share my observation regarding the SRILM toolkit's
>         calculation of perplexities and the effect of  -vocab and
>         -limit-vocab on it, and wanted to know why this happens.
>
>
>         SRILM toolkit's ngram tool gives 3 different perplexities of
>         the SAME text if these options are used as follows.
>         P1: ngram -unk -map-unk '[UNKNOWN]'  -order 4 -lm <LM-FILE>
>         -ppl <TEXT-FILE> : gives the highest perplexity value
>
>         P2: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE> -order
>         4 -lm <LM-FILE> -ppl <TEXT-FILE> : gives a perplexity value
>         lower than P1 and greater than P3.
>
>     That's probably because your <VOCAB-FILE> contains more words than
>     the LM itself.  That means fewer words are mapped to '[UNKNOWN]'
>     and this changes which probabilities are looked up in the LM.  If
>     however your <VOCAB-FILE>  contains a subset of the vocabulary in
>     the LM itself then there should be no change in perplexity.  
>
>
>         P3: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE>
>         -limit-vocab -order 4 -lm <LM-FILE> -ppl <TEXT-FILE> : gives
>         perplexity value smaller than both P1 and P2.
>
>     This has the effect that only ngrams covered by the words in
>     <VOCAB-FILE> are read from the LM.
>     Presumably more words are now mapped to [UNKNOWN], but it's hard
>     to predict what happens to perplexity because you don't say what
>     the relationship between the vocabulary and the data in
>     <TEXT-FILE> is.
>     The purpose of -limit-vocab is to load all and only the portions
>     of the LM that are needed by the input data.  Therefore, to make
>     meaningful use of this option you need to generate the vocabulary
>     from the <TEXT-FILE> in this case.
>
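
One possible way to derive a vocabulary from the text being scored, as a minimal sketch: it assumes whitespace-separated tokens and a one-word-per-line vocabulary file. The file names text.txt and text.vocab and the sample sentences are made up for illustration.

```shell
# Hypothetical stand-in for <TEXT-FILE>: whitespace-separated tokens, one sentence per line.
printf 'the cat sat\nthe dog sat down\n' > text.txt

# Put each token on its own line, then keep one copy of each word.
# LC_ALL=C makes the sort order independent of the local locale.
tr -s '[:space:]' '\n' < text.txt | LC_ALL=C sort -u > text.vocab

cat text.vocab
```

The resulting file would then take the place of <VOCAB-FILE> in the -vocab <VOCAB-FILE> -limit-vocab invocation quoted above.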
>
>         Can anyone tell me why this happens ? I thought the effect of
>         -vocab and -limit-vocab options is only on memory usage.
>
>     A good way to track down the differences is to use -debug 2,
>     capture the output in files, and use diff to see where they differ.
>
>     Andreas
>
>
>
>         Just for information, the VOCAB files are generated from
>         lattice files generated during a recognition process.
>
>
>         Thanks and Regards,
>
>
>         Zeeshan.