[SRILM User List] threshold on maximal counts for LM estimation
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Mar 22 10:37:46 PDT 2011
zeeshan khan wrote:
> Thanks a lot Andreas for your answers!
> I have another question.
>
> Using the ngram-count tool, is there a way to generate a count file
> that contains only counts lower than a certain limit?
> For example, if I want to generate a count file that contains only
> those N-grams that occurred fewer than 50 times in a corpus, how can I
> do it with ngram-count? Maybe it is very simple to do, but I
> couldn't find it. Currently, I do it manually, but it is cumbersome and
> time-consuming.
>
> There is a way to set the maximal count of N-grams of an order n
> that are discounted under Good-Turing, but I couldn't find a way to set
> a maximal count limit for all N-grams to be considered at all.
There is no way to do it using existing functions in ngram-count.
Even if there were a way to do it with a built-in function you would not
really gain any efficiency because to know if something occurs more than
N times you need to keep track of all counts to begin with. So you're
not going to be able to do much better than
    ngram-count -text .... -write - | gawk '$NF < 50' | gzip > counts-less-than-50.gz
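
To illustrate just the filtering step, here is a minimal sketch that runs without SRILM. It assumes the count file layout written by ngram-count, where each line holds the N-gram words followed by the count as the last field. The sample data and variable names are made up for illustration, and plain awk is used in place of gawk for portability.

```shell
# Made-up sample in the "w1 w2 ...<TAB>count" layout (illustrative data only).
sample_counts=$(printf 'the\t120\nthe cat\t7\ncat sat\t49\nof the\t50\n')

# Keep only the lines whose last field (the count) is below 50.
# $NF is the last field of the line, so this works for any N-gram order.
filtered=$(printf '%s\n' "$sample_counts" | awk '$NF < 50')

printf '%s\n' "$filtered"
```

Because the test is on the last field, unigrams, bigrams, and higher orders can all sit in the same count file and still be filtered correctly.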
Andreas
>
> Best Regards,
> Zeeshan.
>
> On Tue, Feb 8, 2011 at 8:26 AM, Andreas Stolcke
> <stolcke at speech.sri.com <mailto:stolcke at speech.sri.com>> wrote:
>
> zeeshan khan wrote:
>
> Hi all,
> I wanted to share my observation regarding the SRILM toolkit's
> calculation of perplexities and the effect of -vocab and
> -limit-vocab on it, and wanted to know why this happens.
>
>
> SRILM toolkit's ngram tool gives 3 different perplexities of
> the SAME text if these options are used as follows.
> P1: ngram -unk -map-unk '[UNKNOWN]' -order 4 -lm <LM-FILE>
> -ppl <TEXT-FILE> : gives the highest perplexity value
>
> P2: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE> -order
> 4 -lm <LM-FILE> -ppl <TEXT-FILE> : gives perplexity value
> lower than P1 and greater than P3.
>
> That's probably because your <VOCAB-FILE> contains more words than
> the LM itself. That means fewer words are mapped to '[UNKNOWN]'
> and this changes which probabilities are looked up in the LM. If
> however your <VOCAB-FILE> contains a subset of the vocabulary in
> the LM itself then there should be no change in perplexity.
>
>
> P3: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE>
> -limit-vocab -order 4 -lm <LM-FILE> -ppl <TEXT-FILE> : gives
> perplexity value smaller than both P1 and P2.
>
> This has the effect that only ngrams covered by the words in
> <VOCAB-FILE> are read from the LM.
> Presumably more words are now mapped to [UNKNOWN], but it's hard
> to predict what happens to perplexity because you don't say what
> the relationship between the vocabulary and the data in
> <TEXT-FILE> is.
> The purpose of -limit-vocab is to load all and only the portions of the
> LM that are needed by the input data. Therefore, to make
> meaningful use of this option you need to generate the vocabulary
> from the <TEXT-FILE> in this case.
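
One possible way to build such a vocabulary is sketched below, assuming whitespace-tokenized text and a vocabulary file with one word per line. The sample text and variable names are hypothetical. ngram-count's -write-vocab option should give an equivalent list directly from the text.

```shell
# Made-up stand-in for the contents of <TEXT-FILE> (illustrative data only).
text=$(printf 'the cat sat\nthe dog sat\n')

# Print every whitespace-separated token on its own line, then
# deduplicate to obtain a one-word-per-line vocabulary.
vocab=$(printf '%s\n' "$text" | awk '{ for (i = 1; i <= NF; i++) print $i }' | sort -u)

printf '%s\n' "$vocab"
```

Redirecting the result to a file gives something that can be passed to -vocab together with -limit-vocab.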
>
>
> Can anyone tell me why this happens ? I thought the effect of
> -vocab and -limit-vocab options is only on memory usage.
>
> A good way to track down the differences is to use -debug 2,
> capture the output in files, and use diff to see where they differ.
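
A rough sketch of that comparison follows, reusing the options from the P1 and P2 commands above. The function name, the default file names, and the output file names are hypothetical placeholders. The function is only defined here and has to be called explicitly once SRILM's ngram is on the PATH.

```shell
# Hypothetical placeholder paths; substitute your own files.
LM_FILE=${LM_FILE:-lm.gz}
TEXT_FILE=${TEXT_FILE:-test.txt}
VOCAB_FILE=${VOCAB_FILE:-words.vocab}

compare_ppl_runs() {
    # P1: no -vocab, with per-word detail from -debug 2.
    ngram -unk -map-unk '[UNKNOWN]' -order 4 -lm "$LM_FILE" \
        -ppl "$TEXT_FILE" -debug 2 > p1.debug

    # P2: same command with -vocab added.
    ngram -unk -map-unk '[UNKNOWN]' -vocab "$VOCAB_FILE" -order 4 \
        -lm "$LM_FILE" -ppl "$TEXT_FILE" -debug 2 > p2.debug

    # Show the first lines where the two runs disagree.
    diff p1.debug p2.debug | head -n 40
}

# Usage (requires SRILM and real input files):
#   compare_ppl_runs
```

The same pattern extends to the P3 run by adding -limit-vocab to the second command and writing to a third file.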
>
> Andreas
>
>
>
> Just for information, the VOCAB files are generated from
> lattice files generated during a recognition process.
>
>
> Thanks and Regards,
>
>
> Zeeshan.