can't get right counts-entropy

Andreas Stolcke stolcke at speech.sri.com
Tue Jan 22 15:36:32 PST 2008


SAI TANG HUANG wrote:
> Hi,
>
> I have created a counts file and a back-off LM file from a text file with sentences with the following command:
>
> sai at uk-notebook:~/Desktop$ ngram-count -text Merged_File.txt -lm lm_file -write count_file 
>
> Then I ran the ngram program with -counts. Here is the output:
>
> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -counts count_file 
> file count_file: 23640 sentences, 460074 words, 0 OOVs
> 7880 zeroprobs, logprob= -1.03103e+06 ppl= 146.821 ppl1= 190.575
> sai at uk-notebook:~/Desktop$ 
>
> I fail to understand the output. I read that the -counts option does something with a counts file (that would be my count_file). I don't understand why there are 7880 zeroprobs. When I run ngram with -ppl I get:
>   
The 7880 zeroprobs are probably due to the <s> tokens (one per 
sentence) output by the ngram-count program.
You cannot use the ngram-count output directly as input to ngram 
-counts. See below.
> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -debug 0 -ppl Merged_File.txt 
> file Merged_File.txt: 7880 sentences, 153358 words, 0 OOVs
> 0 zeroprobs, logprob= -270778 ppl= 47.7932 ppl1= 58.2985
> sai at uk-notebook:~/Desktop$ 
>
> Why does the -ppl yield 0 zeroprobs and the -counts give me 7880 zeroprobs? Also why are the ppl and ppl1 values different from the -ppl ?
>
> If there is a more detailed manual or document describing these values then I'm willing to read it.
>   
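As a rough check on how the reported values relate, here is a minimal 
Python sketch. It assumes that ppl normalizes logprob (base 10) by the 
number of scored words plus sentence ends, and ppl1 by scored words 
only, with OOVs and zeroprobs excluded from both. That formula is an 
assumption, not taken from the documentation, and the function name is 
made up for illustration. Note also that the -counts run reports exactly 
three times the sentences and words of the -ppl run (23640 = 3 x 7880, 
460074 = 3 x 153358), which is consistent with each word being scored 
once per N-gram order.

```python
def perplexities(logprob, sentences, words, oovs, zeroprobs):
    """Return (ppl, ppl1) from the totals printed by ngram.

    Assumed normalization (hypothetical helper, not SRILM code):
      ppl  = 10 ** (-logprob / (words - oovs - zeroprobs + sentences))
      ppl1 = 10 ** (-logprob / (words - oovs - zeroprobs))
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1


# Totals copied from the two runs quoted in this message.
ppl_run = perplexities(-270778, 7880, 153358, 0, 0)
counts_run = perplexities(-1.03103e6, 23640, 460074, 0, 7880)

print("-ppl run:    ppl=%.3f ppl1=%.3f" % ppl_run)
print("-counts run: ppl=%.3f ppl1=%.3f" % counts_run)
```

Under that assumption both runs should come out close to the ppl and 
ppl1 values printed by ngram, up to rounding of the logprob.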
This is not yet well documented.  To use ngram -counts correctly you 
must feed it only those N-grams that correspond to "events" in the LM, 
not those that only appear as "context".  That means you need to filter 
the ngram-count output and retain only N-grams that

- are of the highest order (e.g., trigrams for a trigram LM), OR
- start with <s> (but not the <s> unigram, see above).

For example, the sentence "a b c" in conjunction with a trigram LM 
should generate only the ngrams

<s> a
<s> a b
a b c
b c </s>

You can do this filtering with a small perl or gawk script.
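
The filtering rule above can be sketched as follows, in Python rather 
than perl or gawk. This is a minimal sketch, not a tested SRILM tool: it 
assumes each line of the counts file holds the N-gram words followed by 
the count as the last whitespace-separated field, and the function names 
are made up for illustration.

```python
def keep_ngram(words, order):
    """Keep highest-order N-grams, plus shorter ones that start with <s>
    (but not the <s> unigram itself)."""
    if len(words) == order:
        return True
    return len(words) > 1 and words[0] == "<s>"


def filter_counts(lines, order=3):
    """Yield only the count lines that correspond to LM 'events'.

    Assumes the last field of each line is the count and the fields
    before it are the N-gram words.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if keep_ngram(fields[:-1], order):
            yield line.rstrip("\n")


# Counts for the single sentence "a b c" with a trigram LM.
example = [
    "<s>\t1", "a\t1", "b\t1", "c\t1", "</s>\t1",
    "<s> a\t1", "a b\t1", "b c\t1", "c </s>\t1",
    "<s> a b\t1", "a b c\t1", "b c </s>\t1",
]
for kept in filter_counts(example, order=3):
    print(kept)
```

To filter a real file, pass an open file handle instead of the list, 
e.g. filter_counts(open("count_file"), order=3), write the result to a 
new file, and give that file to ngram -counts.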

Sounds like another topic for the FAQ file.

Andreas


> Thanks a lot,
>
> Sai




