[SRILM User List] Fwd: Fwd: ngram-count

Andreas Stolcke stolcke at speech.sri.com
Fri Jan 8 13:52:55 PST 2010


On 1/8/2010 3:57 AM, Manuel Alves wrote:
>
>
> ---------- Forwarded message ----------
> From: *Manuel Alves* <beleira at gmail.com <mailto:beleira at gmail.com>>
> Date: Fri, Jan 8, 2010 at 10:40 AM
> Subject: Re: Fwd: ngram-count
> To: Andreas Stolcke <stolcke at speech.sri.com 
> <mailto:stolcke at speech.sri.com>>
>
>
> 1. ngram-count -text CETEMPublico1.7 -lm LM
> 2. I test it this way, using the client-server architecture of SRILM:
>        SERVER: ngram -lm ../$a -server-port 100 -order 3
>        CLIENT: ngram -use-server 100\@localhost -cache-served-ngrams -ppl $ficheiro -debug 2 2>&1
>    where $ficheiro is this:

>
>
>     p( observássemos | que ...)     =  0 [ -inf ]

> file final.txt: 6 sentences, 126 words, 0 OOVs
> 6 zeroprobs, logprob= -912.981 ppl= 1.7615e+07 ppl1= 4.05673e+07

It looks to me like everything is working as intended.   You are getting 
zeroprobs, but not a large number of them.
They come from low-frequency words (like the one above), which makes 
sense, since such words are probably not contained in the training corpus.

The perplexity is quite high, but that could be due to a small or 
mismatched training corpus.   You didn't include the output of the 
ngram-count program; it's possible that the GT (default) discounting 
method reported some problems that are not evident from your mail.

One thing to note is that with network-server LMs you don't get OOVs, 
because all words are implicitly added to the vocabulary. Consequently, 
OOVs are counted as zeroprobs instead, but both types of tokens are 
equivalent for perplexity computation.
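As a sanity check, the reported perplexities can be reproduced from the logprob line above. This is a small sketch assuming SRILM's usual convention: zeroprob tokens (like OOVs) are excluded from the token count, sentence-end tokens (one per sentence) count toward ppl but not ppl1:

```python
# Reproduce the perplexity figures from the -ppl output quoted above:
#   6 sentences, 126 words, 0 OOVs, 6 zeroprobs, logprob = -912.981
sentences, words, oovs, zeroprobs = 6, 126, 0, 6
logprob = -912.981

# Zeroprobs and OOVs are excluded from the denominator; </s> tokens
# (one per sentence) are included for ppl but not for ppl1.
denom_ppl = words - oovs - zeroprobs + sentences   # 126
denom_ppl1 = words - oovs - zeroprobs              # 120

ppl = 10 ** (-logprob / denom_ppl)
ppl1 = 10 ** (-logprob / denom_ppl1)

print(f"ppl  = {ppl:.4e}")   # 1.7615e+07
print(f"ppl1 = {ppl1:.4e}")  # 4.0567e+07
```

Both values match the output quoted above, which confirms that the 6 zeroprobs were simply dropped from the normalization, exactly as OOVs would be.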
Still, you could run
          ngram -lm ../$a -order 3  -ppl $ficheiro -debug 2
just to make sure you're getting the same result.

Andreas

> Manuel Alves.
>
> On Thu, Jan 7, 2010 at 8:35 PM, Andreas Stolcke 
> <stolcke at speech.sri.com <mailto:stolcke at speech.sri.com>> wrote:
>
>     On 1/6/2010 10:34 AM, Manuel Alves wrote:
>>
>>
>>     ---------- Forwarded message ----------
>>     From: *Manuel Alves* <beleira at gmail.com <mailto:beleira at gmail.com>>
>>     Date: Wed, Jan 6, 2010 at 6:33 PM
>>     Subject: ngram-count
>>     To: srilm-user at speech.sri.com <mailto:srilm-user at speech.sri.com>
>>
>>
>>     Hi people.
>>     I need help with ngram-count: I am training a model, but when
>>     I then use it on a test example it gives me zeroprobs in the
>>     output. Does this mean the model is badly trained?
>>     Please answer me.
>>     Best regards,
>>     Manuel Alves.
>

