[SRILM User List] Fwd: Fwd: ngram-count

Andreas Stolcke stolcke at speech.sri.com
Mon Jan 11 08:54:27 PST 2010


On 1/11/2010 5:08 AM, Manuel Alves wrote:
> Hi again.
> What about using the -unk option in the ngram-count command?
> Do the OOVs and the zeroprobs disappear?
>
Please read the FAQ items starting at D6 to understand the handling of
OOVs and zeroprobs, and what -unk does to the model.

Andreas

>
> On Mon, Jan 11, 2010 at 12:00 PM, Manuel Alves <beleira at gmail.com> wrote:
>
>
>
>     ---------- Forwarded message ----------
>     From: Manuel Alves <beleira at gmail.com>
>     Date: Mon, Jan 11, 2010 at 11:49 AM
>     Subject: Re: [SRILM User List] Fwd: Fwd: ngram-count
>     To: Andreas Stolcke <stolcke at speech.sri.com>
>
>
>     Hi Andreas.
>     The output of ngram-count was:
>
>     [root@localhost Corporas]# ../srilm/bin/i686/ngram-count -order 3 -text CETEMPublico1.7 -lm LM
>     warning: discount coeff 1 is out of range: 1.44451e-17
>
>     I don't know if there is any problem with the GT discount method.
>
>
>     On Fri, Jan 8, 2010 at 9:52 PM, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>
>         On 1/8/2010 3:57 AM, Manuel Alves wrote:
>>
>>
>>         ---------- Forwarded message ----------
>>         From: Manuel Alves <beleira at gmail.com>
>>         Date: Fri, Jan 8, 2010 at 10:40 AM
>>         Subject: Re: Fwd: ngram-count
>>         To: Andreas Stolcke <stolcke at speech.sri.com>
>>
>>
>>         1. ngram-count -text CETEMPublico1.7 -lm LM
>>         2. I test it in this way, using the client-server
>>            architecture of SRILM:
>>            SERVER: ngram -lm ../$a -server-port 100 -order 3
>>            CLIENT: ngram -use-server 100\@localhost -cache-served-ngrams -ppl $ficheiro -debug 2 2>&1
>>            where $ficheiro is this:
>
>>
>>
>>             p( observássemos | que ...)     =  0 [ -inf ]
>
>>         file final.txt: 6 sentences, 126 words, 0 OOVs
>>         6 zeroprobs, logprob= -912.981 ppl= 1.7615e+07 ppl1= 4.05673e+07
>
>         It looks to me like everything is working as intended.   You
>         are getting zeroprobs, but not a large number of them.
>         They are low-frequency words (like the one above), so it makes
>         sense, since they are probably not contained in the training
>         corpus.
>
>         The perplexity is quite high, but that could be because of a
>         small or mismatched training corpus. You didn't include the
>         output of the ngram-count program; it's possible that the GT
>         (default) discounting method reported some problems that are
>         not evident from your mail.
>
>         One thing to note is that with network-server LMs you don't
>         get OOVs, because all words are implicitly added to the
>         vocabulary. Consequently, OOVs are counted as zeroprobs
>         instead, but both types of tokens are equivalent for
>         perplexity computation.
>         Still, you could run
>                  ngram -lm ../$a -order 3  -ppl $ficheiro -debug 2
>         just to make sure you're getting the same result.
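For reference, the ppl and ppl1 figures in the summary line quoted above can be reproduced from the other numbers on that line. The following is a minimal Python sketch, not SRILM code: it assumes logprob is a base-10 log probability, that OOVs and zeroprobs are both left out of the token count, and that ppl additionally counts one end-of-sentence token per sentence. The function name is made up for illustration.

```python
def perplexities(logprob, sentences, words, oovs, zeroprobs):
    """Return (ppl, ppl1) from the summary counts printed by `ngram -ppl`.

    Assumption: logprob is a base-10 log probability, and OOVs and
    zeroprobs are excluded from the token count in the same way.
    """
    scored_words = words - oovs - zeroprobs
    # ppl also counts one end-of-sentence token per sentence.
    ppl = 10 ** (-logprob / (scored_words + sentences))
    # ppl1 leaves the end-of-sentence tokens out of the denominator.
    ppl1 = 10 ** (-logprob / scored_words)
    return ppl, ppl1


# Numbers taken from the summary line quoted above.
ppl, ppl1 = perplexities(-912.981, sentences=6, words=126, oovs=0, zeroprobs=6)
print(f"ppl= {ppl:.5g} ppl1= {ppl1:.5g}")

# Counting the 6 tokens as OOVs instead of zeroprobs gives the same values.
assert perplexities(-912.981, 6, 126, 6, 0) == (ppl, ppl1)
```

Under these assumptions the result should agree with the reported ppl and ppl1 to within rounding, and it shows why it does not matter for perplexity whether a token is counted as an OOV or as a zeroprob.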
>
>         Andreas
>
>
>>         Manuel Alves.
>>
>>         On Thu, Jan 7, 2010 at 8:35 PM, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>>
>>             On 1/6/2010 10:34 AM, Manuel Alves wrote:
>>>
>>>
>>>             ---------- Forwarded message ----------
>>>             From: Manuel Alves <beleira at gmail.com>
>>>             Date: Wed, Jan 6, 2010 at 6:33 PM
>>>             Subject: ngram-count
>>>             To: srilm-user at speech.sri.com
>>>
>>>
>>>             Hi people.
>>>             I need help with ngram-count: I am training a model,
>>>             but when I then try it on some test examples, it gives
>>>             me zeroprobs in the output.
>>>             Does this mean that the model is badly trained?
>>>             Please answer me.
>>>             Best regards,
>>>             Manuel Alves.
>>
>
>         _______________________________________________
>         SRILM-User site list
>         SRILM-User at speech.sri.com
>         http://www.speech.sri.com/mailman/listinfo/srilm-user
>
>
>
>
