[SRILM User List] Question about SRILM and sentence boundary detection

Tue Feb 14 04:54:31 PST 2012

On Sun, Feb 12, 2012 at 6:37 PM, Andreas Stolcke
<stolcke at icsi.berkeley.edu> wrote:
> From: L. Amber Wilcox-O'Hearn <amber.wilcox.ohearn at gmail.com>
>
>
> Thank you, Andreas.  I wasn't aware of these capabilities.
>
> The server-port worked exactly as expected.  That is, if I give it w1
> w2 w3, it returns p(w3|w1w2).  Combined with the caching, it looks
> very promising for my applications.
>
> The other solution using -counts (or actually -ppl for my case) also
> worked, but of course if I give it w1 w2 w3, it returns the
> probability of that whole string, i.e.  p(w1) * p(w2|w1) * p(w3|w1w2),
> which would be redundant for my purposes.
>
> That's not correct. ngram -counts will output CONDITIONAL ngram
> probabilities.
>
> -counts countsfile
>     Perform a computation similar to -ppl, but based only on the N-gram
>     counts found in countsfile. Probabilities are computed for the last
>     word of each N-gram, using the other words as contexts, and scaling
>     by the associated N-gram count. Summary statistics are output at the
>     end, as well as before each escaped input line.
>
> So it should do exactly what you need.

I see. I misunderstood the difference between -ppl and -counts.
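
To make the distinction concrete, here is a minimal Python sketch. The log
probabilities are made-up toy values, not output from any model, and the
function names are hypothetical. It ignores sentence-boundary tokens and
only contrasts the two quantities discussed above: the whole-string
probability p(w1) * p(w2|w1) * p(w3|w1w2) versus the count-weighted
conditional probability of the last word of each N-gram.

```python
# Toy log10 conditional probabilities (made-up values for illustration only).
LOGP_W1 = -2.0             # log10 p(w1)
LOGP_W2_GIVEN_W1 = -1.5    # log10 p(w2 | w1)
LOGP_W3_GIVEN_W1W2 = -1.0  # log10 p(w3 | w1 w2)


def string_logprob(conditional_logprobs):
    """Chain rule: log10 probability of the whole word string."""
    return sum(conditional_logprobs)


def counts_logprob(entries):
    """Count-weighted sum over N-grams.

    Each entry is (log10 p(last word | context), count): only the last
    word of the N-gram is scored, scaled by the N-gram's count.
    """
    return sum(count * logp for logp, count in entries)


# Whole string w1 w2 w3: every word contributes.
whole_string = string_logprob(
    [LOGP_W1, LOGP_W2_GIVEN_W1, LOGP_W3_GIVEN_W1W2]
)

# A single trigram "w1 w2 w3" with count 1: only p(w3 | w1 w2) contributes.
trigram_only = counts_logprob([(LOGP_W3_GIVEN_W1W2, 1)])

print("whole string log10 prob:", whole_string)
print("trigram-only log10 prob:", trigram_only)
```

With these toy values, the first quantity is the sum of all three terms and
the second is just the final conditional term.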

I did try this and the summary statistics at the end gave the correct
sum, but there weren't any statistics output before the escaped lines:
> cat testcounts | ngram -lm LM -escape "===" -counts - -unk
===
===
===
file -: 0 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -9.87606 ppl= 294.452 ppl1= 294.452
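
As a sanity check on the summary line, the reported perplexity can be
reproduced from the reported logprob. The sketch below assumes the usual
SRILM convention that ppl = 10^(-logprob / (words - OOVs - zeroprobs +
sentences)) and that ppl1 leaves the sentence count out of the
denominator; the function name and parameters are my own.

```python
def perplexity(logprob, words, sentences=0, oovs=0, zeroprobs=0,
               include_sentence_ends=True):
    """Perplexity from a total log10 probability.

    Assumed convention: the denominator counts scored words, plus one
    end-of-sentence token per sentence when include_sentence_ends is True.
    """
    denom = words - oovs - zeroprobs
    if include_sentence_ends:
        denom += sentences
    return 10 ** (-logprob / denom)


# Values taken from the summary line above.
ppl = perplexity(-9.87606, words=4, sentences=0)
ppl1 = perplexity(-9.87606, words=4, sentences=0,
                  include_sentence_ends=False)

print(f"ppl  = {ppl:.3f}")
print(f"ppl1 = {ppl1:.3f}")
```

Because the run reports 0 sentences, both denominators are 4, which is
consistent with ppl and ppl1 being identical in the output.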

Did I miss something?

Amber
-- 
http://scholar.google.com/citations?user=15gGywMAAAAJ