[SRILM User List] Question about SRILM and sentence boundary detection

Sat Feb 11 10:10:28 PST 2012

On Thu, Feb 2, 2012 at 5:53 PM, Andreas Stolcke
<stolcke at icsi.berkeley.edu> wrote:
> On 2/2/2012 8:29 AM, L. Amber Wilcox-O'Hearn wrote:

>>
>> I'm not sure if SRILM has something that does that -- i.e. holds the
>> whole LM in RAM and waits for queries.  You might need something like
>> that as opposed to using a whole file, if you want just the
>> probabilities of the last word with respect to the previous, and you
>> want to compare different last words depending on results of previous
>> calculations, for example.
>
> Two SRILM solutions:
>
> 1. Start ngram -lm LM -escape "===" -counts - (read from stdin) and put an
> escape line (in this case, starting with "===") after every ngram in the
> input (make sure the ngram words are followed by a count "1").
> This will cause ngram to dump out the conditional prob for the ngram right
> away (instead of waiting for end-of-file).
>
> 2. Directly access the network LM server protocol implemented by ngram
> -server-port.
> Start the server with
>        % ngram -lm LM -server-port 8888
> then write ngrams to that TCP port and read back the log probs:
>
>    % telnet localhost 8888
> my first word << input
> -4.6499 >> output
>
> Of course you would do the equivalent of telnet in perl, python, C,  or some
> other language to make use of the probabilities.

Thank you, Andreas.  I wasn't aware of these capabilities.

The server-port worked exactly as expected.  That is, if I give it w1
w2 w3, it returns p(w3|w1w2).  Combined with the caching, it looks
very promising for my applications.
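
For anyone who wants to script this instead of using telnet, here is a
minimal Python sketch of the exchange shown above. It assumes only what
the telnet session shows: one n-gram per line goes in and one reply line
comes back. The function name query_server is hypothetical. If the server
also sends a greeting line on connect, or prefixes replies with a status
token, that would need extra handling, so the sketch returns the raw reply
lines and leaves the parsing to the caller.

```python
import socket


def query_server(ngrams, host="localhost", port=8888, timeout=5.0):
    """Send each n-gram (a list of words) on its own line; return the reply lines.

    Assumes a line-based exchange like the telnet session in this thread:
    one query line in, one reply line out.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Text-mode wrapper so whole lines can be written and read.
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            for words in ngrams:
                stream.write(" ".join(words) + "\n")
                stream.flush()  # push the query out before waiting for the reply
                replies.append(stream.readline().strip())
    return replies


# Example (needs a server started with: ngram -lm LM -server-port 8888):
#   query_server([["my", "first", "word"]])
```

Keeping one connection open for all the queries avoids reconnecting for
every candidate word.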

The other solution using -counts (or actually -ppl for my case) also
worked, but of course if I give it w1 w2 w3, it returns the
probability of that whole string, i.e.  p(w1) * p(w2|w1) * p(w3|w1w2),
which would be redundant for my purposes.
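
To spell out the arithmetic: since SRILM reports log10 probabilities, the
conditional term can be recovered from two whole-string scores by
subtraction, provided both strings are scored with the same options
(here -no-sos -no-eos). A small sketch, with made-up numbers that are not
SRILM output and a hypothetical function name:

```python
import math


def conditional_logprob(logprob_full, logprob_prefix):
    """log10 p(w3 | w1 w2) = log10 p(w1 w2 w3) - log10 p(w1 w2)."""
    return logprob_full - logprob_prefix


# Made-up log10 scores, for illustration only.
logprob_w1_w2 = -3.10      # log10 [ p(w1) * p(w2|w1) ]
logprob_w1_w2_w3 = -7.75   # log10 [ p(w1) * p(w2|w1) * p(w3|w1w2) ]

cond = conditional_logprob(logprob_w1_w2_w3, logprob_w1_w2)
print(cond, 10 ** cond)    # conditional log10 prob and the probability itself
```

When only the last word differs, the shared prefix terms cancel, so
comparing the whole-string scores ranks the candidates the same way as
comparing the conditional probabilities would.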

I ran
    % cat input_text | ngram -lm my_lm -escape "===" -ppl - -unk -no-sos -no-eos
where input_text looked like:
w1 w2 w3
===
w1 w2 w3'
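
An input in this layout can be generated rather than typed by hand. A
minimal sketch follows; the function name build_escaped_input is
hypothetical, and it follows Andreas's suggestion of putting an escape
line after every entry, including the last one:

```python
def build_escaped_input(word_sequences, escape="==="):
    """Return text with one word sequence per line, each followed by an escape line."""
    lines = []
    for words in word_sequences:
        lines.append(" ".join(words))
        lines.append(escape)  # escape line prompts ngram to report right away
    return "\n".join(lines) + "\n"


text = build_escaped_input([["w1", "w2", "w3"], ["w1", "w2", "w3'"]])
print(text, end="")
```

The resulting text can be written to a file or piped to the standard input
of the ngram command above.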

Still, I'm glad it was brought up, because SRILM has so much
functionality that I had overlooked something directly useful to me.

Amber
-- 
http://scholar.google.com/citations?user=15gGywMAAAAJ