[SRILM User List] Question about SRILM and sentence boundary detection
stolcke at icsi.berkeley.edu
Thu Feb 2 16:53:07 PST 2012
On 2/2/2012 8:29 AM, L. Amber Wilcox-O'Hearn wrote:
> (Sorry Andreas, I meant to reply to the list):
> I'm not sure if SRILM has something that does that -- i.e. holds the
> whole LM in RAM and waits for queries. You might need something like
> that as opposed to using a whole file, if you want just the
> probabilities of the last word with respect to the previous, and you
> want to compare different last words depending on results of previous
> calculations, for example.
Two SRILM solutions:
1. Start ngram -lm LM -escape "===" -counts - (reading from stdin) and put
an escape line (in this case, a line starting with "===") after every ngram in
the input (make sure the ngram words are followed by a count of "1").
This will cause ngram to dump out the conditional probability for the ngram
right away (instead of waiting for end-of-file).
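The escape-line trick could be driven from a script like the following sketch. It only shows how to format the stdin stream; the subprocess call is left commented out and assumes SRILM's ngram is on your PATH and "LM" names a real model file:

```python
# Sketch: query conditional n-gram probabilities from one long-running
# ngram process instead of restarting it per query.
import subprocess

ESCAPE = "==="  # passed via -escape; escape lines flush output per query

def format_count_queries(ngrams, escape=ESCAPE):
    """Build stdin for `ngram -lm LM -escape "===" -counts -`:
    each n-gram followed by the count 1, then an escape line so the
    conditional log prob is emitted right away."""
    lines = []
    for ngram in ngrams:
        lines.append(" ".join(ngram) + " 1")
        lines.append(escape)
    return "\n".join(lines) + "\n"

# Two trigram queries, formatted for ngram's -counts input:
queries = format_count_queries([("the", "quick", "brown"),
                                ("quick", "brown", "fox")])

# Uncomment if SRILM is installed and LM points at a real model:
# proc = subprocess.Popen(
#     ["ngram", "-lm", "LM", "-escape", ESCAPE, "-counts", "-"],
#     stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
# out, _ = proc.communicate(queries)
```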
2. Directly access the network LM server protocol implemented by ngram.
Start the server with
% ngram -lm LM -server-port 8888
then write ngrams to that TCP port and read back the log probs:
% telnet localhost 8888
my first word          (input)
-4.6499                (output)
Of course you would do the equivalent of telnet in perl, python, C, or
some other language to make use of the probabilities.
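A minimal sketch of that telnet equivalent in Python, using the example host and port from above (it assumes the server replies with one number per line, as in the telnet session shown):

```python
# Sketch: send one n-gram per line to a server started with
# `ngram -lm LM -server-port 8888` and read back the log probability.
import socket

def parse_logprob(reply):
    """The server answers with a single number, e.g. '-4.6499\n'."""
    return float(reply.strip())

def query_logprob(ngram_words, host="localhost", port=8888):
    """Send a whitespace-separated n-gram and return its log prob."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((" ".join(ngram_words) + "\n").encode())
        reply = sock.makefile().readline()
    return parse_logprob(reply)

# Usage (requires a running ngram server on port 8888):
# lp = query_logprob(["my", "first", "word"])
```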
> I have a little C/Python tool I wrote for exactly this purpose. It's
> at https://github.com/lamber/BackOffTrigramModel
> It's very specific to my work at the time. So for example, it works
> for only exactly trigrams, and it assumes you are using <unk>. It
> performs all the back-off calculations for unseen trigrams. But it
> looks like you have the same use case, so it might be useful for you.
> It's not much documented, but the unit tests show how it works.
> On Wed, Feb 1, 2012 at 1:44 PM, Andreas Stolcke
> <stolcke at icsi.berkeley.edu> wrote:
>> You can get the conditional probabilities for arbitrary sets of ngrams using
>> ngram -counts FILE
>> On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:
>> Dear Mr. Stolcke,
>> I am trying to do sentence boundary segmentation. I have an n-gram language
>> model and for modelling it I use the SRILM toolkit. Thanks for the nice
>> toolkit. I have the following problem.
>> I implement the forward-backward algorithm on my own. So I need to combine
>> the n-grams of your "hidden event model" with the prosodic model.
>> Therefore, I need to get the probabilities of the individual n-grams (in my
>> case 3-grams).
>> For example for the word sequence
>> wordt_2 wordt_1 wordt wordt+1 wordt+2
>> i need
>> P(<s> , wordt | wordt_2 wordt_1)
>> P(wordt | wordt_2 wordt_1)
>> P(wordt+1 | wordt_1 wordt)
>> ... and so on
>> all possible combinations with and without <s> before each word.
>> What I do to get one of these is to use the following SRILM command:
>> # create text for case *wordt_2 wordt_1<s> wordt*
>>> echo "$wordt_2 $wordt_1
>>> $wordt" > testtext2;
>>> ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk > /tmp/output;
>> and then read the corresponding line from the output that I need (e.g. line
>> 3 )
>> wordt_2 wordt_1
>> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
>> p( <unk> | <unk> ...) = [2gram] 0.00343115 [ -2.46456 ]
>> p( </s> | <unk> ...) = [2gram] 0.0937662 [ -1.02795 ]
>> 1 sentences, 2 words, 0 OOVs
>> 0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4
>> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
>> p( </s> | <unk> ...) = [2gram] 0.10582 [ -0.975432 ]
>> 1 sentences, 1 words, 0 OOVs
>> 0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59
>> file testtext2: 2 sentences, 3 words, 0 OOVs
>> 0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
>> The problem is that for each trigram I call ngram again, and it
>> reloads the LM (> 1GB), which makes it very slow.
>> Is there a faster solution? I do not need the perplexity values, either.
>> I know about the segmentation tool
>> but it gives results for the whole sequence, which is not my goal.
>> Kind regards,
>> Georgi Dzhambazov,
>> Student Assistant,
>> From: Andreas Stolcke [stolcke at icsi.berkeley.edu]
>> Sent: Thursday, 13 October 2011, 05:50
>> To: Dzhambazov, Georgi
>> Cc: eee at speech.sri.com
>> Subject: Re: Question about sentence boundary detection paper
>> Dzhambazov, Georgi wrote:
>>> Dear A. Stolcke,
>>> Dear E. Shriberg,
>>> I am interested in your approach of sentence boundary detection.
>>> I would be very happy if you find some time to clarify me some of the
>>> steps of your approach.
>>> I plan to implement them.
>>> Question 1)
>>> In the paper (1) at paragraph 2.2.1 you say that states are "the end
>>> of sentence status of each word plus any preceding words."
>>> So for example at position 4 of the example sentence, the state is
>>> (<ns> + quick brown fox). At position 6 the state is (<s> + brown fox
>>> flies).
>>> This means a huge state space. Is this right?
>>> 1   2     3     4   5     6   7   8      9  10
>>> The quick brown fox flies <s> The rabbit is white.
>> The state space is potentially huge, but just like in standard N-gram
>> LMs you only consider the histories (= states) actually occurring in the
>> training data, and handle any new histories through backoff.
>> Furthermore, the state space is constrained to those that match the
>> ngrams in the word sequence. So for every word position you have to
>> consider only two states (<s> and no-<s>).
>>> Question 2)
>>> Transition probabilities are N-gram probabilities. You give an
>>> example with bigram probabilities in the next line.
>>> However, you say as well that you are using a 4-gram LM. So the
>>> correct example should be: the probability at position 6 is
>>> Pr(<s> | brown fox flies) and at position 4 is Pr(<ns> | quick brown fox).
>>> Is this right?
>>> Question 3)
>>> Then for recognition you say that the forward-backward algorithm is
>>> used to determine the maximal P (T_i | W )
>>> where T_i corresponds to <s> or <ns> at position i. However, the
>>> transition probabilities include information about states like
>>> (<ns> + quick brown fox).
>>> How do you apply the transition probabilities in this model? Does it
>>> relate to the formula of section 4 of (2)?
>>> I think this formula can work for the forward-backward algorithm,
>>> although it is stated in this section 4 that it is used for Viterbi.
>> For finding the most probable T_i you do in fact use the Viterbi algorithm.
>> The formulas in section 4 just give one step in the forward computation
>> that would be used in the Viterbi algorithm.
>> Please note that this is all implemented in the "segment" tool that
>> comes with SRILM.
>> See http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
>> http://www.speech.sri.com/projects/srilm/ for more information on SRILM.
>>> 1) Shriberg et al. 2000 - Prosody based automatic segmentation of
>>> Speech into sentences and topics
>>> 2) Stolcke and Shriberg - 1996 - Automatic linguistic segmentation of
>>> conversational speech
>>> Thank you!
>>> Kind Regards,
>>> Georgi Dzhambazov,
>>> Studentischer Mitarbeiter,
>> SRILM-User site list
>> SRILM-User at speech.sri.com