[SRILM User List] Question about SRILM and sentence boundary detection

Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Feb 1 12:44:37 PST 2012


Georgi,

You can get the conditional probabilities for arbitrary sets of ngrams using

     ngram -counts FILE
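For example (a sketch only; the file name is hypothetical, and the exact
-debug output may be formatted slightly differently from the -ppl output
quoted below): put each N-gram you need on its own line, followed by a
count, and a single ngram run will score all of them, so the large LM is
loaded only once.

     # contents of trigrams.count (hypothetical): one N-gram per line, followed by a count
     #   wordt_2 wordt_1 wordt    1
     #   wordt_2 wordt_1 <s>      1
     #   wordt_1 wordt wordt+1    1
     ngram -lm $LM_URI -order $order -counts trigrams.count -debug 2 -unk > /tmp/output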

Andreas


On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:
> Dear Mr. Stolcke,
>
> I am trying to do sentence boundary segmentation. I have an n-gram
> language model, which I built with the SRILM toolkit. Thanks
> for the nice tool!
>
> I have the following problem.
>
> I implement the forward-backward algorithm on my own, so I need to
> combine the n-grams of your "hidden event model" with the prosodic model.
> Therefore, I need the probabilities of the individual n-grams
> (in my case, 3-grams).
>
> For example, for the word sequence
> wordt_2 wordt_1 wordt wordt+1 wordt+2
>
> I need
> P( <s> , wordt | wordt_2 wordt_1)
> P(wordt | wordt_2 wordt_1)
> P(wordt+1 | wordt_1 wordt)
> ... and so on: all possible combinations with and without <s> before each word.
>
>
> What I do to get one of these is to use the following SRILM command:
>
> # create text for case *wordt_2 wordt_1 <s> wordt*
> > echo "$wordt_2 $wordt_1
> > $wordt" > testtext2;
>
> > ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk >/tmp/output;
> and then I read the line of the output that I need (e.g. line 3).
>
>
> OUTPUT:
> wordt_2 wordt_1
> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( <unk> | <unk> ...) = [2gram] 0.00343115 [ -2.46456 ]
> p( </s> | <unk> ...) = [2gram] 0.0937662 [ -1.02795 ]
> 1 sentences, 2 words, 0 OOVs
> 0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4
>
> wordt
> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( </s> | <unk> ...) = [2gram] 0.10582 [ -0.975432 ]
> 1 sentences, 1 words, 0 OOVs
> 0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59
>
> file testtext2: 2 sentences, 3 words, 0 OOVs
> 0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
> --------------------------------
>
>
>
> The problem is that for each trigram I invoke ngram again, and each
> call reloads the LM (> 1 GB), which makes it very slow.
> Is there a faster solution? I also do not need the perplexity output.
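> (A partial workaround along the same lines would be to batch all the
> two-line test cases into a single file, so that ngram is invoked, and
> the LM loaded, only once; a rough sketch with a hypothetical file name:
>
> echo "$wordt_2 $wordt_1
> $wordt" >> alltests;
> # ... append the remaining test cases in the same way, then run once:
> ngram -lm $LM_URI -order $order -ppl alltests -debug 2 -unk >/tmp/output;
>
> and read the needed lines from the combined output.)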
>
> I know about the segmentation tool
> http://www.speech.sri.com/projects/srilm/manpages/segment.1.html
>   but it gives results for the whole sequence, which is not my goal.
>
>
>
>
> Kind regards,
> Georgi Dzhambazov,
>
> Student Assistant,
> NetMedia
> ________________________________________
> From: Andreas Stolcke [stolcke at icsi.berkeley.edu]
> Sent: Thursday, October 13, 2011 05:50
> To: Dzhambazov, Georgi
> Cc: eee at speech.sri.com
> Subject: Re: Question about sentence boundary detection paper
>
> Dzhambazov, Georgi wrote:
> > Dear A. Stolcke,
> > Dear E. Shriberg,
> >
> >
> > I am interested in your approach of sentence boundary detection.
> > I would be very happy if you could find some time to clarify some of
> > the steps of your approach for me.
> > I plan to implement them.
> >
> > Question 1)
> > In the paper (1), in paragraph 2.2.1, you say that states are "the end
> > of sentence status of each word plus any preceding words."
> > So for example at position 4 of the example sentence, the state is
> > (<ns> + quick brown fox). At position 6 the state is (<s> + brown fox
> > flies).
> > This means a huge state space. Is this right?
> >
> > Position: 1   2     3     4   5     6   7   8      9  10
> > Word:     The quick brown fox flies <s> The rabbit is white.
> The state space is potentially huge, but just like in standard N-gram
> LMs you only consider the histories (= states) actually occurring in the
> training data, and handle any new histories through backoff.
> Furthermore, the state space is constrained to those that match the
> ngrams in the word sequence. So for every word position you have to
> consider only two states (<s> and no-<s>).
> >
> > Question 2)
> > Transition probabilities are N-gram probabilities. You give an
> > example with bigram probabilities in the next line.
> > However, you also say that you are using a 4-gram LM, so the correct
> > example should be:
> > the probability at position 6 is Pr(<s> | brown fox flies)
> > and at position 4 it is Pr(<ns> | quick brown fox).
> > Is this right?
> Correct.
> >
> > Question 3)
> > Then for recognition you say that the forward-backward algorithm is
> > used to determine the maximal P(T_i | W),
> > where T_i corresponds to <s> or <ns> at position i. However, the
> > transition probabilities include information about states like
> > (<ns> + quick brown fox).
> > How do you apply the transition probabilities in this model? Does this
> > relate to the formula in section 4 of (2)?
> > I think that formula can work for the forward-backward algorithm,
> > although section 4 states that it is used for Viterbi.
> For finding the most probable T_i you in fact use the Viterbi algorithm.
>
> The formulas in section 4 just give one step in the forward computation
> that would be used in the Viterbi algorithm.
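> Schematically (this notation is only illustrative, not taken from the
> papers), one such step over the two boundary states T in {<s>, <ns>} is
>
>      delta_i(T) = max_{T'} [ delta_{i-1}(T') * P(w_i, T | preceding words, T') ]
>
> with the word history truncated to the N-gram order. Replacing the max
> by a sum gives the forward probabilities needed for the posteriors
> P(T_i | W), and backtracking over the argmax choices recovers the most
> likely boundary sequence.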
>
> Please note that this is all implemented in the "segment" tool that
> comes with SRILM.
> See http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
> http://www.speech.sri.com/projects/srilm/ for more information on SRILM.
>
> Andreas
>
> >
> > References:
> >
> > 1) Shriberg et al. (2000), "Prosody-based automatic segmentation of
> > speech into sentences and topics"
> > 2) Stolcke and Shriberg (1996), "Automatic linguistic segmentation of
> > conversational speech"
> >
> > Thank you!
> >
> > Kind Regards,
> > Georgi Dzhambazov,
> >
> > Student Assistant,
> > NetMedia
>
