[SRILM User List] Question about SRILM and sentence boundary detection

L. Amber Wilcox-O'Hearn amber.wilcox.ohearn at gmail.com
Thu Feb 2 08:29:07 PST 2012


(Sorry Andreas, I meant to reply to the list):

Georgi,

I'm not sure if SRILM has something that does that -- i.e. holds the
whole LM in RAM and waits for queries.  You might need something like
that, as opposed to processing a whole file, if you want just the
probability of the last word given the previous ones, and you want to
compare different last words depending on the results of previous
calculations, for example.

I have a little C/Python tool I wrote for exactly this purpose.  It's
at https://github.com/lamber/BackOffTrigramModel

It's very specific to my work at the time.  So, for example, it works
only on trigrams, exactly, and it assumes you are using <unk>.  It
performs all the back-off calculations for unseen trigrams.  But it
looks like you have the same use case, so it might be useful for you.

It isn't well documented, but the unit tests show how it works.
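
In case it's useful to see the idea, here is a minimal sketch of the
same approach (my own illustration, not the actual API of the tool
above): load an ARPA trigram model into a dict once, then answer
P( w3 | w1 w2 ) queries in RAM with standard back-off.

    def load_arpa(path):
        """Parse an ARPA LM into {ngram tuple: (log10 prob, log10 bow)}."""
        table, order = {}, 0
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith("\\") and "-grams:" in line:
                    order = int(line[1:line.index("-")])  # \3-grams: -> 3
                    continue
                if not line or line.startswith("\\") or order == 0:
                    continue
                parts = line.split()
                # fields: logprob, the N words, optional back-off weight
                if len(parts) == order + 2:
                    words, bow = tuple(parts[1:-1]), float(parts[-1])
                else:
                    words, bow = tuple(parts[1:]), 0.0
                table[words] = (float(parts[0]), bow)
        return table

    def logprob(table, w1, w2, w3):
        """log10 P(w3 | w1 w2), backing off when the trigram is unseen."""
        if (w1, w2, w3) in table:
            return table[(w1, w2, w3)][0]
        bow12 = table[(w1, w2)][1] if (w1, w2) in table else 0.0
        if (w2, w3) in table:
            return bow12 + table[(w2, w3)][0]
        bow2 = table[(w2,)][1] if (w2,) in table else 0.0
        # assumes an open-vocabulary LM, i.e. <unk> has a unigram entry
        uni = table.get((w3,)) or table[("<unk>",)]
        return bow12 + bow2 + uni[0]

After loading once (lm = load_arpa("lm.arpa")), each query such as
logprob(lm, "quick", "brown", "fox") is just a few dict lookups.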

Amber
--
http://scholar.google.com/citations?user=15gGywMAAAAJ

On Wed, Feb 1, 2012 at 1:44 PM, Andreas Stolcke
<stolcke at icsi.berkeley.edu> wrote:
> Georgi,
>
> You can get the conditional probabilities for arbitrary sets of ngrams using
>
>     ngram -counts FILE
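>
> For example, all of the trigram queries in your message below could go
> into one counts file (one N-gram plus a count per line) and be scored
> with a single ngram invocation, so the LM is loaded only once.  Here is
> an untested sketch of that; the parsing is an assumption, namely that
> -debug 2 prints one bracketed log probability per N-gram, in file
> order, in the same format as the -ppl transcript below:
>
>     import re, subprocess, tempfile
>
>     def batch_logprobs(lm_path, trigrams, order=3):
>         # one line per needed trigram, with a dummy count of 1
>         with tempfile.NamedTemporaryFile("w", suffix=".counts",
>                                          delete=False) as f:
>             for w1, w2, w3 in trigrams:
>                 f.write(f"{w1} {w2} {w3} 1\n")
>             counts = f.name
>         # single ngram run for the whole batch
>         out = subprocess.run(
>             ["ngram", "-lm", lm_path, "-order", str(order),
>              "-counts", counts, "-debug", "2", "-unk"],
>             capture_output=True, text=True, check=True).stdout
>         # grab the "[ logprob ]" field of each p( ... ) output line
>         logps = [float(x) for x in re.findall(r"\[ (\S+) \]", out)]
>         return dict(zip(trigrams, logps))
>
>     probs = batch_logprobs("my.lm.gz", [("quick", "brown", "fox")])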
>
> Andreas
>
>
> On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:
>
> Dear Mr. Stolcke,
>
> I am trying to do sentence boundary segmentation. I have an n-gram
> language model, which I built with the SRILM toolkit. Thanks for the
> nice tool!
>
> I have the following problem.
>
> I implement the forward-backward algorithm on my own, so I need to
> combine the n-grams of your "hidden event model" with the prosodic
> model. Therefore, I need to get the probabilities of the individual
> n-grams (in my case 3-grams).
>
> For example, for the word sequence
> word_{t-2} word_{t-1} word_t word_{t+1} word_{t+2}
>
> I need
> P( <s>, word_t | word_{t-2} word_{t-1} )
> P( word_t | word_{t-2} word_{t-1} )
> P( word_{t+1} | word_{t-1} word_t )
> ... and so on:
> all possible combinations, with and without <s> before each word.
>
>
> What I do to get one of these is to run the following SRILM commands:
>
>     # create the text for the case: word_{t-2} word_{t-1} <s> word_t
>     echo "$wordt_2 $wordt_1
>     $wordt" > testtext2
>
>     ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk > /tmp/output
>
> and then read the line of the output that I need (e.g. line 3).
>
>
>
> OUTPUT:
> wordt_2 wordt_1
> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( <unk> | <unk> ...) = [2gram] 0.00343115 [ -2.46456 ]
> p( </s> | <unk> ...) = [2gram] 0.0937662 [ -1.02795 ]
> 1 sentences, 2 words, 0 OOVs
> 0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4
>
> wordt
> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( </s> | <unk> ...) = [2gram] 0.10582 [ -0.975432 ]
> 1 sentences, 1 words, 0 OOVs
> 0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59
>
> file testtext2: 2 sentences, 3 words, 0 OOVs
> 0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
> --------------------------------
>
>
>
> The problem is that for each trigram I invoke ngram again, and every
> call reloads the LM (> 1 GB), which makes this very slow.
> Is there a faster solution? I also do not need the perplexity.
>
> I know about the segmentation tool
> http://www.speech.sri.com/projects/srilm/manpages/segment.1.html
> but it gives results for the whole sequence, which is not my goal.
>
>
>
>
> Kind regards,
> Georgi Dzhambazov,
>
> Student Assistant,
> NetMedia
> ________________________________________
> From: Andreas Stolcke [stolcke at icsi.berkeley.edu]
> Sent: Thursday, October 13, 2011 05:50
> To: Dzhambazov, Georgi
> Cc: eee at speech.sri.com
> Subject: Re: Question about sentence boundary detection paper
>
> Dzhambazov, Georgi wrote:
>> Dear A. Stolcke,
>> Dear E. Shriberg,
>>
>>
>> I am interested in your approach to sentence boundary detection.
>> I would be very happy if you could find some time to clarify some of
>> the steps of your approach, as I plan to implement them.
>>
>> Question 1)
>> In the paper (1), in paragraph 2.2.1, you say that states are "the end
>> of sentence status of each word plus any preceding words".
>> So, for example, at position 4 of the example sentence the state is
>> ( <ns> + quick brown fox ), and at position 6 the state is
>> ( <s> + brown fox flies ).
>> This means a huge state space. Is this right?
>>
>> Position: 1   2     3     4   5     6   7   8      9  10
>> Word:     The quick brown fox flies <s> The rabbit is white.
> The state space is potentially huge, but just like in standard N-gram
> LMs you only consider the histories (= states) actually occurring in the
> training data, and handle any new histories through backoff.
> Furthermore, the state space is constrained to the states that match the
> ngrams in the word sequence, so for every word position you have to
> consider only two states (<s> and no-<s>). With a trigram LM, for
> example, the history after word i is just the last two tokens, i.e.
> either (word_{i-1}, word_i) or (<s>, word_i).
>>
>> Question 2)
>> Transition probabilities are N-gram probabilities. You give an
>> example with bigram probabilities in the next line.
>> However, you also say that you are using a 4-gram LM. So the correct
>> example should be:
>> the probability at position 6 is Pr( <s> | brown fox flies )
>> and at position 4 it is Pr( <ns> | quick brown fox ).
>> Is this right?
> Correct.
>>
>> Question 3)
>> Then, for recognition, you say that the forward-backward algorithm is
>> used to determine the maximal P(T_i | W),
>> where T_i corresponds to <s> or <ns> at position i. However, the
>> transition probabilities include information about states like
>> ( <ns> + quick brown fox ).
>> How do you apply the transition probabilities in this model? Does it
>> relate to the formula in section 4 of (2)?
>> I think this formula can work for the forward-backward algorithm,
>> although section 4 states that it is used for Viterbi.
> For finding the most probable T_i you do in fact use the Viterbi
> algorithm.
>
> The formulas in section 4 just give one step of the forward computation
> that would be used in the Viterbi algorithm.
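>
> To make the two-state bookkeeping concrete, here is a minimal Viterbi
> sketch (an illustration only, not the actual segment code; the LM
> lookup is a stub that a real implementation would back with the
> trigram model and its back-off weights):
>
>     LOGP = {}   # (h1, h2, token) -> log10 prob, filled from the real LM
>
>     def logp(token, h1, h2):
>         # stub lookup with a floor value instead of proper back-off
>         return LOGP.get((h1, h2, token), -99.0)
>
>     def segment(words):
>         # A state is the last two tokens of the stream with <s> events
>         # spliced in.  A trigram LM never looks further back, so at each
>         # position only (word_{i-1}, word_i) and (<s>, word_i) survive.
>         states = {("<s>", words[0]): (0.0, [])}  # state -> (score, bounds)
>         for i in range(1, len(words)):
>             new = {}
>             for (h1, h2), (score, bounds) in states.items():
>                 options = [
>                     # no boundary before words[i]
>                     ((h2, words[i]),
>                      score + logp(words[i], h1, h2), bounds),
>                     # <s> before words[i]; its history ends in (h2, <s>)
>                     (("<s>", words[i]),
>                      score + logp("<s>", h1, h2)
>                            + logp(words[i], h2, "<s>"),
>                      bounds + [i]),
>                 ]
>                 for key, sc, b in options:
>                     if key not in new or sc > new[key][0]:
>                         new[key] = (sc, b)
>             states = new
>         return max(states.values())  # (best log10 prob, boundary indices)
>
> The forward pass of the forward-backward algorithm replaces the max
> with a sum, but either way the work per position stays at two states.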
>
> Please note that this is all implemented in the "segment" tool that
> comes with SRILM.
> See http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
> http://www.speech.sri.com/projects/srilm/ for more information on SRILM.
>
> Andreas
>
>>
>> References:
>>
>> 1) Shriberg et al., 2000. Prosody-based automatic segmentation of
>> speech into sentences and topics.
>> 2) Stolcke and Shriberg, 1996. Automatic linguistic segmentation of
>> conversational speech.
>>
>> Thank you!
>>
>> Kind Regards,
>> Georgi Dzhambazov,
>>
>> Student Assistant,
>> NetMedia
>
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user


