[SRILM User List] Question about SRILM and sentence boundary detection

Andreas Stolcke stolcke at icsi.berkeley.edu
Thu Feb 2 16:53:07 PST 2012


On 2/2/2012 8:29 AM, L. Amber Wilcox-O'Hearn wrote:
> (Sorry Andreas, I meant to reply to the list):
>
> Georgi,
>
> I'm not sure if SRILM has something that does that -- i.e., holds the
> whole LM in RAM and waits for queries.  You might need something like
> that, as opposed to processing a whole file, if you want just the
> probability of the last word given the previous words, and you want to
> compare different last words depending on the results of previous
> calculations, for example.
Two SRILM solutions:

1. Start ngram -lm LM -escape "===" -counts - (reading from stdin) and put
an escape line (in this case, a line starting with "===") after every ngram
in the input (make sure the ngram words are followed by a count of "1").
This will cause ngram to dump out the conditional probability for each
ngram right away (instead of waiting for end-of-file).
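
For example (the words here are just placeholders), the stream written to
ngram's standard input would look like this, one counted ngram per line,
each followed by an escape line:

     the quick brown 1
     ===
     quick brown fox 1
     ===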

2. Directly access the network LM server protocol implemented by ngram 
-server-port.
Start the server with
         % ngram -lm LM -server-port 8888
then write ngrams to that TCP port and read back the log probs:

     % telnet localhost 8888
     my first word          <-- input (the ngram sent to the server)
     -4.6499                <-- output (log probability returned by the server)

Of course, you would do the equivalent of telnet in Perl, Python, C, or
some other language to make use of the probabilities.
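
As a rough sketch (assuming the simple line-oriented exchange shown in the
telnet example above -- one ngram per line in, one log probability per line
back -- and the example port 8888 from the server command), a Python client
could keep a single connection open so the LM is loaded only once:

     import socket

     class NgramServerClient:
         """Minimal client for a server started with
         ngram -lm LM -server-port 8888 (port number as in the example
         above)."""

         def __init__(self, host="localhost", port=8888):
             self.sock = socket.create_connection((host, port))
             self.reader = self.sock.makefile("r")

         def logprob(self, ngram):
             # ngram is a string of space-separated words, e.g. "my first word".
             # The first field of the reply line is taken as the log probability.
             self.sock.sendall((ngram + "\n").encode())
             return float(self.reader.readline().split()[0])

     # Example use (words are placeholders):
     # client = NgramServerClient()
     # print(client.logprob("my first word"))   # e.g. -4.6499

The reply format beyond the single number shown in the telnet example is an
assumption; check the actual server output before relying on the parsing.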

Andreas




>
> I have a little C/Python tool I wrote for exactly this purpose.  It's
> at https://github.com/lamber/BackOffTrigramModel
>
> It's very specific to my work at the time.  So, for example, it works
> only with trigrams, and it assumes you are using <unk>.  It performs
> all the back-off calculations for unseen trigrams.  But it looks like
> you have the same use case, so it might be useful for you.
>
> It's not well documented, but the unit tests show how it works.
>
> Amber
> --
> http://scholar.google.com/citations?user=15gGywMAAAAJ
>
> On Wed, Feb 1, 2012 at 1:44 PM, Andreas Stolcke
> <stolcke at icsi.berkeley.edu>  wrote:
>> Georgi,
>>
>> You can get the conditional probabilities for arbitrary sets of ngrams using
>>
>>      ngram -counts FILE
>>
>> Andreas
>>
>>
>> On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:
>>
>> Dear Mr. Stolcke,
>>
>> I am trying to do sentence boundary segmentation. I have an n-gram language
>> model, which I built with the SRILM toolkit. Thanks for the nice tool!
>>
>> I have the following problem.
>>
>> I am implementing the forward-backward algorithm on my own, so I need to
>> combine the n-grams of your "hidden event model" with the prosodic model.
>> Therefore, I need to get the probabilities of the individual n-grams (in my
>> case 3-grams).
>>
>> For example, for the word sequence
>> wordt_2 wordt_1 wordt wordt+1 wordt+2
>>
>> I need
>> P(<s>, wordt | wordt_2 wordt_1)
>> P(wordt | wordt_2 wordt_1)
>> P(wordt+1 | wordt_1 wordt)
>> ... and so on:
>> all possible combinations with and without <s> before each word.
>>
>>
>> What I do to get one of these is to use the following SRILM command:
>>
>> # create text for the case *wordt_2 wordt_1 <s> wordt*
>>> echo "$wordt_2 $wordt_1
>>> $wordt" > testtext2;
>>> ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk > /tmp/output;
>> and then read the line that I need from the output (e.g. line 3)
>>
>>
>>
>> OUTPUT:
>> wordt_2 wordt_1
>> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
>> p( <unk> | <unk> ...) = [2gram] 0.00343115 [ -2.46456 ]
>> p( </s> | <unk> ...) = [2gram] 0.0937662 [ -1.02795 ]
>> 1 sentences, 2 words, 0 OOVs
>> 0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4
>>
>> wordt
>> p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
>> p( </s> | <unk> ...) = [2gram] 0.10582 [ -0.975432 ]
>> 1 sentences, 1 words, 0 OOVs
>> 0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59
>>
>> file testtext2: 2 sentences, 3 words, 0 OOVs
>> 0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
>> --------------------------------
>>
>>
>>
>> The problem is that for each trigram I call ngram again and it reloads the
>> LM (> 1 GB), which makes it very slow.
>> Is there a faster solution? I do not need the perplexity either.
>>
>> I know about the segmentation tool
>> http://www.speech.sri.com/projects/srilm/manpages/segment.1.html
>> but it gives results for the whole sequence, which is not my goal.
>>
>>
>>
>>
>> With kind regards,
>> Georgi Dzhambazov,
>>
>> Student Assistant,
>> NetMedia
>> ________________________________________
>> From: Andreas Stolcke [stolcke at icsi.berkeley.edu]
>> Sent: Thursday, 13 October 2011 05:50
>> To: Dzhambazov, Georgi
>> Cc: eee at speech.sri.com
>> Subject: Re: Question about sentence boundary detection paper
>>
>> Dzhambazov, Georgi wrote:
>>> Dear A. Stolcke,
>>> Dear E. Shriberg,
>>>
>>>
>>> I am interested in your approach to sentence boundary detection.
>>> I would be very happy if you could find some time to clarify some of
>>> the steps of your approach for me.
>>> I plan to implement them.
>>>
>>> Question 1)
>>> In the paper (1), in paragraph 2.2.1, you say that the states are "the
>>> end-of-sentence status of each word plus any preceding words."
>>> So, for example, at position 4 of the example sentence, the state is
>>> (<ns> + quick brown fox). At position 6 the state is (<s> + brown fox
>>> flies).
>>> This means a huge state space. Is this right?
>>>
>>> 1 2 3 4 5 6 7 8 9 10
>>>
>>> The quick brown fox flies <s> The rabbit is white.
>> The state space is potentially huge, but just like in standard N-gram
>> LMs you only consider the histories (= states) actually occurring in the
>> training data, and handle any new histories through backoff.
>> Furthermore, the state space is constrained to those that match the
>> ngrams in the word sequence. So for every word position you have to
>> consider only two states (<s> and no-<s>).
>>> Question 2)
>>> Transition probabilities are N-gram probabilities. You give an
>>> example with bigram probabilities in the next line.
>>> However, you also say you are using a 4-gram LM. So the correct
>>> example should be:
>>> the probability at position 6 is Pr(<s> | brown fox flies)
>>> and at position 4 is Pr(<ns> | quick brown fox).
>>> Is this right?
>> correct.
>>> Question 3)
>>> Then for recognition you say that the forward-backward algorithm is
>>> used to determine the maximal P(T_i | W),
>>> where T_i corresponds to <s> or <ns> at position i. However, the
>>> transition probabilities include information about states like (<ns>
>>> + quick brown fox).
>>> How do you apply the transition probabilities in this model? Does it
>>> relate to the formula in section 4 of (2)?
>>> I think this formula can work for the forward-backward algorithm,
>>> although section 4 states that it is used for Viterbi.
>> For finding the most probable T_i you use in fact the Viterbi algorithm.
>>
>> The formulas in section 4 just give one step in the forward computation
>> that would be used in the Viterbi algorithm.
>>
>> Please note that this is all implemented in the "segment" tool that
>> comes with SRILM.
>> See http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
>> http://www.speech.sri.com/projects/srilm/ for more information on SRILM.
>>
>> Andreas
>>
>>> References:
>>>
>>> 1) Shriberg et al. (2000), "Prosody-based automatic segmentation of
>>> speech into sentences and topics"
>>> 2) Stolcke and Shriberg (1996), "Automatic linguistic segmentation of
>>> conversational speech"
>>>
>>> Thank you!
>>>
>>> Kind Regards,
>>> Georgi Dzhambazov,
>>>
>>> Student Assistant,
>>> NetMedia
>>
>>
>
>


