From logproba on sentences to logproba on words

Amin Mantrach amantrac at ulb.ac.be
Wed Jan 30 06:45:09 PST 2008


Thanks Eric for your response.

The problem with doing that is that it assumes the probability is  
redistributed equally over all n-grams of a sentence. Adding 1 for a  
unigram and 1 for a bigram means the two n-grams contribute equally  
to the sentence probability, which is not true.
Maybe I should first compute the probability of each word:

1/ ngram-count corpus.txt -lm wordmodel.lm

2/ ngram -lm wordmodel.lm -ppl corpus.txt -debug 2

That way I obtain the log-probability of each word of the corpus  
(without taking OOV words into account).
Then, to take the prior probabilities of the sentences into account,  
I simply weight each word by the sum of the probabilities of the  
sentences it appears in.
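A minimal sketch of that weighting, assuming hypothetical sentence
log-probabilities (stand-ins for the values that `ngram -ppl -debug 2`
reports; SRILM log-probabilities are base 10):

```python
# Hypothetical per-sentence log10 probabilities, keyed by tokenized sentence.
sentence_logprobs = {
    ("the", "cat", "sat"): -1.2,
    ("the", "dog", "ran"): -1.5,
}

def word_weights(sentence_logprobs):
    """For each word, sum the (linear) probabilities of the sentences
    it appears in.  Converts from base-10 logs, as SRILM reports them."""
    weights = {}
    for tokens, logp in sentence_logprobs.items():
        p = 10.0 ** logp
        for word in set(tokens):   # each sentence contributes once per word
            weights[word] = weights.get(word, 0.0) + p
    return weights

weights = word_weights(sentence_logprobs)
# "the" occurs in both sentences, so its weight is the sum of both
# sentence probabilities.
```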


Do you agree with that idea?


On 29 Jan 2008, at 20:37, Joanis, Eric wrote:

> Dear Amin,
>
> I would use a variant of 2):  produce a count file, and *replace* the
> counts by the sum of probabilities of the sentences where a given
> n-gram occurs.
>
> The default way to count adds 1 for each occurrence, which makes sense
> when the distribution is assumed to be uniform over the observed data.
> With your data, you can replace these 1's by the actual probability
> figures you have.  You may have to worry about underflow issues when
> tallying small numbers, but otherwise the process should be simple
> enough.  You may also need to renormalize all the counts so that the
> smallest count is equal to 1, depending on which discounting scheme
> you use.  Not all discounting methods take float counts, so rounding
> may also be necessary.
>
> By the way, with your modified definition of the problem, I would
> probably write my own program to build the count file, and then invoke
> the SRILM utilities afterwards for building the LM from the counts.
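A sketch of such a program, under the assumptions in this thread: the
helper names are hypothetical (not SRILM code), log-probabilities are
base 10, and the output follows SRILM's count-file format (n-gram, tab,
count), which `ngram-count -read` can consume.

```python
from collections import defaultdict

def weighted_counts(sentences, order=3):
    """Accumulate n-gram counts where each occurrence contributes the
    sentence's probability instead of 1.

    sentences: iterable of (token_tuple, log10_probability) pairs.
    Returns {ngram_tuple: float_count}.
    """
    counts = defaultdict(float)
    for tokens, logp in sentences:
        p = 10.0 ** logp                          # base-10 log to probability
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += p
    return counts

def write_count_file(counts, path):
    # One n-gram per line: "w1 ... wN<TAB>count".
    with open(path, "w") as f:
        for ngram in sorted(counts):
            f.write(" ".join(ngram) + "\t" + str(counts[ngram]) + "\n")
```

The float counts are subject to the renormalization and rounding
caveats noted above before being handed to a discounting method.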
>
> Cheers,
>
> Eric
>
> ____________________________________________________
> Eric Joanis
> CNRC - ITI - GTLI | NRC - IIT - ILT
>
>
>> -----Original Message-----
>> From: owner-srilm-user at speech.sri.com
>> [mailto:owner-srilm-user at speech.sri.com] On Behalf Of Amin Mantrach
>> Sent: January 29, 2008 1:35 PM
>> To: srilm-user at speech.sri.com
>> Subject: Fw: From logproba on sentences to logproba on words
>>
>>
>> Apparently my question hasn't received any answers, so I'll
>> reformulate it to make it clearer.
>>
>> Actually, I want to create an LM with the command:
>>
>> # ngram-count -text textfile -lm lmfile
>>
>>
>> In the case I'm concerned with, I have the log-probability of every
>> sentence, the same values you can obtain from:
>>
>> # ngram -lm lm_file -debug 1 -ppl testfile
>>
>> What do I want to do? Create a new LM file built from the sentence
>> probabilities I have.
>>
>> Current ideas :
>>
>> 1/ Produce a text file with the sentences. Each sentence can appear
>> in the file multiple times; in fact, it will appear exactly n times,
>> where n = exp(log-proba of the sentence) * 1000, rounded to an
>> integer.
>>
>> And then simply :  ngram-count -text newtextsentences -lm new_lm
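A sketch of idea 1 as a hypothetical helper. It uses exp() to match the
formula above, which assumes natural-log probabilities; with SRILM's
base-10 logs the conversion would be 10 ** logp instead.

```python
import math

def replicate_sentences(sentences, scale=1000):
    """Emit each sentence n times, n = round(exp(logprob) * scale), so
    that integer repetition counts approximate the probabilities.

    sentences: iterable of (token_tuple, natural_log_probability) pairs.
    """
    lines = []
    for tokens, logp in sentences:
        n = round(math.exp(logp) * scale)
        lines.extend([" ".join(tokens)] * n)
    return lines
```

The resulting lines, written to a file, would then be the input to
`ngram-count -text`.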
>>
>> 2/ Produce a count file (with only the counts needed, i.e. of the
>> highest order, etc.), and for each n-gram multiply the number of
>> occurrences by the sum of the probabilities of the sentences it
>> belongs to.
>> This method is clearly not fair.
>>
>>
>> Can you tell me whether either of these ideas is correct? If not,
>> how should I proceed?
>>
>>
>> I hope the question is now clear enough.
>>
>> Thanks a lot for your help.
>> Amin.
>>
>>
>>





More information about the SRILM-User mailing list