[SRILM User List] Question about -prune-lowprobs and -text-has-weights

Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Aug 8 11:57:27 PDT 2012

On 8/8/2012 3:31 AM, Meng Chen wrote:
> Hi, the *-prune-lowprobs* option in *ngram* will "prune N-gram 
> probabilities that are lower than the corresponding backed-off 
> estimates". This option would be useful especially when the 
> back-off weight (bow) value is positive. However, I want to ask if I 
> could simply replace the positive bow value with 0 instead of using 
> -prune-lowprobs. Are there any differences? Or is simply replacing 
> it not correct?
It's not correct.  If you modify the backoff weight you end up with an 
LM that is no longer normalized (word probs for a given context don't 
sum to 1).
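
Here is a toy illustration (not SRILM code; the probabilities are made up) of why overwriting a positive log-bow with 0, i.e. forcing the linear backoff weight to 1, denormalizes the model:

```python
# Toy example: vocabulary {a, b, c}; context "x" has one explicit
# bigram p(a|x), while b and c back off to their unigram probabilities.
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}
p_a_given_x = 0.2                     # explicit bigram probability
backoff_mass = 1.0 - p_a_given_x      # 0.8 left for the backed-off words

# The proper bow(x) rescales the unigram mass of the backed-off words
# so the distribution over {a, b, c} sums to 1.  Here bow = 1.6 > 1,
# i.e. a positive value in log10 (the case the question asks about).
bow = backoff_mass / (unigram["b"] + unigram["c"])

def total(bow_value):
    """Total probability mass in context x for a given backoff weight."""
    return p_a_given_x + bow_value * (unigram["b"] + unigram["c"])

print(total(bow))   # 1.0  -> correctly normalized
print(total(1.0))   # 0.7  -> log-bow forced to 0: mass no longer sums to 1
```

-prune-lowprobs instead removes the offending explicit N-gram probabilities and recomputes the backoff weights, so the result stays a proper distribution.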
> Another question:
> When training an LM, we can use the *-text-has-weights* option for a 
> corpus with sentence frequencies. I want to ask what we should do with 
> *duplicated sentences* in a large corpus. Should I delete the 
> duplicated sentences? Or should I compute the sentence frequencies 
> first and use the -text-has-weights option instead? Or do nothing, 
> and just use the whole corpus for training?
You can do either.  Having a duplicated sentence

1.0 a b c
1.0 a b c

is equivalent to having the sentence once with the weights added:

2.0 a b c
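
A quick sketch (not SRILM internals) of why the two inputs are interchangeable: sentence weights simply act as fractional counts, so two copies at weight 1.0 accumulate the same counts as one copy at weight 2.0.

```python
# Sketch: accumulate weighted word counts from -text-has-weights input,
# where the first field of each line is the sentence weight.
from collections import Counter

def weighted_counts(lines):
    counts = Counter()
    for line in lines:
        fields = line.split()
        weight, words = float(fields[0]), fields[1:]
        for w in words:
            counts[w] += weight
    return counts

duplicated = weighted_counts(["1.0 a b c", "1.0 a b c"])
merged = weighted_counts(["2.0 a b c"])
print(duplicated == merged)   # True: identical counts either way
```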
