[SRILM User List] Question about -prune-lowprobs and -text-has-weights
Andreas Stolcke
stolcke at icsi.berkeley.edu
Wed Aug 8 11:57:27 PDT 2012
On 8/8/2012 3:31 AM, Meng Chen wrote:
> Hi, the *-prune-lowprobs* option in *ngram* will "prune N-gram
> probabilities that are lower than the corresponding backed-off
> estimates". This option would be especially useful when the
> back-off weight (bow) value is positive. However, I want to ask if I
> could simply replace the positive bow value with 0 instead of using
> -prune-lowprobs. Are there any differences? Or is simply replacing it
> not correct?
It's not correct. If you modify the backoff weight, you end up with an
LM that is no longer normalized (the word probabilities for a given
context don't sum to 1).
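To make the normalization point concrete, here is the standard backoff
constraint as a LaTeX sketch (alpha(h) for the bow and S(h) for the set of
words with explicit N-gram entries for history h are just illustrative
symbols, not anything SRILM-specific):

    \sum_w P(w \mid h)
        = \sum_{w \in S(h)} P^*(w \mid h)
          + \alpha(h) \sum_{w \notin S(h)} P(w \mid h')
        = 1

    % Solving for the bow:
    \alpha(h) = \frac{1 - \sum_{w \in S(h)} P^*(w \mid h)}
                     {1 - \sum_{w \in S(h)} P(w \mid h')}

ARPA files store log10 bows, so replacing a positive log bow with 0 amounts
to forcing alpha(h) = 1. The constraint above then no longer holds, and the
word probabilities for that context sum to something other than 1.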
>
> Another question:
> When training an LM, we can use the *-text-has-weights* option for a
> corpus with sentence frequencies. I want to ask what we should do with
> *duplicated sentences* in a large corpus. Should I delete the
> duplicated sentences? Or should I compute the sentence frequencies
> first and use the -text-has-weights option instead? Or do nothing and
> just throw the whole corpus into training?
You can do either. Having a duplicated sentence
1.0 a b c
1.0 a b c
is equivalent to having the sentence once with the weights added up:
2.0 a b c
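If you want to compute the frequencies yourself before training, here is a
minimal Python sketch of that step (one sentence per line; the file names
corpus.txt and corpus.weighted.txt are just placeholders):

    from collections import Counter

    # Count how many times each distinct sentence occurs in the corpus.
    with open("corpus.txt") as f:
        counts = Counter(line.strip() for line in f if line.strip())

    # Write "weight sentence" lines in the format read by -text-has-weights.
    with open("corpus.weighted.txt", "w") as out:
        for sentence, count in counts.items():
            out.write(f"{float(count)} {sentence}\n")

The resulting file can then be given to ngram-count together with
-text-has-weights.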
Andreas