[SRILM User List] Fwd: Fwd: ngram-count

Fri Jan 15 11:07:54 PST 2010

Hi people.
1. Perhaps the LM is strange because of the filtering options: in the
training corpus every sentence already begins with <s> and ends with </s>,
so that may be the cause.
2. The training corpus has 224884192 words.
3. N-gram counts in the LM:
reading 2534558 1-grams
reading 5070525 2-grams
reading 514318 3-grams
4. What exactly do you suspect is wrong with the training data?
5. I am working on a translation system, and I want to know whether it makes
sense for a word to get zeroprob (prob = 0) just because it does not occur
in the training corpus but does occur in the test corpus, and whether the
-unk option of the ngram-count command solves this problem.
6. If the -unk option and the discounting methods do not solve this problem,
how can I solve it?
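
To make question 5 concrete, here is a minimal Python sketch of the general idea behind an unknown-word token. It is a toy unigram model, not SRILM's implementation: the function names, the tiny corpus, and the chosen vocabulary are all made up for illustration. Without an <unk> entry, a word unseen in training gets probability 0; when training words outside a fixed vocabulary are mapped to <unk>, unseen test words can share that token's probability mass.

```python
from collections import Counter

UNK = "<unk>"  # unknown-word token (same spelling SRILM uses)


def train_unigram(sentences, vocab=None):
    """Maximum-likelihood unigram model (toy example, no smoothing).

    If `vocab` is given, training words outside it are counted as <unk>.
    """
    counts = Counter()
    for sentence in sentences:
        for word in sentence.split():
            if vocab is not None and word not in vocab:
                word = UNK
            counts[word] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def prob(model, word):
    """Look up a word; fall back to <unk> if the model has one, else 0."""
    if word in model:
        return model[word]
    return model.get(UNK, 0.0)


# Hypothetical training data, for illustration only.
train = ["the cat sat", "the dog sat", "a bird flew"]

closed = train_unigram(train)  # no <unk> in the model
open_vocab = train_unigram(train, vocab={"the", "cat", "dog", "sat"})

print(prob(closed, "fish"))      # unseen word, no <unk>: probability 0
print(prob(open_vocab, "fish"))  # unseen word shares the <unk> mass
```

In this sketch the <unk> probability is nonzero only because some training tokens were mapped to <unk>. Whether the same holds for a given ngram-count setup depends on the options used, so this should be read as an illustration of the concept rather than a description of the tool.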

Best Regards,
Manuel.

On Thu, Jan 14, 2010 at 6:01 PM, Andreas Stolcke <stolcke at speech.sri.com> wrote:

> On 1/14/2010 8:49 AM, Manuel Alves wrote:
>
>>    p( </s> | . ...)     =  0.999997 [ -1.32346e-06 ]
>>
>
> You have a very strange LM since almost all the probability mass in your LM
> is on the end-of-sentence tag.
> How many words are in your training corpus?
> How many unigrams, bigrams, and trigrams are in your LM?
> I suspect some basic problem with the preparation of your training data.
>
> Andreas
>
>