[SRILM User List] Fwd: Fwd: ngram-count
Andreas Stolcke
stolcke at speech.sri.com
Mon Jan 18 10:21:18 PST 2010
On 1/15/2010 11:07 AM, Manuel Alves wrote:
> Hi people.
> 1. The LM may be strange because of the filtering options, since in
> the training corpus the sentences begin with <s> and end with </s>;
> perhaps it is because of this.
I'm not sure what filtering options you are referring to, but having <s>
and </s> around every sentence is not a problem.
If you don't put them in yourself, ngram-count will add them, so it
doesn't make a difference.
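A quick way to see this for yourself (toy input; "-" means
stdin/stdout, which SRILM tools accept for file arguments):

    echo "the cat sat" | ngram-count -order 2 -text - -write -

The counts written out will include <s> and </s> n-grams even though
the input contains no tags.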
> 2. The training corpus has 224884192 words.
> 3.
> reading 2534558 1-grams
> reading 5070525 2-grams
> reading 514318 3-grams
You have a good-sized corpus, but also a huge vocabulary, so no wonder
you get some OOVs (i.e., the number of unique words seems to grow fast
as a function of text length).
You might be able to reduce your vocabulary by mapping all words to
lower-case, or by other text conditioning steps, like eliminating
sources that might contain non-textual data (e.g., tables, numbers) or
misspellings.
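For the lower-casing step, ngram-count has a -tolower option; a minimal
sketch (file names here are just placeholders):

    ngram-count -order 3 -tolower -text corpus.txt -lm lm.gz

The other conditioning steps (filtering out tables, numbers,
misspellings) you would do in your own preprocessing before the text
reaches ngram-count.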
> 4. What do you suspect in the training data?
I'm not sure what you mean here.
> 5. I am working on a translation system and I want to know whether it
> makes sense for a word to have zero probability (prob=0) just because
> the word does not exist in the training corpus but does exist in the
> test corpus, and whether the -unk option in the ngram-count command
> solves the problem.
In that case you really want to use -unk in both training and test.
This will assign some non-zero probability to previously unseen words.
However, you need to take steps to ensure that the training corpus
contains words NOT in your vocabulary, so actual instances of <unk>
occur for estimation purposes. Please read the items relating to
open-vocabulary LM in the FAQ.
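The usual recipe is something like the following sketch (the frequency
cutoff of 2 and the discounting options are arbitrary choices, and file
names are placeholders):

    # unigram counts, one "word count" pair per line
    ngram-count -order 1 -text train.txt -write counts.1

    # keep words seen at least twice; the rest will map to <unk>
    awk '$2 >= 2 { print $1 }' counts.1 > vocab.txt

    # train an open-vocabulary LM: OOV training tokens become <unk>
    ngram-count -order 3 -text train.txt -vocab vocab.txt -unk \
        -kndiscount -interpolate -lm lm.gz

    # and use -unk at test time as well
    ngram -unk -lm lm.gz -ppl test.txt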
> 6. If the -unk option and the discounting methods do not solve this
> problem, how do I solve it?
A good sanity check is to compute the perplexity of (a sample of) your
training data. This should be much lower than your test-set
perplexity. If it is not, you have a problem in your LM training and/or
test procedure. If the training perplexity is low but the test
perplexity is high, then your test data is simply poorly matched to
your training data.
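For example (assuming an LM trained with -unk as above; file names are
placeholders):

    ngram -unk -lm lm.gz -ppl train-sample.txt
    ngram -unk -lm lm.gz -ppl test.txt

ngram -ppl also reports the number of OOVs, which tells you directly
how often you are hitting unseen words.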
Andreas
>
>
> Best Regards,
> Manuel.
>
>
>
> On Thu, Jan 14, 2010 at 6:01 PM, Andreas Stolcke
> <stolcke at speech.sri.com> wrote:
>
> On 1/14/2010 8:49 AM, Manuel Alves wrote:
>
> p( </s> | . ...) = 0.999997 [ -1.32346e-06 ]
>
>
> You have a very strange LM, since almost all of the probability mass
> is on the end-of-sentence tag.
> How many words are in your training corpus?
> How many unigrams, bigrams, and trigrams are in your LM?
> I suspect some basic problem with the preparation of your training data.
>
> Andreas
>
>