beginning/end of sentence tags

Andreas Stolcke stolcke at speech.sri.com
Thu Apr 24 15:31:07 PDT 2008


In message <20080423225348.cohg3783y8k0g8oc at webmail.zcu.cz> you wrote:
> Dmitriy,
>   you can use the "continuous-ngram-count" script to generate counts  
> not containing sentence boundary tags. It can be combined with  
> ngram-count, such as
> 
> 'continuous-ngram-count order=3 train.txt | ngram-count -read - -lm lm3gram'
> 
> Best,
>   Jachym

Jachym is right, and you can use a similar approach for testing 
the LM (using continuous-ngram-count and ngram -counts).

I suspect that Dmitriy wants to preserve sentences as units,
and just needs to avoid <s> and </s> being added automatically.
This is also possible, by counting the ngrams first, and then filtering 
out those that have the start/end tags.
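
As a rough illustration of the filtering step, here is a minimal Python
sketch. It assumes the counts were written as text, one N-gram per line,
with the words separated by whitespace and the count as the last field.
The function name drop_boundary_ngrams and the sample data are made up
for this example and are not part of SRILM.

```python
# Sketch only: drop count lines whose N-gram contains <s> or </s>.
# Assumed line format: "w1 w2 ... wN<whitespace>count" (count is the last field).

BOUNDARY_TAGS = {"<s>", "</s>"}


def drop_boundary_ngrams(lines, tags=BOUNDARY_TAGS):
    """Yield the count lines whose N-gram contains none of the given tags."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Blank or malformed line (no word/count pair): skip it.
            continue
        words = fields[:-1]  # everything except the trailing count
        if any(word in tags for word in words):
            continue
        yield line


# Hypothetical sample counts, for illustration only.
sample = [
    "<s> the\t3\n",
    "the cat\t2\n",
    "cat </s>\t1\n",
    "cat sat\t4\n",
]
kept = list(drop_boundary_ngrams(sample))
print("".join(kept), end="")
```

The surviving lines could then be handed to ngram-count -read, as in
Jachym's example above, so that sentences stay intact as units but no
boundary N-grams enter the model.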

However, it is much easier to use the latest beta version of SRILM
(now on the web server) that has the options 

	-no-sos
	-no-eos

for ngram and ngram-count.

Andreas 

> 
> Quoting Dmitriy Dligach <Dmitriy.Dligach at colorado.edu>:
> 
> > Andreas,
> >
> > First of all I wanted to thank you for your SRILM toolkit; I find it
> > extremely useful in my research!
> >
> > Also, I had a question about the beginning/end of sentence tags:
> >
> > I need to compute probabilities of strings that are *not* complete
> > sentences. My understanding is both 'ngram-count' and 'ngram' tools
> > automatically add these tags if they are not explicitly present.
> >
> > Is there any way to prevent the 'ngram' tool from doing so?
> >
> > Perhaps the '-limit-vocab' option can somehow help by specifying all
> > words in the vocabulary except for the <s> and </s>?
> >
> > Thanks,
> >
> >
> > Dima
> 
> 
> 



