-tagged option?

Andreas Stolcke stolcke at speech.sri.com
Fri Jun 24 16:23:30 PDT 2005


In message <200505172045.40590.gemma.boleda at upf.edu> you wrote:
> Hi,
> 
> I am using the -tagged option for ngram-count and I am experiencing 2 
> problems:
> 
> a) the slash is taken into account in the ngram counts: taking as input
> "la/DT nena/N5 és/V maca/JQ ./PT", the bigrams look as follows:
> 
> <s> la	1
> <s> /DT	1
> la nena	1
> nena és	1
> és maca	1
> /N5 és	1
> /N5 /V	1
> /V maca	1
> /V /JQ	1
> /DT nena	1
> /DT /N5	1
> maca .	1
> /JQ .	1
> /JQ /PT	1
> . </s>	1
> /PT </s>	1
> 
> Why is the slash considered as part of the tag?

The / in front of a token signifies that it is a tag, as opposed to a
word.  It is just a way to encode word/tag pairs, as well as
words and tags individually, without ambiguity.
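
To illustrate the encoding, here is a minimal Python sketch.  It assumes
a token is split at its last slash; that is an assumption made for this
example, not a description of how ngram-count parses its input, and the
function name split_tagged is hypothetical.

```python
def split_tagged(token):
    """Split 'word/TAG' into ('word', '/TAG').

    The tag keeps its leading slash, so it cannot be confused with a
    word that happens to have the same spelling.
    """
    # Assumption: the tag is whatever follows the LAST slash in the token.
    word, slash, tag = token.rpartition("/")
    if not slash:
        # No slash at all: treat the token as an untagged word.
        return token, None
    return word, "/" + tag
```

With this convention "la/DT" yields the word "la" and the tag token
"/DT", which matches the two kinds of entries in the counts above.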

> 
> b) as can be seen in the example, the n-grams with tags are only built 
> left-to-right, e.g. there is no bigram "la /N5", as I would have expected 
> (and needed).

The program collects only those N-gram statistics that are required 
by the underlying model.  Since the goal is to use the tags in backoff,
the statistics needed are asymmetrical, which is why the counts include
"/DT nena" but not "la /N5".

If you want a different set of N-grams, you can probably write a simple
Perl script to do the job.
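
As a rough sketch of such a script (written in Python here rather than
Perl), the following counts every word and tag combination for adjacent
tokens, including the word-then-tag pairs.  It is a standalone example,
not part of SRILM: the function names are hypothetical, tokens are
assumed to split at their last slash, and the output merely imitates the
"ngram<TAB>count" layout shown in the question.

```python
from collections import Counter


def tagged_bigrams(sentence):
    """Count all word/tag bigram combinations in one word/TAG sentence."""
    # Each unit holds the forms a position can take: (word, /TAG) for a
    # tagged token, or just (token,) when no slash is present.
    units = [("<s>",)]
    for token in sentence.split():
        # Assumption: the tag is whatever follows the last slash.
        word, slash, tag = token.rpartition("/")
        units.append((word, "/" + tag) if slash else (token,))
    units.append(("</s>",))

    counts = Counter()
    # Pair every form of a position with every form of the next one.
    for left, right in zip(units, units[1:]):
        for a in left:
            for b in right:
                counts[(a, b)] += 1
    return counts


def format_counts(counts):
    """Render counts as 'w1 w2<TAB>count' lines."""
    return [f"{a} {b}\t{n}" for (a, b), n in counts.items()]
```

For the example sentence this produces the 16 bigrams listed in the
question plus the four word-then-tag pairs, such as "la /N5".  Counts
from several sentences can be merged with Counter.update.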

--Andreas 




More information about the SRILM-User mailing list