-tagged option?
Andreas Stolcke
stolcke at speech.sri.com
Fri Jun 24 16:23:30 PDT 2005
In message <200505172045.40590.gemma.boleda at upf.edu> you wrote:
> Hi,
>
> I am using the -tagged option for ngram-count and I am experiencing 2
> problems:
>
> a) the slash is taken into account in the ngram counts: taking as input
> "la/DT nena/N5 és/V maca/JQ ./PT", the bigrams look as follows:
>
> <s> la 1
> <s> /DT 1
> la nena 1
> nena és 1
> és maca 1
> /N5 és 1
> /N5 /V 1
> /V maca 1
> /V /JQ 1
> /DT nena 1
> /DT /N5 1
> maca . 1
> /JQ . 1
> /JQ /PT 1
> . </s> 1
> /PT </s> 1
>
> Why is the slash considered as part of the tag?
The / in front of a token signifies that it's a tag, as opposed to a
word. It's just a way to encode word/tag pairs, as well as
words and tags individually, without ambiguity.
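
To make the encoding concrete, here is a small Python illustration (not
SRILM code; the function name split_tagged is made up for this example,
and splitting at the last slash is an assumption). It separates each
word/tag token into the word and the tag, keeping the leading slash on
the tag so the two can never be confused:

```python
def split_tagged(token):
    """Split 'word/TAG' into ('word', '/TAG').

    The tag keeps its leading slash, which is what marks it as a tag.
    A token without a slash is returned as a word with no tag.
    """
    word, sep, tag = token.rpartition("/")
    if not sep:
        return token, None
    return word, "/" + tag


sentence = "la/DT nena/N5 és/V maca/JQ ./PT"
pairs = [split_tagged(tok) for tok in sentence.split()]

for word, tag in pairs:
    print(word, tag)
```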
>
> b) as can be seen in the example, the n-grams with tags are only built
> left-to-right, e.g. there is no bigram "la /N5", as I would have expected
> (and needed).
The program collects only those N-gram statistics that are required
by the underlying model. Since the goal is to use the tags in backoff,
the statistics needed are asymmetrical.
If you want a different set of N-grams you can probably write a simple
Perl script to do the job.
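
A minimal sketch of such a script, in Python rather than Perl. It is not
part of SRILM, and the names split_tagged and all_bigrams are invented
for this example. It assumes input tokens of the form word/TAG, splits
at the last slash, and pads each sentence with <s> and </s>. For every
pair of adjacent positions it counts all combinations of word and tag,
so bigrams like "la /N5" are included:

```python
from collections import Counter
from itertools import product


def split_tagged(token):
    """Return the units at one position: (word, '/TAG'), or (word,) if untagged."""
    word, sep, tag = token.rpartition("/")
    if not sep:
        return (token,)
    return (word, "/" + tag)


def all_bigrams(sentence):
    """Count every word/tag combination over adjacent positions."""
    # Sentence boundaries are single units without a tag (an assumption).
    units = [("<s>",)] + [split_tagged(t) for t in sentence.split()] + [("</s>",)]
    counts = Counter()
    for left, right in zip(units, units[1:]):
        for a, b in product(left, right):
            counts[(a, b)] += 1
    return counts


counts = all_bigrams("la/DT nena/N5 és/V maca/JQ ./PT")
for (a, b), n in counts.items():
    print(a, b, n)
```

On the example sentence this gives 20 bigrams: the 16 listed above plus
"la /N5", "nena /V", "és /JQ" and "maca /PT".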
--Andreas