Question about hidden-ngram

Fri Nov 21 09:23:35 PST 2003

In message <009601c3b033$e26f9510$fcabca18 at Beige> you wrote:
> Try the flag -force-event for hidden-ngram:
> 
> hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous -posteriors -force-event
> 

The -force-event flag is only appropriate if you encode the absence
of punctuation by a special tag, too.

I suspect the problem is in the training of the LM.
Your training data sample has a single sentence split across 3 lines.
Yet the standard behavior of ngram-count is that each line represents
one sentence, so the <s> and </s> tags are added on each line.

What you need to do to match the hidden-ngram -continuous way of
running the LM is to train it on a continuous stream of tokens,
without <s> and </s> at the line breaks. You can do that like this:

continuous-ngram-count order=3 trainingtext | \
ngram-count -read - -write-vocab vocabulary -tolower -write output -lm lmfile

The continuous-ngram-count script is documented in the training-scripts(1) 
man page. It generates counts that ignore line breaks.
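
To make the difference concrete, here is a rough Python sketch of the two
counting conventions. It is only an illustration, not the actual SRILM code:
the function names are made up, and the line breaks in the sample text are
assumed for the example.

```python
from collections import Counter


def sentence_ngrams(lines, order=3):
    """Count n-grams treating each line as one sentence (adds <s> and </s>)."""
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def continuous_ngrams(lines, order=3):
    """Count n-grams over a single token stream, ignoring line breaks."""
    tokens = [tok for line in lines for tok in line.split()]
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Sample text from the original question; the line breaks are assumed.
training = [
    "for more than a century <COM> the fingerprint has been the quintessential piece",
    "of crime scene evidence <PER>",
    "but now the palm is getting its due <PER>",
]

per_line = sentence_ngrams(training)
stream = continuous_ngrams(training)

# Per-line counting sees "piece </s>" and never sees "piece of".
print(per_line[("piece", "</s>")], per_line[("piece", "of")])
# Continuous counting sees "piece of" and contains no <s> or </s> at all.
print(stream[("piece", "of")], any("<s>" in ng for ng in stream))
```

With per-line counting the model learns that words at the end of a line are
followed by </s>, which never happens in a continuous test stream. That
mismatch is the likely reason the model ends up favoring *noevent*.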

Hope this solves your problem.  I should note that using a word-based LM
for punctuation restoration is probably not going to work very well,
unless your vocabulary is small and/or you have tons of training data.
A class-based LM, or an interpolated word/class LM, should do better.

--Andreas 

> Carmen
> 
> 
> ----- Original Message ----- 
> From: "Jachym Kolar" <jachym at kky.zcu.cz>
> To: <srilm-user at speech.sri.com>
> Sent: Friday, November 21, 2003 7:15 AM
> Subject: Question about hidden-ngram
> 
> 
> > Hi,
> >  I've just tried the hidden-ngram tool to automatically punctuate an
> > unpunctuated text. But I got some unexpected results - every word was
> > tagged with *noevent*.
> >
> > I've used a training text in the following form:
> >
> > ...
> > for more than a century <COM> the fingerprint has been the quintessential
> piece
> > of crime scene evidence <PER>
> > but now the palm is getting its due <PER>
> > ...
> >
> > Then I trained a 3-gram model with:
> >
> > ngram-count -write-vocab vocabulary -tolower -text trainingtext -write output -lm lmfile
> >
> > ... and then I used the hidden-ngram tool with the following options:
> >
> > hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous -posteriors
> >
> > ... and received something like this:
> >
> > 6        *noevent* 0.998811 <com> 0.00117427 <per> 1.46659e-05 <qm> 7.92597e-10
> > měsíců   *noevent* 0.999898 <com> 9.326e-05 <per> 9.07804e-06 <qm> 4.61643e-10
> > do       *noevent* 1 <com> 4.19776e-09 <per> 5.76912e-09 <qm> 6.25918e-12
> > jednoho  *noevent* 0.999998 <com> 4.18691e-07 <per> 1.24419e-06 <qm> 8.63805e-11
> > roku     *noevent* 0.197671 <com> 0.801881 <per> 0.000340206 <qm> 0.000107651
> > jak      *noevent* 0.99997 <com> 2.44243e-05 <per> 1.32587e-06 <qm> 4.09674e-06
> > je       *noevent* 0.999857 <com> 0.000142836 <per> 2.47722e-07 <qm> 2.47757e-07
> > to       *noevent* 0.972235 <com> 0.0266202 <per> 0.000937748 <qm> 0.000206936
> > <unk>    *noevent* 0.979455 <com> 0.0205446 <per> 2.70218e-07 <qm> 1.33261e-07
> > uvedeno  *noevent* 0.933133 <com> 0.0538742 <per> 0.0129924 <qm> 6.16205e-08
> > na       *noevent* 0.999965 <com> 4.71218e-07 <per> 3.39777e-05 <qm> 1.57228e-07
> > výrobku  *noevent* 0.736376 <com> 0.168451 <per> 0.0947272 <qm> 0.00044499
> >
> > Please, can somebody tell me what I did wrong? And is there a tool in SRILM
> > to obtain a text-map from the training text?
> >
> > Thanks Jachym
> >
> >
> >
> 
>