Question about hidden-ngram

Yang Liu yangl at ecn.purdue.edu
Fri Nov 21 09:43:38 PST 2003


I guess Andreas has probably answered some of your questions; I just want to
add something that he skipped.

Here is one line from your output:
roku     *noevent* 0.197671 <com> 0.801881 <per> 0.000340206 <qm> 0.000107651
which means that at that interword boundary a comma is the most likely
punctuation (the numbers are the posterior probabilities for each tag),
yet you said that you got *noevent* at every location.
Your output looks okay to me.
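
By the way, if you want to turn the -posteriors output into actual
punctuation decisions, a rough, untested gawk sketch that picks the
highest-posterior tag at each word boundary could look like this (it assumes
the word/tag/probability layout shown above; 'posteriors.out' is just a
placeholder for wherever you saved the output):

gawk '{
    best = $2; bestp = $3 + 0              # start with the first tag/posterior pair
    for (i = 4; i < NF; i += 2)            # scan the remaining tag/posterior pairs
        if ($(i+1) + 0 > bestp) { best = $i; bestp = $(i+1) + 0 }
    if (best == "*noevent*")
        print $1                           # no punctuation after this word
    else
        print $1, best                     # append the winning punctuation tag
}' posteriors.out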

--Yang




On Fri, 21 Nov 2003, Andreas Stolcke wrote:

>
>
> In message <009601c3b033$e26f9510$fcabca18 at Beige> you wrote:
> > Try the flag -force-event for hidden-ngram:
> >
> > hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous -posteriors -force-event
> >
>
> The -force-event flag is only appropriate if you encode the absence
> of punctuation by a special tag, too.
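>
> For illustration only (the tag name <NONE> is made up): with -force-event
> you would also need an explicit no-punctuation tag in the hidden vocabulary,
> and the LM training text would then have to carry that tag at every word
> boundary where no punctuation occurs, e.g.
>
> but <NONE> now <NONE> the <NONE> palm <NONE> is <NONE> getting <NONE> its <NONE> due <PER>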
>
> I suspect the problem is in the training of the LM.
> Your training data sample has a single sentence split across 3 lines.
> Yet the standard behavior of ngram-count is that each line represents
> one sentence, so the <s> and </s> tags are added on each line.
>
> To match the way hidden-ngram -continuous runs the LM, you need to train
> an LM on a continuous stream of tokens, without <s> and </s> at the line
> breaks. You can do that like this:
>
> continuous-ngram-count order=3 trainingtext | \
> ngram-count -read - -write-vocab vocabulary -tolower -write output -lm lmfile
>
> The continuous-ngram-count script is documented in the training-scripts(1)
> man page. It generates counts that ignore line breaks.
>
> Hope this solves your problem.  I should note that using a word-based LM
> for punctuation restoration is probably not going to work very well,
> unless your vocabulary is small and/or you have tons of training data.
> A class-based LM, or an interpolated word/class LM should do better.
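>
> In case it is useful, here is a rough, untested sketch of how one might
> build such a class-based LM with the SRILM tools (100 automatically induced
> classes is an arbitrary choice, and the file names are placeholders):
>
> ngram-class -vocab vocabulary -text trainingtext -numclasses 100 \
>     -classes classes.defs
> replace-words-with-classes classes=classes.defs trainingtext > train.classes
> continuous-ngram-count order=3 train.classes | \
>     ngram-count -read - -lm class.lmfile
>
> The class definitions in classes.defs then have to be given to whatever tool
> evaluates the LM (ngram takes them via -classes; check whether your
> hidden-ngram version accepts the same option), and you probably want to keep
> the punctuation tags from being merged into the word classes.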
>
> --Andreas
>
>
> > Carmen
> >
> >
> > ----- Original Message -----
> > From: "Jachym Kolar" <jachym at kky.zcu.cz>
> > To: <srilm-user at speech.sri.com>
> > Sent: Friday, November 21, 2003 7:15 AM
> > Subject: Question about hidden-ngram
> >
> >
> > > Hi,
> > >  I've just tried the hidden-ngram tool to automatically punctuate an
> > > unpunctuated text, but I got some unexpected results: every word was
> > > tagged with *noevent*.
> > >
> > > I used a training text of the following form:
> > >
> > > ...
> > > for more than a century <COM> the fingerprint has been the quintessential piece of crime scene evidence <PER>
> > > but now the palm is getting its due <PER>
> > > ...
> > >
> > > Then I trained a 3-gram model with:
> > >
> > > ngram-count -write-vocab vocabulary -tolower -text trainingtext -write output -lm lmfile
> > >
> > > ... and then I used the hidden-ngram tool with the following options:
> > >
> > > hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous -posteriors
> > >
> > > ... and received something like this:
> > >
> > > 6        *noevent* 0.998811 <com> 0.00117427 <per> 1.46659e-05 <qm> 7.92597e-10
> > > měsíců   *noevent* 0.999898 <com> 9.326e-05 <per> 9.07804e-06 <qm> 4.61643e-10
> > > do       *noevent* 1 <com> 4.19776e-09 <per> 5.76912e-09 <qm> 6.25918e-12
> > > jednoho  *noevent* 0.999998 <com> 4.18691e-07 <per> 1.24419e-06 <qm> 8.63805e-11
> > > roku     *noevent* 0.197671 <com> 0.801881 <per> 0.000340206 <qm> 0.000107651
> > > jak      *noevent* 0.99997 <com> 2.44243e-05 <per> 1.32587e-06 <qm> 4.09674e-06
> > > je       *noevent* 0.999857 <com> 0.000142836 <per> 2.47722e-07 <qm> 2.47757e-07
> > > to       *noevent* 0.972235 <com> 0.0266202 <per> 0.000937748 <qm> 0.000206936
> > > <unk>    *noevent* 0.979455 <com> 0.0205446 <per> 2.70218e-07 <qm> 1.33261e-07
> > > uvedeno  *noevent* 0.933133 <com> 0.0538742 <per> 0.0129924 <qm> 6.16205e-08
> > > na       *noevent* 0.999965 <com> 4.71218e-07 <per> 3.39777e-05 <qm> 1.57228e-07
> > > výrobku  *noevent* 0.736376 <com> 0.168451 <per> 0.0947272 <qm> 0.00044499
> > >
> > > Please, can somebody tell me what I did wrong? And is there a tool in
> > > SRILM to obtain a text-map from the training text?
> > >
> > > Thanks Jachym
> > >
> > >
> > >
> >
> >
>
>



