Question about hidden-ngram
Jachym Kolar
jachym at kky.zcu.cz
Fri Nov 21 04:15:19 PST 2003
Hi,
I've just tried the hidden-ngram tool to automatically punctuate an
unpunctuated text, but I got unexpected results: every word was tagged
with *noevent*.
I used training text in the following form:
...
for more than a century <COM> the fingerprint has been the quintessential piece of crime scene evidence <PER>
but now the palm is getting its due <PER>
...
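To make the training format above concrete, here is a minimal Python sketch (not part of SRILM) that pairs each word with the event that follows it. The tag set and the function name `split_events` are assumptions for illustration only; words with no following tag get *noevent*.

```python
# Hypothetical helper, not an SRILM tool: pair each word with the event after it.
TAGS = {"<COM>", "<PER>"}  # assumed tag inventory, taken from the sample lines


def split_events(tokens, tags=TAGS, no_event="*noevent*"):
    """Return a list of (word, event) pairs for one tokenized training line."""
    pairs = []
    for tok in tokens:
        if tok in tags:
            # Attach the tag to the preceding word, if that word has no tag yet.
            # A tag with no preceding word, or a second tag in a row, is ignored.
            if pairs and pairs[-1][1] == no_event:
                pairs[-1] = (pairs[-1][0], tok)
        else:
            pairs.append((tok, no_event))
    return pairs


if __name__ == "__main__":
    line = "but now the palm is getting its due <PER>"
    for word, event in split_events(line.split()):
        print(word, event)
```

Joining only the words from the pairs gives the unpunctuated text, and the events give the reference tags for it.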
Then I trained a 3-gram model with:
ngram-count -write-vocab vocabulary -tolower -text trainingtext -write output -lm lmfile
... and then I ran the hidden-ngram tool with the following options:
hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous -posteriors
... and received output like this:
6 *noevent* 0.998811 <com> 0.00117427 <per> 1.46659e-05 <qm> 7.92597e-10
měsíců *noevent* 0.999898 <com> 9.326e-05 <per> 9.07804e-06 <qm> 4.61643e-10
do *noevent* 1 <com> 4.19776e-09 <per> 5.76912e-09 <qm> 6.25918e-12
jednoho *noevent* 0.999998 <com> 4.18691e-07 <per> 1.24419e-06 <qm> 8.63805e-11
roku *noevent* 0.197671 <com> 0.801881 <per> 0.000340206 <qm> 0.000107651
jak *noevent* 0.99997 <com> 2.44243e-05 <per> 1.32587e-06 <qm> 4.09674e-06
je *noevent* 0.999857 <com> 0.000142836 <per> 2.47722e-07 <qm> 2.47757e-07
to *noevent* 0.972235 <com> 0.0266202 <per> 0.000937748 <qm> 0.000206936
<unk> *noevent* 0.979455 <com> 0.0205446 <per> 2.70218e-07 <qm> 1.33261e-07
uvedeno *noevent* 0.933133 <com> 0.0538742 <per> 0.0129924 <qm> 6.16205e-08
na *noevent* 0.999965 <com> 4.71218e-07 <per> 3.39777e-05 <qm> 1.57228e-07
výrobku *noevent* 0.736376 <com> 0.168451 <per> 0.0947272 <qm> 0.00044499
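For reading output of this shape, here is a minimal Python sketch (not part of SRILM) that picks the highest-posterior event per word. It assumes each line is a word followed by alternating event/posterior fields, as in the lines above; the function names are hypothetical.

```python
# Hypothetical helpers for the posterior lines shown above; not an SRILM API.

def parse_posterior_line(line):
    """Split 'word event1 p1 event2 p2 ...' into (word, {event: posterior})."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    word, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected event/posterior pairs after the word")
    posteriors = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return word, posteriors


def best_event(posteriors):
    """Return the event with the largest posterior probability."""
    return max(posteriors, key=posteriors.get)


if __name__ == "__main__":
    sample = [
        "jednoho *noevent* 0.999998 <com> 4.18691e-07 <per> 1.24419e-06 <qm> 8.63805e-11",
        "roku *noevent* 0.197671 <com> 0.801881 <per> 0.000340206 <qm> 0.000107651",
    ]
    for line in sample:
        word, post = parse_posterior_line(line)
        print(word, best_event(post))
```

On the lines shown above, this picks <com> after "roku" and *noevent* for every other word.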
Can somebody please tell me what I did wrong? Also, does SRILM include a tool
for obtaining a text-map from the training text?
Thanks,
Jachym