Problem with language-specific characters in segment
Andreas Stolcke
stolcke at speech.sri.com
Sun Oct 13 08:20:53 PDT 2002
Hi,
Sorry to hear about the problems. I think it has to do with the fact
that the locale is never set in segment.cc. Try putting
    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");
right at the beginning of main() in segment.cc. (This applies to
several other programs as well, and will be fixed in the next release.)
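
As a rough illustration of the change suggested above, the two calls can be wrapped in a small helper that is invoked first thing in main(). This is only a sketch, not the actual SRILM patch: the helper names initLocaleFromEnvironment and isAlphaByte are hypothetical, and the idea that character classification is what trips up segment is an assumption. What the C standard does guarantee is that a program starts in the "C" locale, where only the ASCII letters count as alphabetic, until setlocale() is called.

```cpp
#include <clocale>
#include <cctype>

// Hypothetical helper (not part of SRILM): adopt the character-type and
// collation settings from the environment (LANG, LC_ALL, LC_CTYPE, ...).
// Returns false if either category could not be set, e.g. because the
// requested locale is not installed on the system.
bool initLocaleFromEnvironment() {
    const char *ctype = std::setlocale(LC_CTYPE, "");
    const char *collate = std::setlocale(LC_COLLATE, "");
    return ctype != nullptr && collate != nullptr;
}

// Hypothetical probe: is this byte alphabetic under the current LC_CTYPE?
// In the default "C" locale this is true only for A-Z and a-z, so an
// ISO Latin 2 byte such as 0xE1 (a with acute) is not treated as a letter.
bool isAlphaByte(unsigned char byte) {
    return std::isalpha(byte) != 0;
}
```

Calling initLocaleFromEnvironment() as the first statement of main() mirrors the suggested change. The environment must also name a locale that matches the text encoding, otherwise the calls have no useful effect.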
BTW, the -unk option only makes sense if your LM was trained with
instances of <unk> (or with the ngram-count -unk option). Otherwise
unknown words will get zero probability either way.
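
The point about -unk can be seen in a toy setting. The sketch below is not SRILM code: ToyLM and wordProb are hypothetical names, and a flat unigram table stands in for a real n-gram model. It only shows that mapping out-of-vocabulary words to <unk> helps when the model actually holds probability mass for <unk>.

```cpp
#include <map>
#include <string>

// Toy unigram "language model": word -> probability.
// A hypothetical stand-in, not SRILM's data structures.
using ToyLM = std::map<std::string, double>;

// Look up a word. If the word is out of vocabulary and mapUnknown is set,
// fall back to the "<unk>" entry. If the model has no "<unk>" entry, the
// result is 0.0 whether or not mapUnknown is set.
double wordProb(const ToyLM &lm, const std::string &word, bool mapUnknown) {
    auto it = lm.find(word);
    if (it != lm.end()) {
        return it->second;
    }
    if (mapUnknown) {
        auto unk = lm.find("<unk>");
        if (unk != lm.end()) {
            return unk->second;
        }
    }
    return 0.0;
}
```

With a model that contains no <unk> entry, an unseen word gets probability zero for either value of mapUnknown. With a model that reserves mass for <unk>, the unseen word receives that mass only when mapUnknown is true.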
--Andreas
Jáchym Kolář wrote:
> Hi to all!
> I have the following problem with the segment tool. The output of
> segment contains the <unk> token in place of words that include
> language-specific characters, although those words are stored
> correctly in the language model file and the input text file has the
> same encoding (ISO Latin 2) as the training text.
> Does anybody know what the problem is?
>
> The language model was built using:
> ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2
>
> The segment tool was run with these options:
> segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous
>
> With the -unk option disabled I get the right words in the output,
> but the posteriors are probably not correct.
>
> Jachym Kolar
> Department of Cybernetics
> University of West-Bohemia
> Pilsen, Czech Republic
>