Problem with language-specific characters in segment
stolcke at speech.sri.com
Sun Oct 13 08:20:53 PDT 2002
Sorry to hear about the problems. I think it has to do with the fact
that the locale is never set in segment.cc. Try putting a setlocale()
call right at the beginning of main() in segment.cc. (This applies to
several other programs as well, and will be fixed in the next release.)
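A minimal sketch of that kind of change, assuming the standard C
setlocale() call with LC_CTYPE is what is needed (the exact statement is
not preserved in this message). The names init_ctype_locale and
segment_main_sketch are hypothetical and are not part of SRILM; the
second one only stands in for main() in segment.cc.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper (not part of SRILM): adopt the locale named by the
// environment (LC_ALL, LC_CTYPE, LANG) for character classification.
// Returns the name of the LC_CTYPE locale in effect afterwards.
std::string init_ctype_locale()
{
    // An empty string requests the environment's locale. If that locale
    // is unavailable, setlocale() returns null and leaves the current
    // locale (by default "C") unchanged.
    std::setlocale(LC_CTYPE, "");

    // A null second argument only queries the current setting.
    const char *current = std::setlocale(LC_CTYPE, nullptr);
    return current ? current : "C";
}

// Stand-in for main() in segment.cc (assumption: the real main() has
// more to do; only the placement of the locale call is illustrated).
int segment_main_sketch()
{
    init_ctype_locale();  // first statement, before any text is read

    // ... option parsing and the rest of the program would follow ...
    return 0;
}
```

For this to help, the environment would presumably have to name a locale
that matches the encoding of the data (ISO Latin 2 here), and that
locale has to be installed on the system.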
BTW, the -unk option only makes sense if your LM was trained with
instances of <unk> (or the ngram-count -unk option). Otherwise unknown
words will get zero probability.
Jáchym Kolář wrote:
> Hi to all!
> I have the following problem with the segment tool. In the output of
> segment, the <unk> token appears instead of words that include
> language-specific characters, although they are stored correctly in
> the language model file and the input text file has the same encoding
> (ISO Latin 2) as the training text.
> Does anybody know what's the problem?
> The language model was built using:
> ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm
> Segment tool was used with option:
> segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous
> With the -unk option disabled I get the right words in the output,
> but the posteriors are probably not correct.
> Jachym Kolar
> Department of Cybernetics
> University of West-Bohemia
> Pilsen, Czech Republic