Problem with language-specific characters in segment

Andreas Stolcke stolcke at speech.sri.com
Sun Oct 13 08:20:53 PDT 2002


Hi,

Sorry to hear about the problems.  I think the cause is that the locale
is never set in segment.cc.  Try putting

    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");

right at the beginning of main() in segment.cc.  (This applies to several
other programs as well, and will be fixed in the next release.)
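
To illustrate why the locale matters here, below is a minimal sketch, not
SRILM code. It assumes the usual C library behavior: a program starts in the
"C" locale, where character classification covers only ASCII, and it adopts
the environment's locale only after calling setlocale() with an empty string.
The helper names useEnvironmentLocale and isLetterByte are hypothetical and
exist only for this example.

```cpp
#include <cctype>
#include <clocale>

// Hypothetical helper (not part of SRILM): adopt the locale named by the
// environment (LC_ALL, LC_CTYPE, LANG, ...) for character classification
// and collation. Returns false if the environment names a locale that the
// system does not provide; the previous locale then stays in effect.
bool useEnvironmentLocale()
{
    const char *ctype = std::setlocale(LC_CTYPE, "");
    const char *collate = std::setlocale(LC_COLLATE, "");
    return ctype != nullptr && collate != nullptr;
}

// Reports whether a single byte counts as a letter under the current
// LC_CTYPE setting. The cast to unsigned char matters for bytes >= 0x80:
// passing a negative char value to isalpha() is undefined behavior.
bool isLetterByte(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
```

In the default "C" locale, isLetterByte('\xE8') (the byte for "č" in
ISO Latin 2) is expected to be false. After useEnvironmentLocale() has been
called first thing in main(), with the environment set to an installed
ISO Latin 2 locale (the exact locale name is system-dependent), the same
call should return true.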

BTW, the -unk option only makes sense if your LM was trained with
instances of <unk> (or with the ngram-count -unk option).  Otherwise
unknown words will get zero probability either way.

--Andreas

Jáchym Kolář wrote:

> Hi to all!
> I have the following problem with the segment tool. In the output of
> segment, the <unk> token appears instead of words that include
> language-specific characters, although they are saved correctly in the
> language model file and the input text file has the same encoding
> (ISO Latin 2) as the training text.
> Does anybody know what the problem is?
>  
> The language model was built using:
> ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2
>  
> The segment tool was used with these options:
> segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous
>  
> With the -unk option disabled, I got the right words in the output, but
> the posteriors are probably not correct.
>  
> Jachym Kolar
> Department of Cybernetics
> University of West-Bohemia
> Pilsen, Czech Republic
>  





