Problem with language-specific characters in segment

Jáchym Kolář jachym at kky.zcu.cz
Fri Oct 11 04:47:17 PDT 2002


Hi to all!
I have the following problem with the segment tool. In its output, the <unk> token appears in place of words that contain language-specific characters, although these words are stored correctly in the language model file and the input text file uses the same encoding (ISO Latin 2) as the training text.
Does anybody know what the problem is?

The language model was built using:
ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2

The segment tool was run with the options:
segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous

With the -unk option disabled, I get the right words in the output, but the posteriors are probably not correct.
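
One way to narrow this down is to check which tokens of the test text are missing from the vocabulary file written by -write-vocab. The Python sketch below is a hypothetical diagnostic, not part of SRILM. It assumes that words are compared byte for byte and that the vocabulary file has one word at the start of each line, so it reads both files as raw bytes and makes no assumption about the encoding. The function names are made up for illustration.

```python
# Hypothetical diagnostic (not an SRILM tool): list the test tokens that do
# not occur in the vocabulary file, comparing raw bytes so that any encoding
# mismatch between the two files shows up as a missing word.

def load_vocab(path):
    """Return the set of words (as bytes) found first on each non-empty line."""
    with open(path, "rb") as f:
        return {line.split()[0] for line in f if line.strip()}


def find_oov(text_path, vocab):
    """Count the tokens in text_path that are absent from vocab."""
    oov = {}
    with open(text_path, "rb") as f:
        for line in f:
            # bytes.split() splits on ASCII whitespace only, so bytes >= 0x80
            # always stay inside a token.
            for token in line.split():
                if token not in vocab:
                    oov[token] = oov.get(token, 0) + 1
    return oov


def non_ascii_only(oov):
    """Keep only the missing tokens that contain a byte above 0x7F."""
    return {w: c for w, c in oov.items() if any(b > 127 for b in w)}
```

Calling find_oov("test3.txt", load_vocab("vocabulary")) and passing the result to non_ascii_only would list the accented words that the vocabulary does not contain in exactly that byte form. An empty result would suggest that the words are in the vocabulary and the cause lies elsewhere.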

Jachym Kolar
Department of Cybernetics
University of West-Bohemia
Pilsen, Czech Republic
