Problem with language-specific characters in segment
Jáchym Kolář
jachym at kky.zcu.cz
Fri Oct 11 04:47:17 PDT 2002
Hi to all!
I have the following problem with the segment tool. In the output of segment, the <unk> token appears in place of words that contain language-specific characters, although these words are stored correctly in the language model file and the input text file uses the same encoding (ISO Latin-2) as the training text.
Does anybody know what the problem is?
The language model was built using:
ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2
The segment tool was run with the options:
segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous
With the -unk option disabled, I get the right words in the output, but the posteriors are probably not correct.
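One way to narrow this down is to check whether the affected words in test3.txt are byte-for-byte present in the vocabulary file written by ngram-count. The following is only a minimal sketch and not part of SRILM: the function names are hypothetical, and it assumes that both files are plain whitespace-separated text in the same 8-bit encoding. It compares raw bytes, so no decoding step can hide a mismatch.

```python
def read_tokens(path):
    """Return the set of whitespace-separated tokens in a file, as raw bytes."""
    with open(path, "rb") as f:
        # bytes.split() with no argument splits on ASCII whitespace only,
        # so bytes >= 0x80 (e.g. ISO Latin-2 letters) stay inside tokens.
        return set(f.read().split())


def missing_from_vocab(vocab_path, text_path):
    """Return the sorted tokens of text_path that do not occur in vocab_path."""
    return sorted(read_tokens(text_path) - read_tokens(vocab_path))


# Example usage with the file names from this message:
#   for token in missing_from_vocab("vocabulary", "test3.txt"):
#       print(token.decode("iso-8859-2", errors="replace"))
```

If the words with language-specific characters show up in this list, the two files differ at the byte level in spite of appearing identical. If they do not show up, the mismatch is more likely to arise when segment reads the text.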
Jachym Kolar
Department of Cybernetics
University of West-Bohemia
Pilsen, Czech Republic
More information about the SRILM-User mailing list