[SRILM User List] Odd behavior in disambig and OOV words
rmr4848 at gmail.com
Tue Jan 3 15:12:11 PST 2012
For some time now I've been using disambig to perform diacritic
disambiguation of Arabic. I create an open-vocabulary LM of diacritized
forms from a training corpus, and for the input I use a morphological
analysis tool to create, for each input word, a list of possible
diacritized forms to use as the V2 mapping for the input form (V1).
disambig is then used to select one of the diacritized forms using the LM.
This works well, but recently I noticed a strange behavior. I have a small
input file (A) of about 200 lines of text. I run it through the above
process, and I get a mapped output file as expected. Then I take the input
file A and replace two words in the last line with different words
(creating input file B). I run B through the same process as A (this
results in a very slightly different map file -- but only for the two words
that were replaced).
The odd behavior is that, when I compare the output mapping of A and B, not
only is the last line different, but over 70 other words in the file (in
different sentences) also have different V2 mappings. Doing some checking,
I discover (not too surprisingly) that all the affected words are ones that
were not present in the LM, so the effect is related to how disambig
handles OOV words. Similar differences occur if I compare the mapped
output of two concatenated files to the concatenation of the two
files' mapped outputs (that is, [A+B].out =/= [A.out] + [B.out]).
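To make the comparison concrete, here is a minimal sketch of how the two outputs can be compared token by token. It is an illustration only: the function name diff_tokens is hypothetical, and it assumes both outputs are whitespace-tokenized with one sentence per line.

```python
from itertools import zip_longest


def diff_tokens(lines_a, lines_b):
    """Return (line_no, position, token_a, token_b) for every differing token.

    lines_a and lines_b are iterables of text lines, e.g. open file objects
    for the two disambig outputs. Line numbers and positions are 1-based.
    A token missing on one side is reported as None.
    """
    diffs = []
    paired_lines = zip_longest(lines_a, lines_b, fillvalue="")
    for line_no, (line_a, line_b) in enumerate(paired_lines, start=1):
        paired_tokens = zip_longest(line_a.split(), line_b.split())
        for pos, (tok_a, tok_b) in enumerate(paired_tokens, start=1):
            if tok_a != tok_b:
                diffs.append((line_no, pos, tok_a, tok_b))
    return diffs
```

Passing the opened output files for A and B to diff_tokens lists every position where the selected V2 form changed, which makes it easy to check whether the differences are confined to OOV words and to lines other than the edited one.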
I need to find a way to make sure disambig handles these words
consistently, so that changes in one part of a file do not affect the
results in a different part. I'm hoping that there is some option setting
in disambig or ngram-count that I've overlooked that will correct the
problem, but I currently don't see one.
For reference, I create my LM using the options:
ngram-count -text training-input-file -lm model-name.lm -order 5 -
and I run disambig using the options:
disambig -keep-unk -text test-file.in -map test-file.map -order 5 -lm model-name.lm > test-file.out
My test-file.map is created without conditional probabilities, and the list
of V2 forms is always alphabetized to ensure a consistent ordering. The
morphological analyzer which generates the V2 forms is always consistent,
and its output does not depend on word context.
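For illustration, a map file of this kind can be generated roughly as follows. This is a sketch, not the actual tool: the names format_map_line and build_map are hypothetical, the analyzer output is assumed to be available as a dictionary from each V1 form to its candidate V2 forms, and "alphabetized" here means Python's default code-point ordering.

```python
def format_map_line(v1, candidates):
    """Build one map line: the V1 form followed by its V2 candidates.

    Candidates are de-duplicated and sorted so the ordering is the same on
    every run. No conditional probabilities are written.
    """
    forms = sorted(set(candidates))
    return " ".join([v1] + forms)


def build_map(analyses):
    """Build the full map text from a dict of V1 form -> iterable of V2 forms.

    One line per V1 form, with the V1 forms themselves also in sorted order.
    """
    lines = [format_map_line(v1, analyses[v1]) for v1 in sorted(analyses)]
    return "\n".join(lines) + "\n"
```

Because both the candidate lists and the entries are sorted, two runs over the same words produce identical map lines, so the map file itself should not be the source of the differences.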
Any advice or direction would be appreciated.