[SRILM User List] Odd behavior in disambig and OOV words

Tue Jan 3 15:12:11 PST 2012

Hello:

For some time now I've been using disambig to perform diacritic
disambiguation of Arabic.  I create an open-vocabulary LM of diacritized
forms from a training corpus, and for the input I use a morphological
analysis tool to create, for each input word, a list of possible
diacritized forms to use as the V2 mapping for the input form (V1).
I then use disambig to select one of the diacritized forms using the LM.

This works well, but recently I noticed a strange behavior.  I have a small
input file (A) of about 200 lines of text.  I run it through the above
process, and I get a mapped output file as expected. Then I take the input
file A and replace two words in the last line with different words
(creating input file B).  I run B through the same process as A (this
results in a very slightly different map file -- but only for the two words
that were replaced).

The odd behavior is that, when I compare the output mappings of A and B, not
only is the last line different, but over 70 other words in the file (in
different sentences) also have different V2 mappings.  After some checking,
I discovered (not too surprisingly) that all the affected words are ones that
were not present in the LM, so the effect is related to how disambig
handles OOV words.  Similar differences occur if I compare the mapped
output of two concatenated files to the concatenation of the two
files' mapped outputs (that is, [A+B].out  =/=  [A.out] + [B.out] ).
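
The comparison described above can be sketched roughly as follows.  This is
only an illustration, not part of SRILM: the function name diff_outputs is
hypothetical, and it assumes that the two output files have the same number
of lines and tokens per line, and that the LM vocabulary has already been
loaded into a Python set.

```python
def diff_outputs(lines_a, lines_b, vocab):
    """Compare two disambig outputs token by token.

    lines_a, lines_b: lists of output lines (strings of V2 forms).
    vocab: set of word forms known to the LM (assumed to be loaded elsewhere).
    Returns a list of (line_no, position, form_a, form_b, both_oov) tuples,
    one for each token position where the two outputs disagree.
    """
    diffs = []
    for line_no, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        for pos, (form_a, form_b) in enumerate(zip(a.split(), b.split())):
            if form_a != form_b:
                # True when neither selected form appears in the LM vocabulary
                both_oov = form_a not in vocab and form_b not in vocab
                diffs.append((line_no, pos, form_a, form_b, both_oov))
    return diffs
```

Running it on the lines of A.out and B.out and counting the entries with
both_oov set shows how many of the differences involve only forms missing
from the LM.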

I need to find a way to make sure disambig handles these words
consistently, so that changes in one part of a file do not affect the
results in a different part.  I'm hoping that there is some option setting
in disambig or ngram-count that I've overlooked that will correct the
problem, but I currently don't see one.

For reference, I create my LM using the options:

   ngram-count -text training-input-file -lm model-name.lm -order 5 -unk

and I run disambig using the options:

   disambig -keep-unk -text test-file.in -map test-file.map -order 5 -lm model-name.lm > test-file.out

My test-file.map is created without conditional probabilities, and the list
of V2 forms is always alphabetized to ensure a consistent ordering. The
morphological analyzer which generates the V2 forms is always consistent,
and its output does not depend on word context.
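
As a rough sketch of how such a map file can be produced: each line holds
the V1 form followed by its V2 candidates, sorted, with no probabilities.
The function names and the dictionary of analyzer output below are
hypothetical stand-ins, not the actual analyzer interface.

```python
def format_map_line(v1, candidates):
    """Return one map line: the V1 form, then its V2 forms in sorted order.

    Duplicates are dropped and no conditional probabilities are written.
    Sorting uses Python's default string order (by code point).
    """
    forms = sorted(set(candidates))
    return " ".join([v1] + forms)


def build_map(analyses):
    """Build all map lines from a dict {v1_form: [v2_form, ...]}.

    `analyses` is a hypothetical stand-in for the analyzer's output.
    """
    return [format_map_line(v1, analyses[v1]) for v1 in sorted(analyses)]
```

For example, format_map_line("ktb", ["kutub", "kataba"]) gives the line
"ktb kataba kutub", regardless of the order in which the analyzer lists the
candidates.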

Any advice or direction would be appreciated.

Thanks,

Ryan Roth
CCLS
Columbia University