[SRILM User List] lattice-tool

Andreas Stolcke stolcke at speech.sri.com
Sun Jul 11 11:35:33 PDT 2010


ali sadiqui wrote:
> thank you for your answer,
> indeed, I knew that ngram-count was the right command for creating a language model, but my confusion comes from the following:
>
> When segmenting an Arabic word according to the Prefix-Stem-Suffix model,
>
> a word “B” can give several results.
>
> Suppose that the word B yields 3 segmentation results:
>
> b1 = mot1 + suf1 (mot1 can be written stem1)
>
> b2 = pref1 + mot2
>
> b3 = mot3
>
> Starting from the corpus “A B C D E”, I create a file (programmatically):
>
> A mot1 suf1 C D E
>
> A pref1 mot2 C D E
>
> A mot3 C D E
>
> (to enumerate all the possible segmentations)
>
> Then, using SRILM, I will create a language model of order 3 (for example), to be used afterwards to support the segmentation of other text.
>
> My question is:
>
> - I assumed that I would need to create lattices; is that true or false?
>
> - If it is true, how do I proceed to use lattice-tool?
>
> I am very grateful for your help.
>
>
> Ali Sadiqui
> --- On Thu 22.4.10, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>   
Ali,

sorry for not responding earlier.   Your desire to use lattices now 
makes sense.
You need to encode your morphologically analyzed training data as 
lattices in either the HTK or the PFSG format.
PFSG is more limited but should be enough in your case.  See the 
pfsg-format(5) man page for a description.  There are also some examples in
$SRILM/lattice/test/tests/lattice-expansion/ .
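
As a rough illustration of the encoding step, the Python sketch below writes
one PFSG for the example sentence "A B C D E", with the three segmentations
of B as parallel paths.  The function names, the uniform 1/3 split between
alternatives, and the NULL start/end nodes are assumptions made for
illustration.  The weight scaling (natural log times 10000.5, rounded to an
integer) is only how the format is understood here and should be checked
against pfsg-format(5) before use.

```python
import math


def pfsg_weight(prob):
    # ASSUMPTION: PFSG weights are natural-log probabilities scaled by
    # 10000.5 and rounded to integers; verify against pfsg-format(5).
    return int(round(math.log(prob) * 10000.5))


def sentence_to_pfsg(name, slots):
    """Build PFSG text for one sentence (hypothetical helper).

    slots is a list of word positions; each position is a list of
    alternative segmentations; each segmentation is a list of morphs.
    """
    words = ["NULL"]        # node 0: null initial node
    transitions = []        # (from_node, to_node, weight)
    frontier = [0]          # nodes that feed into the next position
    for alternatives in slots:
        prob = 1.0 / len(alternatives)   # uniform split (assumption)
        new_frontier = []
        for morphs in alternatives:
            prev = None
            for i, morph in enumerate(morphs):
                node = len(words)
                words.append(morph)
                if i == 0:
                    # Entering this alternative from every frontier node.
                    for src in frontier:
                        transitions.append((src, node, pfsg_weight(prob)))
                else:
                    # Inside one segmentation: probability 1, weight 0.
                    transitions.append((prev, node, 0))
                prev = node
            new_frontier.append(prev)
        frontier = new_frontier
    final = len(words)
    words.append("NULL")    # null final node
    for src in frontier:
        transitions.append((src, final, 0))
    lines = [
        f"name {name}",
        f"nodes {len(words)} " + " ".join(words),
        "initial 0",
        f"final {final}",
        f"transitions {len(transitions)}",
    ]
    lines += [f"{src} {dst} {w}" for src, dst, w in transitions]
    return "\n".join(lines) + "\n"


# The example from this thread: B has three possible segmentations.
slots = [
    [["A"]],
    [["mot1", "suf1"], ["pref1", "mot2"], ["mot3"]],
    [["C"]], [["D"]], [["E"]],
]
pfsg_text = sentence_to_pfsg("sent1", slots)
print(pfsg_text)
```

Each sentence would go into its own file, and the list of those file names
is what -in-lattice-list expects.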

After each sentence is encoded as a lattice, you would use
    lattice-tool -in-lattice-list ... -write-ngrams NGRAMS
to generate ngram counts from the corpus.  Then you can train the LM using
    ngram-count -float-counts -read NGRAMS -lm ...
Note that the counts will be fractional, so you can only use certain 
smoothing methods, like -wbdiscount.

If you have trouble with the lattice generation you can also generate 
the ngram counts yourself.
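
A minimal sketch of that do-it-yourself route, in Python: enumerate every
segmentation path of a sentence, give each path an equal share of the
sentence's weight, and accumulate n-gram counts weighted by that share.
The helper names, the uniform weighting, and the explicit <s> and </s>
boundary tokens are assumptions for illustration; the output layout
(n-gram, tab, count) is meant to match what ngram-count -read accepts, but
check the result against a count file produced by SRILM itself.

```python
from collections import defaultdict
from itertools import product


def fractional_ngram_counts(slots, order=3):
    """Expected n-gram counts over all segmentation paths (hypothetical helper).

    slots is a list of word positions; each position is a list of
    alternative segmentations; each segmentation is a list of morphs.
    """
    counts = defaultdict(float)
    for path in product(*slots):
        # Uniform weighting: each alternative at a position is equally
        # likely (assumption), so a path's weight is the product of shares.
        weight = 1.0
        for alternatives in slots:
            weight *= 1.0 / len(alternatives)
        tokens = ["<s>"] + [m for morphs in path for m in morphs] + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += weight
    return counts


def format_counts(counts):
    # One n-gram per line: words separated by spaces, then a tab, then count.
    return [f"{' '.join(ngram)}\t{count:.6g}"
            for ngram, count in sorted(counts.items())]


# The example from this thread: B has three possible segmentations.
slots = [
    [["A"]],
    [["mot1", "suf1"], ["pref1", "mot2"], ["mot3"]],
    [["C"]], [["D"]], [["E"]],
]
counts = fractional_ngram_counts(slots, order=3)
count_lines = format_counts(counts)
print("\n".join(count_lines))
```

For a whole corpus the counts of all sentences would be summed into one
table before writing the file passed to ngram-count -float-counts -read.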

Note there are more sophisticated ways to model Arabic morphology, using 
factored LMs (FLMs).  Google the work of Katrin Kirchhoff; she developed 
FLMs partly for this purpose, and they are now incorporated in SRILM (if 
you have questions about this approach, contact her directly).

Andreas


>> From: Andreas Stolcke <stolcke at speech.sri.com>
>> Subject: Re: [SRILM User List] lattice-tool
>> To: "ali sadiqui" <sadiqui2000 at yahoo.fr>
>> Cc: srilm-user at speech.sri.com
>> Date: Thursday 22 April 2010, 6:42
>> ali sadiqui wrote:
>>> hi,
>>> I am a beginner with SRILM.
>>> I would like to create a lattice from a corpus
>>> "A B{b1, b2, b3} C" and then create a language model.
>>> I know I have to use lattice-tool, but how do
>>> I proceed?  I am stuck there.  I guess I should create
>>> a file in PFSG format, but if so:
>>>    How do I define the nodes?
>>>    How do I calculate the costs?
>>> Is this done manually or using a command?
>>> In short, how do I fill it in?
>>>
>>> I am very grateful for your help.
>>> thank you for your help
>>>
>> I think you are confused about how to build language
>> models.  You typically create LMs directly from ngram
>> counts extracted from a corpus, with no need to build
>> lattices.
>> Please consult the file $SRILM/doc/lm-intro for the most
>> basic procedures, and the FAQ file and recommended text
>> books for more details.
>>
>> Andreas
>>
>>     
>>>        
>>> _______________________________________________
>>> SRILM-User site list
>>> SRILM-User at speech.sri.com
>>> http://www.speech.sri.com/mailman/listinfo/srilm-user
