From asr.naxingyu at gmail.com  Sun Jan  8 21:02:48 2017
From: asr.naxingyu at gmail.com (Xingyu Na)
Date: Mon, 9 Jan 2017 13:02:48 +0800
Subject: [SRILM User List] srilm download fail
Message-ID: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>

Hi,

I filled in the download form and clicked the accept button. Then I was
redirected to a php page with only some php code like this:

===================
<?php
function strip_html_in_array($array) {
    foreach ($array as $key => $value) {
        $new_array["$key"] = strip_tags($value);
    }
    return $new_array;
}

function reclog($logfile) {
    global $newpost;
    date_default_timezone_set('America/Los_Angeles');
    $fecha = date(DATE_RFC2822);
    $remote_addr = $_SERVER['REMOTE_ADDR'];
    // REMOTE_HOST is undefined index in current Apache2 server
    // $remote_host = $_SERVER['REMOTE_HOST'] ?: gethostbyaddr($remote_addr);
    $remote_host = gethostbyaddr($remote_addr);
    $fh = fopen("$logfile", 'a+');
    fwrite($fh, "$fecha\n");
    fwrite($fh, "From_Addr=$remote_addr\n");
    fwrite($fh, "From_Host=$remote_host\n");
    fwrite($fh, "Name=" . $newpost['WWW_name'] . "\n");
    fwrite($fh, "Org=" . $newpost['WWW_org'] . "\n");
    fwrite($fh, "Address=" . $newpost['WWW_address'] . "\n");
    fwrite($fh, "Email=" . $newpost['WWW_email'] . "\n");
    fwrite($fh, "URL=" . $newpost['WWW_url'] . "\n");
    fwrite($fh, "File=" . $newpost['WWW_file'] . "\n");
    if (!isset($newpost['WWW_list'])) $newpost['WWW_list'] = "";
    fwrite($fh, "List=" . $newpost['WWW_list'] . "\n\n");
    fclose($fh);
}

function recemail($maillist) {
    global $newpost;
    $email = preg_replace('/\s+/', ' ', $newpost['WWW_email']);
    $fh = fopen("$maillist", 'a+');
    fwrite($fh, $newpost['WWW_name'] . " <$email>\n");
    fclose($fh);
}

function download($file) {
    if (file_exists($file)) {
        header('Content-Description: File Transfer');
        header('Content-Type: application/gzip');
        header('Content-Disposition: attachment; filename='.basename($file));
        header('Expires: 0');
        header('Cache-Control: must-revalidate');
        header('Pragma: public');
        header('Content-Length: ' . filesize($file));
        readfile($file);
        exit;
    } else {
        header("Content-type: text/plain\n");
        header("Status: 404 Not Found\n");
        print "$file not found!\n";
    }
}

/**** MAIN ****/

// clean input values
$newpost = strip_html_in_array($_POST);

// check for proper form entry
if (empty($newpost['WWW_name']) || empty($newpost['WWW_email'])) {
    if (!empty($newpost['WWW_signup'])) {
        // for sign-up
        print "Your Name or Email are missing. ";
        print "Please go back and complete the form. ";
        exit(0);
    } else if (empty($newpost['WWW_address'])) {
        // for download
        print "Your Name, Address or Email are missing. ";
        print "Please go back and complete the form. ";
        exit(0);
    }
}

/* DEBUGGING
print "Send result: ";
print " ";
print_r($_POST);
print_r($newpost);
print " ";
exit (0);
*/

if (!isset($newpost['WWW_list'])) {
    recemail($maillist_announce);
} else if (isset($newpost['WWW_signup'])) {
    recemail($maillist_users);
}

if (isset($newpost['WWW_signup'])) {
    header('Content-Description: Display signup successfully done');
    header('Content-Type: text/html');
    header('Expires: 0');
    header('Cache-Control: must-revalidate');
    header('Pragma: public');
    print " ";
    print "";
    print " ";
    print "";
    print "";
    exit(0);
} else {
    // not signup so it's download
    reclog($logfile);
    download("$datadir/" . $newpost['WWW_file']);
}
?>
===================

I tried Safari and Chrome. Could anyone help? Thanks!

Xingyu
From chiachi at speech.sri.com  Sun Jan  8 23:14:19 2017
From: chiachi at speech.sri.com (Chiachi Hung)
Date: Sun, 8 Jan 2017 23:14:19 -0800
Subject: [SRILM User List] srilm download fail
In-Reply-To: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>
References: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>
Message-ID: <03f805f2-1ac7-bd29-239e-6b1869ea31b3@speech.sri.com>

Hi Xingyu,

Sorry for any inconvenience this may have caused you. We have restored
the service. Please give it a try.

Chiachi

On 01/08/2017 09:02 PM, Xingyu Na wrote:
> Hi,
>
> I filled in the download form and clicked the accept button. Then I
> was redirected to a php page with only some php code. [...]
>
> I tried Safari and Chrome. Could anyone help? Thanks!
>
> Xingyu

From asr.naxingyu at gmail.com  Mon Jan  9 00:10:52 2017
From: asr.naxingyu at gmail.com (Xingyu Na)
Date: Mon, 9 Jan 2017 16:10:52 +0800
Subject: [SRILM User List] srilm download fail
In-Reply-To: <03f805f2-1ac7-bd29-239e-6b1869ea31b3@speech.sri.com>
References: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>
 <03f805f2-1ac7-bd29-239e-6b1869ea31b3@speech.sri.com>

It works. Thank you!

X.

> On 9 Jan 2017, at 15:14, Chiachi Hung wrote:
>
> Hi Xingyu,
>
> Sorry for any inconvenience this may have caused you. We have restored
> the service. Please give it a try.
>
> Chiachi [...]

From tsuki_stefy at yahoo.com  Tue Jan 24 04:14:56 2017
From: tsuki_stefy at yahoo.com (Stefy D.)
Date: Tue, 24 Jan 2017 12:14:56 +0000 (UTC)
Subject: [SRILM User List] perplexity results
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
Message-ID: <1323358276.4219167.1485260096850@mail.yahoo.com>

Hello. I have a question regarding perplexity. I am using srilm to
compute the perplexity of some sentences using a LM trained on a big
corpus. Given a sentence and a LM, the perplexity tells how well that
sentence fits the language (as far as I understood), and the lower the
perplexity, the better the sentence fits.

$NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -unk -text Wikipedia.en-es.es -lm lm/lmodel_es.lm

$NGRAM_FILE -order 5 -debug 1 -unk -lm lm/lmodel_es.lm -ppl testlabeled.en-es.es > perplexity_es_testlabeled.ppl

I did the same on EN and on ES and here are some results I got:

Sixty-six parent coordinators were laid off," the draft complaint says, "and not merely excessed.
1 sentences, 14 words, 0 OOVs
0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9

Mexico's Enrique Pena Nieto faces tough start
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964
The NATO mission officially ended Oct. 31.
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6

Sesenta y seis padres coordinadores fueron despedidos," el proyecto de denuncia, dice, "y no simplemente excessed.
1 sentences, 16 words, 0 OOVs
0 zeroprobs, logprob= -57.0322 ppl= 2263.79 ppl1= 3668.72

México Enrique Peña Nieto enfrenta duras comienzo
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -29.5672 ppl= 4964.71 ppl1= 16744.7

Why are the perplexities for the EN sentences so big? The smallest ppl
I get for an EN sentence is about 250. The Spanish sentences have some
errors, so I was expecting big ppl numbers. Should I change something
in the way I compute the LMs?

Thank you very much!!

From nemeskeyd at gmail.com  Tue Jan 24 04:57:58 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Tue, 24 Jan 2017 13:57:58 +0100
Subject: [SRILM User List] perplexity results
In-Reply-To: <1323358276.4219167.1485260096850@mail.yahoo.com>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>

Hi,

it is hard to tell without knowing e.g. the training set. But I would
try running ngram with higher values for -debug. I think even -debug 2
tells you the logprob of the individual words. That could be a start.
I actually added another debug level (100), where I print the 5 most
likely candidates (this requires a "forward trie", in addition to the
default "backwards" one, to be of usable speed) to get a sense of the
proportions and of how the model and the text differ.

Also, just wondering: is the training corpus bilingual (en-es)?

Best,
Dávid Nemeskey

On Tue, Jan 24, 2017 at 1:14 PM, Stefy D. wrote:
> Hello. I have a question regarding perplexity. I am using srilm to
> compute the perplexity of some sentences using a LM trained on a big
> corpus. [...]
>
> Why are the perplexities for the EN sentences so big? [...]
>
> Thank you very much!!
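A note on how these figures relate: ngram reports base-10 log
probabilities, and (per the ngram man page) ppl = 10^(-logprob /
(words - OOVs - zeroprobs + sentences)), while ppl1 leaves the
sentences (i.e., the </s> tokens) out of the denominator. For the
"Mexico's Enrique Pena Nieto faces tough start" example above,
10^(39.1759 / 8) ≈ 78884 and 10^(39.1759 / 7) ≈ 394964, matching the
reported ppl and ppl1.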
From mstefd22 at gmail.com  Tue Jan 24 05:46:12 2017
From: mstefd22 at gmail.com (Stef M)
Date: Tue, 24 Jan 2017 14:46:12 +0100
Subject: [SRILM User List] perplexity results

Hello David.

Thank you very much for answering. I am not sure if you received my
reply, as the yahoo servers have problems right now, so I switched to
gmail (sorry if you already received the email).

I used the Wikipedia parallel corpus en-es for training the two LMs
(http://opus.lingfil.uu.se/Wikipedia.php, 1.8M sentence pairs). I used
-debug 2 as you said and below are the results. Could you please help
me understand why the perplexity numbers are so high for the EN
sentences, since they are well formed? For testing Spanish I used
machine-translated output, so I was expecting big numbers for ppl.
Thank you!

Sixty-six parent coordinators were laid off," the draft complaint says, "and not merely excessed.
	p( Sixty-six | <s> ) = [1gram] 2.16995e-09 [ -8.66355 ]
	p( parent | Sixty-six ...) = [1gram] 1.0949e-05 [ -4.96063 ]
	p( coordinators | parent ...) = [1gram] 3.37871e-07 [ -6.47125 ]
	p( were | coordinators ...) = [1gram] 0.00120231 [ -2.91998 ]
	p( laid | were ...) = [2gram] 0.000696035 [ -3.15737 ]
	p( off," | laid ...) = [1gram] 2.33407e-08 [ -7.63189 ]
	p( the | off," ...) = [2gram] 0.0469306 [ -1.32854 ]
	p( draft | the ...) = [2gram] 7.67904e-05 [ -4.11469 ]
	p( complaint | draft ...) = [1gram] 8.13141e-07 [ -6.08983 ]
	p( says, | complaint ...) = [1gram] 1.17395e-05 [ -4.93035 ]
	p( "and | says, ...) = [2gram] 0.00147669 [ -2.83071 ]
	p( not | "and ...) = [1gram] 0.000275198 [ -3.56035 ]
	p( merely | not ...) = [2gram] 0.00173666 [ -2.76029 ]
	p( <unk> | merely ...) = [1gram] 0.0796503 [ -1.09881 ]
	p( </s> | <unk> ...) = [1gram] 0.0258359 [ -1.58778 ]
1 sentences, 14 words, 0 OOVs
0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9

Mexico's Enrique Pena Nieto faces tough start
	p( Mexico's | <s> ) = [2gram] 1.31547e-06 [ -5.88092 ]
	p( Enrique | Mexico's ...) = [1gram] 1.34348e-05 [ -4.87177 ]
	p( Pena | Enrique ...) = [1gram] 1.83116e-06 [ -5.73727 ]
	p( Nieto | Pena ...) = [1gram] 1.6622e-06 [ -5.77932 ]
	p( faces | Nieto ...) = [1gram] 1.61354e-05 [ -4.79222 ]
	p( tough | faces ...) = [1gram] 2.80928e-06 [ -5.5514 ]
	p( start | tough ...) = [1gram] 2.90611e-05 [ -4.53669 ]
	p( </s> | start ...) = [1gram] 0.00941231 [ -2.0263 ]
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964

The NATO mission officially ended Oct. 31.
	p( The | <s> ) = [2gram] 0.143584 [ -0.842893 ]
	p( NATO | The ...) = [3gram] 5.55208e-06 [ -5.25554 ]
	p( mission | NATO ...) = [1gram] 3.10877e-05 [ -4.50741 ]
	p( officially | mission ...) = [1gram] 2.81221e-05 [ -4.55095 ]
	p( ended | officially ...) = [2gram] 0.00976927 [ -2.01014 ]
	p( Oct. | ended ...) = [1gram] 2.4073e-07 [ -6.61847 ]
	p( 31. | Oct. ...) = [1gram] 3.60453e-06 [ -5.44315 ]
	p( </s> | 31. ...) = [2gram] 0.907671 [ -0.0420717 ]
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6
From nemeskeyd at gmail.com  Tue Jan 24 07:27:56 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Tue, 24 Jan 2017 16:27:56 +0100
Subject: [SRILM User List] perplexity results

If you have a look at the content of the first square brackets, you can
see that very few words come from 2-grams or higher. What this means is
that the model could almost never find the context in the training data
and had to fall back on the unigram model quite a lot, so what you see
here is basically the performance of an -order 1 model -- but the
numbers seem quite high even for that...

Are you sure the commands you issued were the ones in your mail? If
yes, it would be interesting to see statistics of the corpus you used.
How big is the vocabulary? How big are the unigram frequencies? Is it
possible that the distribution has a very long tail, and almost all
words occur only 1-2 times?

I would also do some preprocessing on the data, like lowercasing
everything and running a tokenizer on it to split e.g. '"and' into the
two tokens '"' and 'and'.

On Tue, Jan 24, 2017 at 2:46 PM, Stef M wrote:
> Hello David.
>
> I used the Wikipedia parallel corpus en-es for training the two LMs
> (http://opus.lingfil.uu.se/Wikipedia.php, 1.8M sentence pairs). I used
> -debug 2 as you said and below are the results. [...]
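A rough sketch of that kind of preprocessing, assuming an ASCII-ish
corpus (GNU tr does not lowercase multibyte characters, so accented
Spanish text really needs a locale-aware tool or a proper tokenizer
such as Moses' tokenizer.perl; the file names here are placeholders):

tr '[:upper:]' '[:lower:]' < corpus.txt | sed 's/\([",.;:!?]\)/ \1 /g' > corpus.tok.txt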
From stolcke at icsi.berkeley.edu  Tue Jan 24 10:06:00 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 24 Jan 2017 10:06:00 -0800
Subject: [SRILM User List] perplexity results
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
Message-ID: <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>

Make sure text normalization is consistent between training and test
data (e.g., capitalization - consider mapping to lower-case - and the
encoding of diacritics).

Also, you're using -unk, i.e., your model contains an unknown-word
token, which means OOVs get assigned a non-zero, but possibly very low,
probability. This could mask a big divergence in the vocabulary, and
the high perplexity could be the result of lots of OOV words that all
get a low probability via <unk>. Try training without -unk and observe
the tally of OOVs in the ppl output.

Andreas

On 1/24/2017 4:57 AM, Dávid Nemeskey wrote:
> Hi,
>
> it is hard to tell without knowing e.g. the training set. But I would
> try running ngram with higher values for -debug. [...]
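As a concrete illustration of this suggestion, here is the
training/scoring pair from earlier in the thread with -unk simply
removed (file names as in the original commands):

$NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -text Wikipedia.en-es.es -lm lm/lmodel_es.lm
$NGRAM_FILE -order 5 -debug 1 -lm lm/lmodel_es.lm -ppl testlabeled.en-es.es > perplexity_es_testlabeled.ppl

Without -unk the model has a closed vocabulary: test words never seen
in training are excluded from the logprob and counted in the "N OOVs"
field of the output, which makes a vocabulary mismatch easy to spot.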
From xulikui123321 at 163.com  Wed Feb  8 23:20:37 2017
From: xulikui123321 at 163.com (Xu)
Date: Thu, 9 Feb 2017 15:20:37 +0800 (CST)
Subject: [SRILM User List] perplexity results
In-Reply-To: <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
 <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
Message-ID: <3ced5bc5.74d0.15a21bed04e.Coremail.xulikui123321@163.com>

Hi, Andreas:

I want to turn the ngram program into a web service, so I can query
perplexity from a browser page. In the ~/srilm/lm/src directory I
rewrote ngram.cc, named it ngramService.cc, and then compiled it with
the following command:

g++ -m64 -I ~/srilm/include/ -c ngramService.cc -o ngramService

The compile succeeded. But when I execute it, the system prompts:

./ngramService: cannot execute binary file

even after a chmod +x ngramService command.

Am I missing something in the compile command? My machine is 64-bit;
when I type uname -a:

Linux bjzw_48_43 2.6.32-504.23.4.el6.x86_64 #1 SMP Fri May 29 10:16:43 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

From mstefd22 at gmail.com  Thu Feb  9 04:21:29 2017
From: mstefd22 at gmail.com (Stef M)
Date: Thu, 9 Feb 2017 13:21:29 +0100
Subject: [SRILM User List] perplexity results
In-Reply-To: <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
 <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>

Hello David and Andreas,

sorry for replying so late. Thank you very much for your suggestions.
Indeed, I had forgotten to preprocess the test set. I got better
results after preprocessing, so thanks a lot for pointing it out!

2017-01-24 19:06 GMT+01:00 Andreas Stolcke:
> Make sure text normalization is consistent between training and test
> data (e.g., capitalization - consider mapping to lower-case - and the
> encoding of diacritics).
>
> Also, you're using -unk [...] Try training without -unk and observe
> the tally of OOVs in the ppl output.
>
> Andreas [...]
From nshmyrev at yandex.ru  Thu Feb  9 05:25:45 2017
From: nshmyrev at yandex.ru (Nickolay V. Shmyrev)
Date: Thu, 09 Feb 2017 16:25:45 +0300
Subject: [SRILM User List] perplexity results
In-Reply-To: <3ced5bc5.74d0.15a21bed04e.Coremail.xulikui123321@163.com>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
 <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
 <3ced5bc5.74d0.15a21bed04e.Coremail.xulikui123321@163.com>
Message-ID: <652881486646745@web7m.yandex.ru>

The option `-c` compiles an object file, a temporary file you cannot
execute. To create an executable you need to link the object files and
the libraries.

You can learn more about basic GCC usage from the documentation; for
example, this book is good:

http://www.network-theory.co.uk/docs/gccintro/index.html

You need Chapter 2, called "Compiling a C program".

09.02.2017, 10:59, "Xu":
> Hi, Andreas:
>
> I want to turn the ngram program into a web service, so I can query
> perplexity from a browser page. [...]
>
> Am I missing something in the compile command?
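For concreteness, a sketch of the compile-and-link steps Nickolay
describes, assuming SRILM was built in ~/srilm with
MACHINE_TYPE=i686-m64 (the library directory and the exact library
list -- liboolm, libdstruct, libmisc, plus zlib -- can differ between
SRILM versions and builds, so check your own lib directory):

g++ -m64 -I ~/srilm/include -c ngramService.cc -o ngramService.o
g++ -m64 -o ngramService ngramService.o -L ~/srilm/lib/i686-m64 -loolm -ldstruct -lmisc -lz

The first command only produces the object file ngramService.o; it is
the second, linking step that produces a runnable binary.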
From kalpeshk2011 at gmail.com  Sat Mar 11 11:32:11 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Sun, 12 Mar 2017 01:02:11 +0530
Subject: [SRILM User List] Generate Probability Distribution

Hello,

I have a context of words and I've built an N-gram language model using
./ngram-count. I wish to generate a probability distribution (over the
entire vocabulary of words) for the next word. I can't seem to find a
good way to do this with ./ngram. What's the best way to do this?

For example, if my vocabulary has the words "apple, banana, carrot",
and my context is "apple banana banana carrot", I want a distribution
like {"apple": 0.25, "banana": 0.5, "carrot": 0.25}.

Thank you,
Kalpesh Krishna
http://martiansideofthemoon.github.io/

From nemeskeyd at gmail.com  Sun Mar 12 00:12:07 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Sun, 12 Mar 2017 09:12:07 +0100
Subject: [SRILM User List] Generate Probability Distribution

Hi Kalpesh,

well, there's LM::wordProb(VocabIndex word, const VocabIndex *context)
in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram
model). You could simply call it on every word in the vocabulary.
However, be warned that this will be very slow for any reasonable
vocabulary size (say 10k and up). This function is also what
generateWord() calls, which is why the latter is so slow.

If you just wanted the top n most probable words, the situation would
be a bit different. Then wordProb() wouldn't be the optimal solution,
because the trie built by ngram is reversed (meaning you have to go
back from the word to the root, and not the other way around), and you
would have to query all words to get the most probable one. So when I
wanted to do this, I built another trie (from the root up to the word),
which made it much faster, though I am not sure it was 100% correct in
the face of negative backoff weights. But it wouldn't help in your
case, I guess.

Best,
Dávid

On Sat, Mar 11, 2017 at 8:32 PM, Kalpesh Krishna wrote:
> Hello,
> I have a context of words and I've built an N-gram language model
> using ./ngram-count. I wish to generate a probability distribution
> (over the entire vocabulary of words) for the next word. [...]
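A minimal sketch of the brute-force loop Dávid describes, written
against SRILM's C++ API as I understand it (Vocab, Ngram, VocabIter,
File from the lm and dstruct headers); the model file name, order, and
context words are placeholders:

#include <stdio.h>
#include "File.h"
#include "Vocab.h"
#include "Ngram.h"

int main() {
    Vocab vocab;
    Ngram lm(vocab, 5);                      /* order-5 model */

    File lmFile("lmodel.lm", "r");           /* placeholder LM file */
    lm.read(lmFile);                         /* also populates vocab */

    /* SRILM contexts are most-recent-word first, Vocab_None terminated */
    VocabIndex context[5];
    context[0] = vocab.getIndex("carrot");
    context[1] = vocab.getIndex("banana");
    context[2] = vocab.getIndex("banana");
    context[3] = vocab.getIndex("apple");
    context[4] = Vocab_None;

    /* enumerate P(w | context) for every word in the vocabulary */
    VocabIter iter(vocab);
    VocabIndex wid;
    while (iter.next(wid)) {
        LogP lp = lm.wordProb(wid, context); /* base-10 log prob */
        printf("%s\t%g\n", vocab.getWord(wid), LogPtoProb(lp));
    }
    return 0;
}

Dávid's caveat applies: this is one wordProb() lookup per vocabulary
word for each context, so it is fine for a one-off query but slow
inside a tight decoding loop.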
From kalpeshk2011 at gmail.com  Sun Mar 12 04:10:33 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Sun, 12 Mar 2017 16:40:33 +0530
Subject: [SRILM User List] Generate Probability Distribution

Hi Dávid,

Thank you for your response. Are there any existing binaries which will
help me do this quickly? I don't mind a non-SRILM ARPA file reader
either. Yes, the top N words might be good enough in my use case,
especially when they cover more than 99% of the probability mass. I
like the idea of building a trie to do this.

Thank you,
Kalpesh

On 12 Mar 2017 1:42 p.m., "Dávid Nemeskey" wrote:
> Hi Kalpesh,
>
> well, there's LM::wordProb(VocabIndex word, const VocabIndex *context)
> in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram
> model). You could simply call it on every word in the vocabulary. [...]
From stolcke at icsi.berkeley.edu  Mon Mar 13 10:11:06 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 13 Mar 2017 10:11:06 -0700
Subject: [SRILM User List] Generate Probability Distribution
Message-ID: <431d6b37-12ae-155f-186e-7e67c1249814@icsi.berkeley.edu>

A brute-force solution to this (if you don't want to modify any code)
is to generate an N-gram count file of the form

apple banana banana carrot apple 1
apple banana banana carrot banana 1
apple banana banana carrot carrot 1

and pass it to

ngram -lm LM -order 5 -counts COUNTS -debug 2

If you want to make a minimal code change to enumerate all conditional
probabilities for any context encountered, you could do so in
LM::wordProbSum() and have it dump out the word tokens and their log
probabilities. Then process some text with ngram -debug 3.

Andreas

On 3/12/2017 12:12 AM, Dávid Nemeskey wrote:
> Hi Kalpesh,
>
> well, there's LM::wordProb(VocabIndex word, const VocabIndex *context)
> in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram
> model). You could simply call it on every word in the vocabulary. [...]
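One hypothetical way to generate such a COUNTS file for every word in
a vocabulary (vocab.txt and the context are placeholders; a word list
can be produced with ngram-count -write-vocab):

awk '{ print "apple banana banana carrot", $1, 1 }' vocab.txt > COUNTS
ngram -lm LM -order 5 -counts COUNTS -debug 2

With -debug 2, each count line is scored and the conditional log
probability of its final word given the preceding context is printed,
which is exactly the distribution asked for above.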
From nemeskeyd at gmail.com  Tue Mar 14 01:58:12 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Tue, 14 Mar 2017 09:58:12 +0100
Subject: [SRILM User List] Generate Probability Distribution

Hi Kalpesh,

I could send you a binary, but that, as I mentioned above, is only PAC
(not in the machine learning sense). So there would be some work
involved first:
- sort the words in my trie by frequency, not alphanumerically
- always check the lower trie node, especially if the backoff weight
  is > 0.

These changes shouldn't take much time, and they would cut the cost
tremendously (if you want the top k words, then O(nk) instead of
O(Vk)). So I think it would make more sense to send you the code, but I
based it on an older version of SRILM, so if you are using the latest
one, it might not be so simple to port just by looking at my version.
If you have a GitHub account, though, I could give you access to my
private repo, and then you would see exactly what I changed.

Best,
Dávid

On Sun, Mar 12, 2017 at 12:10 PM, Kalpesh Krishna wrote:
> Hi Dávid,
> Thank you for your response. Are there any existing binaries which
> will help me do this quickly? [...]
From kalpeshk2011 at gmail.com  Tue Mar 14 22:48:33 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Wed, 15 Mar 2017 11:18:33 +0530
Subject: [SRILM User List] Generate Probability Distribution
In-Reply-To: <431d6b37-12ae-155f-186e-7e67c1249814@icsi.berkeley.edu>
References: <431d6b37-12ae-155f-186e-7e67c1249814@icsi.berkeley.edu>

Thank you Andreas! This approach is getting me the probabilities really
quickly (within 0.5 seconds, including the pre- and post-processing
steps in a Python wrapper on a single core). It was very satisfying to
see `np.sum(distribution)` returning values like `0.99999994929300007`.

Thank you for your help Dávid! I'd love to have a look at your code.
Here is my GitHub handle - martiansideofthemoon

With Regards,
Kalpesh Krishna
http://martiansideofthemoon.github.io/

On Mon, Mar 13, 2017 at 10:41 PM, Andreas Stolcke wrote:
> A brute-force solution to this (if you don't want to modify any code)
> is to generate an N-gram count file of the form [...]
>
> and pass it to
>
> ngram -lm LM -order 5 -counts COUNTS -debug 2 [...]

--
Kalpesh Krishna,
Junior Undergraduate,
Electrical Engineering,
IIT Bombay
From maituanbk2012 at gmail.com  Fri Mar 17 07:44:48 2017
From: maituanbk2012 at gmail.com (Van Tuan MAI)
Date: Fri, 17 Mar 2017 15:44:48 +0100
Subject: [SRILM User List] build a language model multiword

hello,

I have a text file that contains all the words in a story, and a vocab
file that includes not only normal words but also wrongly pronounced
words (a, b(b1, b2), c(c1, c2, c3)). So can I add b1, b2, c1, c2 into
the N-gram models?

thanks in advance

Best,

From stolcke at icsi.berkeley.edu  Fri Mar 17 11:54:24 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 17 Mar 2017 11:54:24 -0700
Subject: [SRILM User List] build a language model multiword
Message-ID: <741fb878-5376-98db-de35-bd1f0e6fd2c1@icsi.berkeley.edu>

On 3/17/2017 7:44 AM, Van Tuan MAI wrote:
> hello,
>
> I have a text file that contains all the words in a story, and a vocab
> file that includes not only normal words but also wrongly pronounced
> words (a, b(b1, b2), c(c1, c2, c3)). So can I add b1, b2, c1, c2 into
> the N-gram models?

I'm not sure I fully understand your notation (can you give examples of
what b, b1, b2, etc. stand for?), but you can train an LM on "normal"
or "wrong" words as you wish. The software makes no difference between
them.

You have to experiment to find out whether mapping "wrong" to "normal"
words (usually called "text normalization" or TN) would help the
performance of your overall system. The rationale for TN is that it
reduces the sparseness of your data and thereby improves
generalization. Also, if you have a postprocessing step that interprets
the words, it might help to deal only with "normal" words.

Andreas

From maituanbk2012 at gmail.com  Thu Mar 23 05:03:06 2017
From: maituanbk2012 at gmail.com (Van Tuan MAI)
Date: Thu, 23 Mar 2017 13:03:06 +0100
Subject: [SRILM User List] Build a 3-gram language model for HTK project

hello,

Now I want to build a 3-gram model for an HTK project. So how can I
combine HTK and SRILM for speech recognition?

thanks for your time,

Van Tuan