From asr.naxingyu at gmail.com  Sun Jan  8 21:02:48 2017
From: asr.naxingyu at gmail.com (Xingyu Na)
Date: Mon, 9 Jan 2017 13:02:48 +0800
Subject: [SRILM User List] srilm download fail
Message-ID: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>

Hi,

I filled in the download form and clicked the accept button. Then I was
redirected to a php page with only some php code like this:

===================
<?php
function strip_html_in_array($array) {
    foreach ($array as $key => $value) {
        $new_array["$key"] = strip_tags($value);
    }
    return $new_array;
}

function reclog($logfile) {
    global $newpost;
    date_default_timezone_set('America/Los_Angeles');
    $fecha = date(DATE_RFC2822);
    $remote_addr = $_SERVER['REMOTE_ADDR'];
    // REMOTE_HOST is undefined index in current Apache2 server
    // $remote_host = $_SERVER['REMOTE_HOST'] ?: gethostbyaddr($remote_addr);
    $remote_host = gethostbyaddr($remote_addr);
    $fh = fopen("$logfile", 'a+');
    fwrite($fh, "$fecha\n");
    fwrite($fh, "From_Addr=$remote_addr\n");
    fwrite($fh, "From_Host=$remote_host\n");
    fwrite($fh, "Name=" . $newpost['WWW_name'] . "\n");
    fwrite($fh, "Org=" . $newpost['WWW_org'] . "\n");
    fwrite($fh, "Address=" . $newpost['WWW_address'] . "\n");
    fwrite($fh, "Email=" . $newpost['WWW_email'] . "\n");
    fwrite($fh, "URL=" . $newpost['WWW_url'] . "\n");
    fwrite($fh, "File=" . $newpost['WWW_file'] . "\n");
    if (!isset($newpost['WWW_list'])) $newpost['WWW_list'] = "";
    fwrite($fh, "List=" . $newpost['WWW_list'] . "\n\n");
    fclose($fh);
}

function recemail($maillist) {
    global $newpost;
    $email = preg_replace('/\s+/', ' ', $newpost['WWW_email']);
    $fh = fopen("$maillist", 'a+');
    fwrite($fh, $newpost['WWW_name'] . " <$email>\n");
    fclose($fh);
}

function download($file) {
    if (file_exists($file)) {
        header('Content-Description: File Transfer');
        header('Content-Type: application/gzip');
        header('Content-Disposition: attachment; filename='.basename($file));
        header('Expires: 0');
        header('Cache-Control: must-revalidate');
        header('Pragma: public');
        header('Content-Length: ' . filesize($file));
        readfile($file);
        exit;
    } else {
        header("Content-type: text/plain\n");
        header("Status: 404 Not Found\n");
        print "$file not found!\n";
    }
}

/**** MAIN ****/

// clean input values
$newpost = strip_html_in_array($_POST);

// check for proper form entry
if (empty($newpost['WWW_name']) || empty($newpost['WWW_email'])) {
    if (!empty($newpost['WWW_signup'])) {
        // for sign-up
        print "Your Name or Email are missing. ";
        print "Please go back and complete the form. ";
        exit(0);
    } else if (empty($newpost['WWW_address'])) {
        // for download
        print "Your Name, Address or Email are missing. ";
        print "Please go back and complete the form. ";
        exit(0);
    }
}

/* DEBUGGING
print "Send result: ";
print " ";
print_r($_POST);
print_r($newpost);
print " ";
exit (0);
*/

if (!isset($newpost['WWW_list'])) {
    recemail($maillist_announce);
} else if (isset($newpost['WWW_signup'])) {
    recemail($maillist_users);
}

if (isset($newpost['WWW_signup'])) {
    header('Content-Description: Display signup successfully done');
    header('Content-Type: text/html');
    header('Expires: 0');
    header('Cache-Control: must-revalidate');
    header('Pragma: public');
    print " ";
    print "";
    print " ";
    print "";
    print "";
    exit(0);
} else {
    // not signup so it's download
    reclog($logfile);
    download("$datadir/" . $newpost['WWW_file']);
}
?>
===================

I tried Safari and Chrome. Could anyone help? Thanks!

Xingyu
From chiachi at speech.sri.com  Sun Jan  8 23:14:19 2017
From: chiachi at speech.sri.com (Chiachi Hung)
Date: Sun, 8 Jan 2017 23:14:19 -0800
Subject: [SRILM User List] srilm download fail
In-Reply-To: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>
References: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>
Message-ID: <03f805f2-1ac7-bd29-239e-6b1869ea31b3@speech.sri.com>

Hi Xingyu,

Sorry for any inconvenience this may have caused you. We have restored
the service. Please give it a try.

Chiachi

On 01/08/2017 09:02 PM, Xingyu Na wrote:
> Hi,
>
> I filled in the download form and clicked the accept button. Then I
> was redirected to a php page with only some php code. [...]
>
> I tried Safari and Chrome. Could anyone help? Thanks!
>
> Xingyu

From asr.naxingyu at gmail.com  Mon Jan  9 00:10:52 2017
From: asr.naxingyu at gmail.com (Xingyu Na)
Date: Mon, 9 Jan 2017 16:10:52 +0800
Subject: [SRILM User List] srilm download fail
In-Reply-To: <03f805f2-1ac7-bd29-239e-6b1869ea31b3@speech.sri.com>
References: <74B934F5-41B3-4EC3-BDFD-1017CC2FB944@gmail.com>
 <03f805f2-1ac7-bd29-239e-6b1869ea31b3@speech.sri.com>

It works. Thank you!

X.

> On 9 Jan 2017, at 15:14, Chiachi Hung wrote:
>
> Hi Xingyu,
>
> Sorry for any inconvenience this may have caused you. We have restored
> the service. Please give it a try.
>
> Chiachi [...]

From tsuki_stefy at yahoo.com  Tue Jan 24 04:14:56 2017
From: tsuki_stefy at yahoo.com (Stefy D.)
Date: Tue, 24 Jan 2017 12:14:56 +0000 (UTC)
Subject: [SRILM User List] perplexity results
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
Message-ID: <1323358276.4219167.1485260096850@mail.yahoo.com>

Hello. I have a question regarding perplexity. I am using srilm to
compute the perplexity of some sentences using a LM trained on a big
corpus. Given a sentence and a LM, the perplexity tells how well that
sentence fits the language (as far as I understood), and the lower the
perplexity, the better the sentence fits.

$NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -unk -text Wikipedia.en-es.es -lm lm/lmodel_es.lm

$NGRAM_FILE -order 5 -debug 1 -unk -lm lm/lmodel_es.lm -ppl testlabeled.en-es.es > perplexity_es_testlabeled.ppl

I did the same on EN and on ES and here are some results I got:

Sixty-six parent coordinators were laid off," the draft complaint says, "and not merely excessed.
1 sentences, 14 words, 0 OOVs
0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9

Mexico's Enrique Pena Nieto faces tough start
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964
The NATO mission officially ended Oct. 31.
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6

Sesenta y seis padres coordinadores fueron despedidos," el proyecto de denuncia, dice, "y no simplemente excessed.
1 sentences, 16 words, 0 OOVs
0 zeroprobs, logprob= -57.0322 ppl= 2263.79 ppl1= 3668.72

México Enrique Peña Nieto enfrenta duras comienzo
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -29.5672 ppl= 4964.71 ppl1= 16744.7

Why are the perplexities for the EN sentences so big? The smallest ppl
I get for an EN sentence is about 250. The Spanish sentences have some
errors, so I was expecting big ppl numbers. Should I change something
in the way I compute the LMs?

Thank you very much!!

From nemeskeyd at gmail.com  Tue Jan 24 04:57:58 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Tue, 24 Jan 2017 13:57:58 +0100
Subject: [SRILM User List] perplexity results
In-Reply-To: <1323358276.4219167.1485260096850@mail.yahoo.com>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>

Hi,

it is hard to tell without knowing e.g. the training set. But I would
try running ngram with higher values for -debug. I think even -debug 2
tells you the logprob of the individual words. That could be a start.
I actually added another debug level (100), where I print the 5 most
likely candidates (this requires a "forward trie", in addition to the
default "backwards" one, to be of usable speed) to get a sense of the
proportions and of how the model and the text differ.

Also, just wondering: is the training corpus bilingual (en-es)?

Best,
Dávid Nemeskey

On Tue, Jan 24, 2017 at 1:14 PM, Stefy D. wrote:
> Hello. I have a question regarding perplexity. I am using srilm to
> compute the perplexity of some sentences using a LM trained on a big
> corpus. [...]
>
> Why are the perplexities for the EN sentences so big? [...]
>
> Thank you very much!!
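A note on how these figures relate: ngram reports base-10 log
probabilities, and (per the ngram man page) ppl = 10^(-logprob /
(words - OOVs - zeroprobs + sentences)), while ppl1 leaves the
sentences (i.e., the </s> tokens) out of the denominator. For the
"Mexico's Enrique Pena Nieto faces tough start" example above,
10^(39.1759 / 8) ≈ 78884 and 10^(39.1759 / 7) ≈ 394964, matching the
reported ppl and ppl1.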
From mstefd22 at gmail.com  Tue Jan 24 05:46:12 2017
From: mstefd22 at gmail.com (Stef M)
Date: Tue, 24 Jan 2017 14:46:12 +0100
Subject: [SRILM User List] perplexity results

Hello David.

Thank you very much for answering. I am not sure if you received my
reply, as the yahoo servers have problems right now, so I switched to
gmail (sorry if you already received the email).

I used the Wikipedia parallel corpus en-es for training the two LMs
(http://opus.lingfil.uu.se/Wikipedia.php, 1.8M sentence pairs). I used
-debug 2 as you said and below are the results. Could you please help
me understand why the perplexity numbers are so high for the EN
sentences, since they are well formed? For testing Spanish I used
machine-translated output, so I was expecting big numbers for ppl.
Thank you!

Sixty-six parent coordinators were laid off," the draft complaint says, "and not merely excessed.
	p( Sixty-six | <s> ) = [1gram] 2.16995e-09 [ -8.66355 ]
	p( parent | Sixty-six ...) = [1gram] 1.0949e-05 [ -4.96063 ]
	p( coordinators | parent ...) = [1gram] 3.37871e-07 [ -6.47125 ]
	p( were | coordinators ...) = [1gram] 0.00120231 [ -2.91998 ]
	p( laid | were ...) = [2gram] 0.000696035 [ -3.15737 ]
	p( off," | laid ...) = [1gram] 2.33407e-08 [ -7.63189 ]
	p( the | off," ...) = [2gram] 0.0469306 [ -1.32854 ]
	p( draft | the ...) = [2gram] 7.67904e-05 [ -4.11469 ]
	p( complaint | draft ...) = [1gram] 8.13141e-07 [ -6.08983 ]
	p( says, | complaint ...) = [1gram] 1.17395e-05 [ -4.93035 ]
	p( "and | says, ...) = [2gram] 0.00147669 [ -2.83071 ]
	p( not | "and ...) = [1gram] 0.000275198 [ -3.56035 ]
	p( merely | not ...) = [2gram] 0.00173666 [ -2.76029 ]
	p( <unk> | merely ...) = [1gram] 0.0796503 [ -1.09881 ]
	p( </s> | <unk> ...) = [1gram] 0.0258359 [ -1.58778 ]
1 sentences, 14 words, 0 OOVs
0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9

Mexico's Enrique Pena Nieto faces tough start
	p( Mexico's | <s> ) = [2gram] 1.31547e-06 [ -5.88092 ]
	p( Enrique | Mexico's ...) = [1gram] 1.34348e-05 [ -4.87177 ]
	p( Pena | Enrique ...) = [1gram] 1.83116e-06 [ -5.73727 ]
	p( Nieto | Pena ...) = [1gram] 1.6622e-06 [ -5.77932 ]
	p( faces | Nieto ...) = [1gram] 1.61354e-05 [ -4.79222 ]
	p( tough | faces ...) = [1gram] 2.80928e-06 [ -5.5514 ]
	p( start | tough ...) = [1gram] 2.90611e-05 [ -4.53669 ]
	p( </s> | start ...) = [1gram] 0.00941231 [ -2.0263 ]
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964

The NATO mission officially ended Oct. 31.
	p( The | <s> ) = [2gram] 0.143584 [ -0.842893 ]
	p( NATO | The ...) = [3gram] 5.55208e-06 [ -5.25554 ]
	p( mission | NATO ...) = [1gram] 3.10877e-05 [ -4.50741 ]
	p( officially | mission ...) = [1gram] 2.81221e-05 [ -4.55095 ]
	p( ended | officially ...) = [2gram] 0.00976927 [ -2.01014 ]
	p( Oct. | ended ...) = [1gram] 2.4073e-07 [ -6.61847 ]
	p( 31. | Oct. ...) = [1gram] 3.60453e-06 [ -5.44315 ]
	p( </s> | 31. ...) = [2gram] 0.907671 [ -0.0420717 ]
1 sentences, 7 words, 0 OOVs
0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6
From nemeskeyd at gmail.com  Tue Jan 24 07:27:56 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Tue, 24 Jan 2017 16:27:56 +0100
Subject: [SRILM User List] perplexity results

If you have a look at the content of the first square brackets, you can
see that very few words come from 2-grams or higher. What this means is
that the model could almost never find the context in the training data
and had to fall back on the unigram model quite a lot, so what you see
here is basically the performance of an -order 1 model -- but the
numbers seem quite high even for that...

Are you sure the commands you issued were the ones in your mail? If
yes, it would be interesting to see statistics of the corpus you used.
How big is the vocabulary? How big are the unigram frequencies? Is it
possible that the distribution has a very long tail, and almost all
words occur only 1-2 times?

I would also do some preprocessing on the data, like lowercasing
everything and running a tokenizer on it to split e.g. '"and' into the
two tokens '"' and 'and'.

On Tue, Jan 24, 2017 at 2:46 PM, Stef M wrote:
> Hello David.
>
> I used the Wikipedia parallel corpus en-es for training the two LMs
> (http://opus.lingfil.uu.se/Wikipedia.php, 1.8M sentence pairs). I used
> -debug 2 as you said and below are the results. [...]
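A rough sketch of that kind of preprocessing, assuming an ASCII-ish
corpus (GNU tr does not lowercase multibyte characters, so accented
Spanish text really needs a locale-aware tool or a proper tokenizer
such as Moses' tokenizer.perl; the file names here are placeholders):

tr '[:upper:]' '[:lower:]' < corpus.txt | sed 's/\([",.;:!?]\)/ \1 /g' > corpus.tok.txt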
From stolcke at icsi.berkeley.edu  Tue Jan 24 10:06:00 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 24 Jan 2017 10:06:00 -0800
Subject: [SRILM User List] perplexity results
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
Message-ID: <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>

Make sure text normalization is consistent between training and test
data (e.g., capitalization - consider mapping to lower-case - and the
encoding of diacritics).

Also, you're using -unk, i.e., your model contains an unknown-word
token, which means OOVs get assigned a non-zero, but possibly very low,
probability. This could mask a big divergence in the vocabulary, and
the high perplexity could be the result of lots of OOV words that all
get a low probability via <unk>. Try training without -unk and observe
the tally of OOVs in the ppl output.

Andreas

On 1/24/2017 4:57 AM, Dávid Nemeskey wrote:
> Hi,
>
> it is hard to tell without knowing e.g. the training set. But I would
> try running ngram with higher values for -debug. [...]
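As a concrete illustration of this suggestion, here is the
training/scoring pair from earlier in the thread with -unk simply
removed (file names as in the original commands):

$NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -text Wikipedia.en-es.es -lm lm/lmodel_es.lm
$NGRAM_FILE -order 5 -debug 1 -lm lm/lmodel_es.lm -ppl testlabeled.en-es.es > perplexity_es_testlabeled.ppl

Without -unk the model has a closed vocabulary: test words never seen
in training are excluded from the logprob and counted in the "N OOVs"
field of the output, which makes a vocabulary mismatch easy to spot.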
From xulikui123321 at 163.com  Wed Feb  8 23:20:37 2017
From: xulikui123321 at 163.com (Xu)
Date: Thu, 9 Feb 2017 15:20:37 +0800 (CST)
Subject: [SRILM User List] perplexity results
In-Reply-To: <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
 <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
Message-ID: <3ced5bc5.74d0.15a21bed04e.Coremail.xulikui123321@163.com>

Hi, Andreas:

I want to turn the ngram program into a web service, so I can query
perplexity from a browser page. In the ~/srilm/lm/src directory I
rewrote ngram.cc, named it ngramService.cc, and then compiled it with
the following command:

g++ -m64 -I ~/srilm/include/ -c ngramService.cc -o ngramService

The compile succeeded. But when I execute it, the system prompts:

./ngramService: cannot execute binary file

even after a chmod +x ngramService command.

Am I missing something in the compile command? My machine is 64-bit;
when I type uname -a:

Linux bjzw_48_43 2.6.32-504.23.4.el6.x86_64 #1 SMP Fri May 29 10:16:43 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

From mstefd22 at gmail.com  Thu Feb  9 04:21:29 2017
From: mstefd22 at gmail.com (Stef M)
Date: Thu, 9 Feb 2017 13:21:29 +0100
Subject: [SRILM User List] perplexity results
In-Reply-To: <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
 <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>

Hello David and Andreas,

sorry for replying so late. Thank you very much for your suggestions.
Indeed, I had forgotten to preprocess the test set. I got better
results after preprocessing, so thanks a lot for pointing it out!

2017-01-24 19:06 GMT+01:00 Andreas Stolcke:
> Make sure text normalization is consistent between training and test
> data (e.g., capitalization - consider mapping to lower-case - and the
> encoding of diacritics).
>
> Also, you're using -unk [...] Try training without -unk and observe
> the tally of OOVs in the ppl output.
>
> Andreas [...]
From nshmyrev at yandex.ru  Thu Feb  9 05:25:45 2017
From: nshmyrev at yandex.ru (Nickolay V. Shmyrev)
Date: Thu, 09 Feb 2017 16:25:45 +0300
Subject: [SRILM User List] perplexity results
In-Reply-To: <3ced5bc5.74d0.15a21bed04e.Coremail.xulikui123321@163.com>
References: <1323358276.4219167.1485260096850.ref@mail.yahoo.com>
 <1323358276.4219167.1485260096850@mail.yahoo.com>
 <5acfde11-cb11-a6f3-2d4b-c814684f3880@icsi.berkeley.edu>
 <3ced5bc5.74d0.15a21bed04e.Coremail.xulikui123321@163.com>
Message-ID: <652881486646745@web7m.yandex.ru>

The option `-c` compiles an object file, a temporary file you cannot
execute. To create an executable you need to link the object files and
the libraries.

You can learn more about basic GCC usage from the documentation; for
example, this book is good:

http://www.network-theory.co.uk/docs/gccintro/index.html

You need Chapter 2, called "Compiling a C program".

09.02.2017, 10:59, "Xu":
> Hi, Andreas:
>
> I want to turn the ngram program into a web service, so I can query
> perplexity from a browser page. [...]
>
> Am I missing something in the compile command?
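For concreteness, a sketch of the compile-and-link steps Nickolay
describes, assuming SRILM was built in ~/srilm with
MACHINE_TYPE=i686-m64 (the library directory and the exact library
list -- liboolm, libdstruct, libmisc, plus zlib -- can differ between
SRILM versions and builds, so check your own lib directory):

g++ -m64 -I ~/srilm/include -c ngramService.cc -o ngramService.o
g++ -m64 -o ngramService ngramService.o -L ~/srilm/lib/i686-m64 -loolm -ldstruct -lmisc -lz

The first command only produces the object file ngramService.o; it is
the second, linking step that produces a runnable binary.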
From kalpeshk2011 at gmail.com  Sat Mar 11 11:32:11 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Sun, 12 Mar 2017 01:02:11 +0530
Subject: [SRILM User List] Generate Probability Distribution

Hello,

I have a context of words and I've built an N-gram language model using
./ngram-count. I wish to generate a probability distribution (over the
entire vocabulary of words) for the next word. I can't seem to find a
good way to do this with ./ngram. What's the best way to do this?

For example, if my vocabulary has the words "apple, banana, carrot",
and my context is "apple banana banana carrot", I want a distribution
like {"apple": 0.25, "banana": 0.5, "carrot": 0.25}.

Thank you,
Kalpesh Krishna
http://martiansideofthemoon.github.io/

From nemeskeyd at gmail.com  Sun Mar 12 00:12:07 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Sun, 12 Mar 2017 09:12:07 +0100
Subject: [SRILM User List] Generate Probability Distribution

Hi Kalpesh,

well, there's LM::wordProb(VocabIndex word, const VocabIndex *context)
in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram
model). You could simply call it on every word in the vocabulary.
However, be warned that this will be very slow for any reasonable
vocabulary size (say 10k and up). This function is also what
generateWord() calls, which is why the latter is so slow.

If you just wanted the top n most probable words, the situation would
be a bit different. Then wordProb() wouldn't be the optimal solution,
because the trie built by ngram is reversed (meaning you have to go
back from the word to the root, and not the other way around), and you
would have to query all words to get the most probable one. So when I
wanted to do this, I built another trie (from the root up to the word),
which made it much faster, though I am not sure it was 100% correct in
the face of negative backoff weights. But it wouldn't help in your
case, I guess.

Best,
Dávid

On Sat, Mar 11, 2017 at 8:32 PM, Kalpesh Krishna wrote:
> Hello,
> I have a context of words and I've built an N-gram language model
> using ./ngram-count. I wish to generate a probability distribution
> (over the entire vocabulary of words) for the next word. [...]
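A minimal sketch of the brute-force loop Dávid describes, written
against SRILM's C++ API as I understand it (Vocab, Ngram, VocabIter,
File from the lm and dstruct headers); the model file name, order, and
context words are placeholders:

#include <stdio.h>
#include "File.h"
#include "Vocab.h"
#include "Ngram.h"

int main() {
    Vocab vocab;
    Ngram lm(vocab, 5);                      /* order-5 model */

    File lmFile("lmodel.lm", "r");           /* placeholder LM file */
    lm.read(lmFile);                         /* also populates vocab */

    /* SRILM contexts are most-recent-word first, Vocab_None terminated */
    VocabIndex context[5];
    context[0] = vocab.getIndex("carrot");
    context[1] = vocab.getIndex("banana");
    context[2] = vocab.getIndex("banana");
    context[3] = vocab.getIndex("apple");
    context[4] = Vocab_None;

    /* enumerate P(w | context) for every word in the vocabulary */
    VocabIter iter(vocab);
    VocabIndex wid;
    while (iter.next(wid)) {
        LogP lp = lm.wordProb(wid, context); /* base-10 log prob */
        printf("%s\t%g\n", vocab.getWord(wid), LogPtoProb(lp));
    }
    return 0;
}

Dávid's caveat applies: this is one wordProb() lookup per vocabulary
word for each context, so it is fine for a one-off query but slow
inside a tight decoding loop.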
From kalpeshk2011 at gmail.com  Sun Mar 12 04:10:33 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Sun, 12 Mar 2017 16:40:33 +0530
Subject: [SRILM User List] Generate Probability Distribution

Hi Dávid,

Thank you for your response. Are there any existing binaries which will
help me do this quickly? I don't mind a non-SRILM ARPA file reader
either. Yes, the top N words might be good enough in my use case,
especially when they cover more than 99% of the probability mass. I
like the idea of building a trie to do this.

Thank you,
Kalpesh

On 12 Mar 2017 1:42 p.m., "Dávid Nemeskey" wrote:
> Hi Kalpesh,
>
> well, there's LM::wordProb(VocabIndex word, const VocabIndex *context)
> in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram
> model). You could simply call it on every word in the vocabulary. [...]
From stolcke at icsi.berkeley.edu  Mon Mar 13 10:11:06 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 13 Mar 2017 10:11:06 -0700
Subject: [SRILM User List] Generate Probability Distribution
Message-ID: <431d6b37-12ae-155f-186e-7e67c1249814@icsi.berkeley.edu>

A brute-force solution to this (if you don't want to modify any code)
is to generate an N-gram count file of the form

apple banana banana carrot apple 1
apple banana banana carrot banana 1
apple banana banana carrot carrot 1

and pass it to

ngram -lm LM -order 5 -counts COUNTS -debug 2

If you want to make a minimal code change to enumerate all conditional
probabilities for any context encountered, you could do so in
LM::wordProbSum() and have it dump out the word tokens and their log
probabilities. Then process some text with ngram -debug 3.

Andreas

On 3/12/2017 12:12 AM, Dávid Nemeskey wrote:
> Hi Kalpesh,
>
> well, there's LM::wordProb(VocabIndex word, const VocabIndex *context)
> in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram
> model). You could simply call it on every word in the vocabulary. [...]
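One hypothetical way to generate such a COUNTS file for every word in
a vocabulary (vocab.txt and the context are placeholders; a word list
can be produced with ngram-count -write-vocab):

awk '{ print "apple banana banana carrot", $1, 1 }' vocab.txt > COUNTS
ngram -lm LM -order 5 -counts COUNTS -debug 2

With -debug 2, each count line is scored and the conditional log
probability of its final word given the preceding context is printed,
which is exactly the distribution asked for above.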
From nemeskeyd at gmail.com  Tue Mar 14 01:58:12 2017
From: nemeskeyd at gmail.com (Dávid Nemeskey)
Date: Tue, 14 Mar 2017 09:58:12 +0100
Subject: [SRILM User List] Generate Probability Distribution

Hi Kalpesh,

I could send you a binary, but that, as I mentioned above, is only PAC
(not in the machine learning sense). So there would be some work
involved first:
- sort the words in my trie by frequency, not alphanumerically
- always check the lower trie node, especially if the backoff weight
  is > 0.

These changes shouldn't take much time, and they would cut the cost
tremendously (if you want the top k words, then O(nk) instead of
O(Vk)). So I think it would make more sense to send you the code, but I
based it on an older version of SRILM, so if you are using the latest
one, it might not be so simple to port just by looking at my version.
If you have a GitHub account, though, I could give you access to my
private repo, and then you would see exactly what I changed.

Best,
Dávid

On Sun, Mar 12, 2017 at 12:10 PM, Kalpesh Krishna wrote:
> Hi Dávid,
> Thank you for your response. Are there any existing binaries which
> will help me do this quickly? [...]
From kalpeshk2011 at gmail.com  Tue Mar 14 22:48:33 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Wed, 15 Mar 2017 11:18:33 +0530
Subject: [SRILM User List] Generate Probability Distribution
In-Reply-To: <431d6b37-12ae-155f-186e-7e67c1249814@icsi.berkeley.edu>
References: <431d6b37-12ae-155f-186e-7e67c1249814@icsi.berkeley.edu>

Thank you Andreas! This approach is getting me the probabilities really
quickly (within 0.5 seconds, including the pre- and post-processing
steps in a Python wrapper on a single core). It was very satisfying to
see `np.sum(distribution)` returning values like `0.99999994929300007`.

Thank you for your help Dávid! I'd love to have a look at your code.
Here is my GitHub handle - martiansideofthemoon

With Regards,
Kalpesh Krishna
http://martiansideofthemoon.github.io/

On Mon, Mar 13, 2017 at 10:41 PM, Andreas Stolcke wrote:
> A brute-force solution to this (if you don't want to modify any code)
> is to generate an N-gram count file of the form [...]
>
> and pass it to
>
> ngram -lm LM -order 5 -counts COUNTS -debug 2 [...]

--
Kalpesh Krishna,
Junior Undergraduate,
Electrical Engineering,
IIT Bombay
From maituanbk2012 at gmail.com  Fri Mar 17 07:44:48 2017
From: maituanbk2012 at gmail.com (Van Tuan MAI)
Date: Fri, 17 Mar 2017 15:44:48 +0100
Subject: [SRILM User List] build a language model multiword

hello,

I have a text file that contains all the words in a story, and a vocab
file that includes not only normal words but also wrongly pronounced
words (a, b(b1, b2), c(c1, c2, c3)). So can I add b1, b2, c1, c2 into
the N-gram models?

thanks in advance

Best,

From stolcke at icsi.berkeley.edu  Fri Mar 17 11:54:24 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 17 Mar 2017 11:54:24 -0700
Subject: [SRILM User List] build a language model multiword
Message-ID: <741fb878-5376-98db-de35-bd1f0e6fd2c1@icsi.berkeley.edu>

On 3/17/2017 7:44 AM, Van Tuan MAI wrote:
> hello,
>
> I have a text file that contains all the words in a story, and a vocab
> file that includes not only normal words but also wrongly pronounced
> words (a, b(b1, b2), c(c1, c2, c3)). So can I add b1, b2, c1, c2 into
> the N-gram models?

I'm not sure I fully understand your notation (can you give examples of
what b, b1, b2, etc. stand for?), but you can train an LM on "normal"
or "wrong" words as you wish. The software makes no difference between
them.

You have to experiment to find out whether mapping "wrong" to "normal"
words (usually called "text normalization" or TN) would help the
performance of your overall system. The rationale for TN is that it
reduces the sparseness of your data and thereby improves
generalization. Also, if you have a postprocessing step that interprets
the words, it might help to deal only with "normal" words.

Andreas

From maituanbk2012 at gmail.com  Thu Mar 23 05:03:06 2017
From: maituanbk2012 at gmail.com (Van Tuan MAI)
Date: Thu, 23 Mar 2017 13:03:06 +0100
Subject: [SRILM User List] Build a 3-gram language model for HTK project

hello,

Now I want to build a 3-gram model for an HTK project. So how can I
combine HTK and SRILM for speech recognition?

thanks for your time,

Van Tuan