From 20076223 at sun.ac.za Sat Aug 24 14:31:37 2019
From: 20076223 at sun.ac.za (Van der Merwe, W, Mnr [20076223@sun.ac.za])
Date: Sat, 24 Aug 2019 21:31:37 +0000
Subject: [SRILM User List] [External Sender] Renormalising probabilities to 1
Message-ID:

Hi,

I am a student at Stellenbosch University currently using the SRILM toolkit for one of my projects. I would like to know whether the toolkit can renormalize the probabilities in an ARPA file so that they sum to 1. I've read the documentation and am aware of the -renorm option; however, I am not seeking to renormalize backoff weights, only the probabilities.

The reason I ask is that I am writing an ARPA file myself, using probabilities produced by a neural network. Because these probabilities are estimated by a neural net, they tend not to sum to 1 exactly. I am hoping that SRILM can correct this; otherwise I will have to write a script to brute-force it.

Werner

From stolcke at icsi.berkeley.edu Sat Aug 24 17:01:11 2019
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 24 Aug 2019 17:01:11 -0700
Subject: [SRILM User List] [External Sender] Renormalising probabilities to 1
In-Reply-To:
References:
Message-ID: <3aaed5ec-18e9-b4b7-9c58-21b8f8a1a7a5@icsi.berkeley.edu>

You are correct: -renorm normalizes the model assuming the probabilities for each history sum to <= 1.
There is no option to rescale the ngram probabilities themselves. However, you are already doing your own processing to transfer the NN outputs to the ngram model format, so it would be trivial to add a normalization step that sums the probabilities for each history and rescales them if the sum is > 1.

The more serious question is how much probability mass you should allocate to unseen ngrams. If the NN estimates probabilities that sum to 1, you have a normalized model, but not a very good one, because it doesn't anticipate ever seeing a word that you haven't already seen in that context. So you should find a way to estimate the "unseen word" probability in your framework, and then include that in your normalization step.

Andreas
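The normalization step Andreas describes could look something like the sketch below (hypothetical names; it assumes the NN probabilities have been collected per history before the ARPA file is written, and that a fixed fraction of mass is reserved for unseen words — in practice that fraction should itself be estimated):

```python
import math

def renormalize(history_probs, unseen_mass=0.05):
    """Rescale per-history word probabilities so they sum to
    1 - unseen_mass, reserving the rest for unseen words.

    history_probs: dict mapping a history tuple to a dict of
    {word: probability} as estimated by the neural network.
    Returns the same structure with log10 probabilities, as
    needed for the ARPA n-gram sections.
    """
    normalized = {}
    for history, probs in history_probs.items():
        total = sum(probs.values())          # may differ from 1 due to NN estimation error
        scale = (1.0 - unseen_mass) / total  # rescale and reserve unseen-word mass
        normalized[history] = {
            word: math.log10(p * scale) for word, p in probs.items()
        }
    return normalized
```

After this step, each history's explicit probabilities sum to 1 - unseen_mass, and the reserved mass is what the backoff distribution for that history must account for.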
From victor.abrash at sri.com Tue Sep 17 15:34:06 2019
From: victor.abrash at sri.com (Victor Abrash)
Date: Tue, 17 Sep 2019 22:34:06 +0000
Subject: [SRILM User List] SRILM 1.7.3 released
In-Reply-To:
References:
Message-ID: <1aeb69d1-01d7-bb80-465e-fd245e2f7f60@sri.com>

The latest version of SRILM, 1.7.3, is now available from

http://www.speech.sri.com/projects/srilm/download.html

A list of changes appears below.

Functionality:

* Added nbest-oov-counts script to generate OOV counts for nbest hypotheses.
* Added a simple mechanism for weight tying in nbest-rover control files. A system weight of = indicates that it should be tied to the previously listed system. This is useful for reducing the number of free parameters when searching for good system combinations (search-rover-combo).
* Added Map_noKey() and Map_noKeyP() for the unsigned long long type, to enable use with size_t on Windows MSVC.
* Output from -version now includes compile-time options.
* Added option ngram -minbackoff to fix up models that have unnormalized probabilities or that are not smoothed.
* Added option ngram -unk-probs to override unknown word probabilities.
* Added nbest-optimize-args-from-rover-control script, convenient for extracting initialization parameters for nbest-optimize from an existing nbest-rover control file.
* Added ngram-count -text-has-weights-last option to allow text input with count values at the ends of lines.
* Added nbest-rover -missing-nbest option to treat missing nbest lists as if an empty hypothesis (no words) had been output, rather than simply skipping that nbest list.
* Added nbest-lattice -time-penalty option, implementing a soft constraint on time stamps (when present) during confusion network building and alignment.
* Added nbest-lattice -average-times option, to average word times instead of picking the timing of the highest-posterior hypothesis.
* Added nbest-lattice -suppress-vocab option to disallow certain words in posterior decoding.
* New script concat-sausages for chaining word confusion networks together.
* Added nbest-lattice -dump-lattice-alignments option to output mappings between sausage positions and alignment costs.
* Updated Android build for 64-bit development for armv8 using NDK r20 and clang. This almost certainly breaks the 32-bit build for armv7. The last known good 32-bit build is in common/Makefile.core.android.r11c, last built using NDK r11c. To use it, copy Makefile.core.android.r11c to Makefile.core.android. See doc/README.android.

Bug fixes:

* Added a new tool nbest-rover-helper that combines the functions of the combine-acoustic-scores and nbest-posteriors scripts, doing these computations in double precision and faster. nbest-rover now uses this tool (except when certain options like -nbest-backtrace are used).
* nbest-rover strips DOS end-of-line CR characters from the control file, so they no longer mess up the parsing of the file.
* Rationalized the way ties are broken when decoding word confusion networks. The word with the lowest internal index is now preferred (and the *DELETE* token always comes before all other words), unless the new nbest-lattice option -random-tie-break is given. The output order of alternative word hypotheses in sausage files is always by probability rank first, then by internal index.
* The reverse-ngram-counts script now replaces <s> with </s> and vice versa, as required for training reverse-direction LMs, and consistent with reverse-text.
* Handle comment lines starting with '##' and empty lines in nbest-rover control files the same way as in File::getline(), i.e., ignore them.
* Fixed the syntax for the nbest-optimize -dynamic-random-series option (it now starts with a single dash, as described in the man page).
* Don't let compute-best-mix complain about word mismatches if <unk> is involved.
* Cast input to isspace() to (unsigned char) to guarantee the input is non-negative.
* Fixed memory management problems in MEModel.
* Worked around a bug in zlib's gzprintf() printing of very long %s arguments, which caused long word strings not to be output into .gz files.
* Removed the word string length limit.
* Removed the limit on total line length when outputting ngram count files.
* Updated zlib to version 1.2.11.
* nbest-posteriors ensures that bytelog scores are output in fixed-point format.
* Allow floating point values when parsing bytelog scores in nbest lists.
* More robustness to word sausage input files that have missing data for some position.
* Fixed a performance bug when nbest-rover is invoked with the -output-ctm option.