strange symbols

Andreas Stolcke stolcke at speech.sri.com
Tue Jul 29 23:27:13 PDT 2008


They look like ASCII control characters (character values < 0x20).
You need to do a better job filtering your training data.
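
A minimal sketch of such a filter, in Python. It assumes the training text
is in an ASCII-compatible encoding (ASCII, ISO-8859-x, UTF-8), so that bytes
below 0x20 can only be control characters. Tab, newline and carriage return
are kept because they act as whitespace. The names strip_control_chars and
filter_stream are made up for this example and are not part of SRILM.

```python
import re

# Bytes below 0x20, except tab (0x09), newline (0x0A) and carriage
# return (0x0D), which are ordinary whitespace in a text corpus.
CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(data: bytes) -> bytes:
    """Return data with ASCII control characters deleted."""
    return CONTROL_BYTES.sub(b"", data)


def filter_stream(src, dst) -> None:
    """Copy binary stream src to dst line by line, dropping control characters.

    Working on bytes avoids decode errors on badly encoded Web data.
    """
    for line in src:
        dst.write(strip_control_chars(line))
```

To use it as a command-line filter, call
filter_stream(sys.stdin.buffer, sys.stdout.buffer) after importing sys, and
run the training text through it before building the LM.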

--Andreas

In message <79a042480807291803w44eb15c8ic8d4c4ef4a0e8182 at mail.gmail.com> you wrote:
> 
> Dear all,
> I'm using SRILM on some data crawled from the Web. The LM contains some
> strange symbols such as these:
> \1-grams:
> -6.774207       ^A      0
> -6.774207       ^C
> -6.774207       ^D
> -6.774207       ^E      0
> -6.774207       ^F      0
> -6.774207       ^G      0
> -6.774207       ^H      0
> -6.774207       ^K      0
> -6.774207       ^N      0
> -6.774207       ^O
> -6.774207       ^P
> -6.774207       ^T      0
> -6.774207       ^X
> -6.774207       ^Y      0
> -6.774207       ^\
> -6.774207       ^]
> -6.774207       ^^      0
> -6.774207       ^_
> 
> These symbols are not simply the combination of ^ and a letter; each seems
> to be a single, different character, perhaps one that has been truncated or
> something similar.
> Do you have any idea what they are and how to remove them?
> 
> thanks a lot
> Marco
> 



