garbage removal

David Relson relson at osagesoftware.com
Thu May 8 20:31:47 CEST 2003


At 02:21 PM 5/8/03, Barry Gould wrote:

>Due to the large size (22MB & 5MB) of my good & spam db's, I decided to 
>try dropping all the words with count=1 as previously suggested.
>
>However, on the spam db, I get:
># bogoutil -d spamlist.db |  bogoutil -l spamlist.db.new -c 1
>bogoutil: Unexpected input [sÛ'Œ] on line 4. Expecting whitespace before count
>#
>
>Those look like non-ascii characters.
>
>Is there another command I can (should?) run to remove garbage like this 
>from the dbs?
>
>bogofilter is 0.10.0
>
>Thanks,
>Barry

Barry,

The bogofilter.cf file has a "replace_nonascii_characters" option.  It 
converts characters between 0x80 and 0xFF to '?' (question marks).  I use 
it because it lessens the impact of Asian spam.  However, many (most?) 
European languages have accented characters in the 0x80-0xFF range that 
will also be affected.
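
To make the effect of that conversion concrete, here is a rough Python 
sketch of the byte replacement described above.  It is only an 
illustration, not bogofilter's actual code; the function name 
replace_nonascii is made up for this example, and it assumes tokens are 
handled as raw bytes.

```python
def replace_nonascii(token: bytes) -> bytes:
    """Hypothetical helper: map every byte in 0x80-0xFF to '?' (0x3F).

    Bytes below 0x80 (plain ASCII) are passed through unchanged.
    """
    return bytes(0x3F if b >= 0x80 else b for b in token)


if __name__ == "__main__":
    # An accented Latin-1 word loses its accent, which is the side
    # effect on European languages mentioned above.
    sample = "caf\xe9".encode("latin-1")
    print(replace_nonascii(sample))
```

Note that distinct words differing only in their high-bit characters 
collapse into the same token after this replacement.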

Bogoutil has a "-n" option that does the same conversion.  It can be used 
with either -d (dump) or -l (load).

Let me know if this helps!

David

P.S.  There have been lots of changes since 0.10.  In particular, 
bogofilter 0.11 introduced support for multipart MIME messages with 
handling of plain text and HTML text as well as decoding of base64, 
quoted-printable, and uuencoded text.  There have also been speed improvements.





