garbage removal
David Relson
relson at osagesoftware.com
Thu May 8 20:31:47 CEST 2003
At 02:21 PM 5/8/03, Barry Gould wrote:
>Due to the large size (22MB & 5MB) of my good & spam db's, I decided to
>try dropping all the words with count=1 as previously suggested.
>
>However, on the spam db, I get:
># bogoutil -d spamlist.db | bogoutil -l spamlist.db.new -c 1
>bogoutil: Unexpected input [sÛ'] on line 4. Expecting whitespace before count
>#
>
>Those look like non-ascii characters.
>
>Is there another command I can (should?) run to remove garbage like this
>from the dbs?
>
>bogofilter is 0.10.0
>
>Thanks,
>Barry
Barry,
The bogofilter.cf file has a "replace_nonascii_characters" option. It converts
characters in the range 0x80 to 0xFF to '?' (question marks). I use it because
it lessens the impact of Asian spam. However, many (most?) European
languages have accented characters in the 0x80-0xFF range that will also
be affected.
Bogoutil has a "-n" option, usable with either -d or -l, that does
the same conversion.
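
To make the effect of the conversion concrete, here is a minimal Python sketch of the byte replacement described above. It is only an illustration under my own assumptions, not bogofilter's actual code; the function name replace_nonascii is hypothetical.

```python
def replace_nonascii(token: bytes) -> bytes:
    """Return a copy of token with every byte in 0x80-0xFF replaced by '?'.

    Hypothetical helper for illustration; not part of bogofilter or bogoutil.
    """
    return bytes(b if b < 0x80 else ord("?") for b in token)


# A token like the one in the error message (0xDB is 'Û' in Latin-1).
print(replace_nonascii(b"s\xdb'"))

# Accented European text is affected too (0xE9 is 'é' in Latin-1).
print(replace_nonascii(b"caf\xe9"))  # b'caf?'
```

Note that the second call shows the trade-off: legitimate accented words lose information just as the garbage tokens do.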
Let me know if this helps!
David
P.S. There have been lots of changes since 0.10. In particular,
bogofilter 0.11 introduced support for multipart MIME messages, with
handling of plain text and HTML text as well as decoding of base64,
quoted-printable, and uuencoded text. There have also been speed improvements.