[cvs] Potential for error?

Allyn Fratkin allyn at fratkin.com
Tue Oct 22 05:02:12 CEST 2002


> > Also, I noticed that there were a lot of words in my lists that weren't
> > words.  Things like ab34af127 would be listed, but only once.  Based on
> > this, eventually the list files will bloat to infinity.


are you possibly training bogofilter using mailboxes from microsoft
windows, which use CRLF line endings?  bogofilter up through 0.7.5 does
not recognize and discard base64 attachments correctly with CRLF (the CR
throws it off).  it treats them as normal text and parses the
base64 data as words.  i submitted a fix for this but it didn't make it
into 0.7.5.
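
A minimal sketch of how a stray CR can defeat a line-based base64 check. The pattern and function names below are hypothetical; bogofilter's actual regexp and the submitted fix are not shown in this thread.

```python
import re

# Hypothetical "whole line is base64" pattern (an assumption, not
# bogofilter's real regexp).
BASE64_LINE = re.compile(r"[A-Za-z0-9+/]+={0,2}")

def is_base64_lf_only(raw_line: str) -> bool:
    """Strip only the LF, so a CRLF line keeps its trailing CR."""
    return BASE64_LINE.fullmatch(raw_line.rstrip("\n")) is not None

def is_base64_crlf_safe(raw_line: str) -> bool:
    """Strip CR and LF before matching, so both line endings work."""
    return BASE64_LINE.fullmatch(raw_line.rstrip("\r\n")) is not None

unix_line = "YWIzNGFmMTI3IGlzIG5vdCBhIHdvcmQ=\n"
dos_line = "YWIzNGFmMTI3IGlzIG5vdCBhIHdvcmQ=\r\n"

print(is_base64_lf_only(unix_line))    # recognized as base64
print(is_base64_lf_only(dos_line))     # missed: the CR is not a base64 char
print(is_base64_crlf_safe(dos_line))   # recognized once the CR is stripped
```

With the LF-only check, every CRLF-terminated base64 line falls through to the ordinary tokenizer, which would explain one-off "words" like ab34af127 piling up in the lists.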

my good word db went from 50MB to 3MB after i figured out and fixed this
problem.  i guess i get a lot of attachments.  :-)

by the way, it occurs to me that bogofilter will think any single word
on a line is base64 and discard it, based on the regexp it uses to
"recognize" base64.  i guess this is not too serious until spammers
start sending messages with only one word per line.  :-)
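
To illustrate the single-word problem: a permissive character-class pattern accepts any lone alphanumeric word as base64. The sketch below assumes such a pattern (hypothetical, not bogofilter's own) and shows one possible stricter heuristic. The minimum length of 40 is an arbitrary choice for illustration, and it would also reject the short final line of a real base64 block.

```python
import re

# Hypothetical permissive pattern: any run of base64 characters.
LOOSE_BASE64 = re.compile(r"[A-Za-z0-9+/]+={0,2}")

def loose_is_base64(line: str) -> bool:
    """Accept any line made up only of base64 characters."""
    return LOOSE_BASE64.fullmatch(line) is not None

def stricter_is_base64(line: str, min_len: int = 40) -> bool:
    """Also require a length that is a multiple of 4 and at least min_len.

    min_len is an assumed threshold, not a value taken from bogofilter.
    """
    return (
        LOOSE_BASE64.fullmatch(line) is not None
        and len(line) % 4 == 0
        and len(line) >= min_len
    )

word = "viagra"
encoded = "QUJD" * 19  # 76 characters, a typical full base64 line length

print(loose_is_base64(word), stricter_is_base64(word))
print(loose_is_base64(encoded), stricter_is_base64(encoded))
```

Tracking the Content-Transfer-Encoding header of each MIME part would be a more reliable signal than guessing from line shape alone, though that is a larger change than a regexp tweak.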

> Similarly, one could periodically discard any tokens whose good+spam
> count is 1.

did you mean good=spam?  i think you would definitely
want to keep a word that only appeared in one of the lists.
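
A small sketch of how the two pruning rules differ. The dict of (good, spam) counts is a stand-in for the word lists, not bogofilter's database format, and the sample tokens and counts are made up for illustration.

```python
from typing import Callable, Dict, Tuple

Counts = Dict[str, Tuple[int, int]]  # token -> (good count, spam count)

def prune(counts: Counts, should_drop: Callable[[int, int], bool]) -> Counts:
    """Return a copy of counts without the tokens should_drop selects."""
    return {
        token: (good, spam)
        for token, (good, spam) in counts.items()
        if not should_drop(good, spam)
    }

def total_is_one(good: int, spam: int) -> bool:
    """Rule as quoted: drop tokens seen exactly once overall."""
    return good + spam == 1

def counts_equal(good: int, spam: int) -> bool:
    """Alternative reading: drop tokens equally common in both lists."""
    return good == spam

# Illustrative data only.
counts = {
    "ab34af127": (1, 0),
    "meeting": (12, 0),
    "the": (40, 40),
    "viagra": (0, 1),
}

print(sorted(prune(counts, total_is_one)))   # ['meeting', 'the']
print(sorted(prune(counts, counts_equal)))   # ['ab34af127', 'meeting', 'viagra']
```

The first rule clears out one-off junk such as ab34af127 but also drops a spam-only word that has been seen just once; the second keeps every word that appears in only one list and removes only tokens that carry no signal either way.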
-- 
Allyn Fratkin             allyn at fratkin.com
Escondido, CA             http://www.fratkin.com/




