Bug reading mbox? (was: bogofilter-0.96.5 a.k.a. 1.0.0rc5)

Fri Nov 11 00:44:42 CET 2005

On Thu, 10 Nov 2005 16:43:08 +0100
Boris 'pi' Piwinger wrote:

> Hi!
> 
> I see some problem with the new version. When doing a
> (re)training session with bogominitrain.pl I first go
> through the messages one by one to do the training and then
> check the complete mbox in one run. Until the last version
> it worked without a problem. Now, suddenly, I get a lot of
> mistakes (e.g. it says that I have 77 false negatives), but when
> checking them one by one there are no false negatives. So it
> looks like -M is producing errors. Sorry, I did not have
> time yet to check the details.
> 
> pi

Hi pi,

Early Unicode versions of bogofilter did the conversion to Unicode
before decoding (base64 or qp).  _That_ problem was corrected in 0.96.2.
However, the fix revealed some problems when image attachments were
decoded and run through iconv (for conversion to Unicode).  Long ago,
changes were made so that bogofilter would skip binary attachments, but
image and application attachments were overlooked.  _Those_ problems
were fixed in 0.96.3.

There have been no changes to '-M' or related code.  I suspect that
what you're seeing now is the result of skipping binary attachments
and the resulting change in the tokens generated.  More information
is needed to be sure.

To see how new and old versions of bogofilter are scoring mailboxes,
try the following:

   bogofilter-old -v -M < mailbox > old.out
   bogofilter-new -v -M < mailbox > new.out
   diff old.out new.out
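
If the diff is long, it may help to reduce it to the positions of the
lines that changed.  The following is only a sketch: changed_messages
is a hypothetical helper, not part of bogofilter, and it assumes that
'-v -M' writes exactly one line per message and that those lines
contain no tab characters.  Check your own old.out before relying on
that.

```shell
# Hypothetical helper, not part of bogofilter.
# Assumption: each file holds one line per message, in mailbox order,
# and no line contains a tab character.
# Prints "<line number> TAB <old line> TAB <new line>" for every
# position where the two files differ.
changed_messages() {
    paste -d '\t' "$1" "$2" |
        awk -F '\t' '$1 != $2 { printf "%d\t%s\t%s\n", NR, $1, $2 }'
}
```

Running "changed_messages old.out new.out" then lists only the
positions whose output differs between the two versions.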

Using bogolexer to see the actual tokens might be better:

   bogolexer-old -v -p -M < mailbox > old.out
   bogolexer-new -v -p -M < mailbox > new.out
   diff old.out new.out
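
A plain diff of two token streams can be noisy.  As a rough
alternative, the sketch below lists the distinct lines that appear in
one file but not in the other.  tokens_only_in is a hypothetical
helper, and it assumes the bogolexer output has one token per line;
if your output has a different layout, adjust accordingly.

```shell
# Hypothetical helper, not part of bogofilter.
# Assumption: each input file holds one token per line.
# Prints the distinct lines found in the first file but not the second.
tokens_only_in() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Sort and compare in the C locale so both tools agree on ordering.
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

"tokens_only_in old.out new.out" shows tokens that only the old
version produced; swapping the arguments shows tokens that only the
new version produced.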

Let us know what you find.

Regards,

David