dazed & confused

David Relson relson at osagesoftware.com
Mon Nov 10 23:41:17 CET 2003


On Mon, 10 Nov 2003 15:37:38 -0600
John McCain <jmccain at layer3al.com> wrote:

> I seem to have seen a significant performance drop between .15.7 and
> .15.8, but there is some question as to whether I am seeing what I
> think I am seeing.  I am using bogotune as a performance metric, and
> the example case I have is running .15.8's bogotune on a database
> created by .15.7 (and getting good results) as opposed to running
> .15.8's bogotune against a database created by .15.8 (and getting bad
> results).

John,

Bogotune is _not_ a performance metric, especially since the C
implementation is brand-new code.  At some future time it may turn out
to be useful for that, but I don't expect it to be.

To answer David N Murray's question about 0.15.7 vs 0.15.8, there is one
change that affects parsing and scoring.  

In 0.15.7 (and earlier versions), bogofilter would tokenize mime
attachments like images and applications.  Since those attachments are
binary data (rather than text), parsing them produces sequences of
random characters.  As we're interested in the message (not random
characters), this processing was unintended; the error was recognized
and has been fixed in 0.15.8.  When a mime body part has a Content-Type
of application or of image, bogofilter now ignores its tokens.
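
To illustrate the idea (this is only a rough sketch in Python, not
bogofilter's actual C code), the snippet below walks the mime parts of a
message and skips any part whose main type is application or image.  The
names tokenize_message, skip_binary, TOKEN_RE and SAMPLE are made up for
the example, and the token pattern is a simplifying assumption, not
bogofilter's real lexer.

```python
import re
from email import message_from_string

# Main types whose body parts are skipped, per the 0.15.8 change described above.
SKIPPED_MAIN_TYPES = {"application", "image"}

# Hypothetical token pattern: runs of 3+ letters. Bogofilter's lexer differs.
TOKEN_RE = re.compile(r"[A-Za-z]{3,}")


def tokenize_message(raw, skip_binary=True):
    """Return tokens from the leaf mime parts of a raw message string.

    With skip_binary=True, application/* and image/* parts are ignored
    (the 0.15.8 behaviour as described); with False, they are tokenized
    like any other part (the 0.15.7 behaviour as described).
    """
    msg = message_from_string(raw)
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if skip_binary and part.get_content_maintype() in SKIPPED_MAIN_TYPES:
            continue  # binary attachment: contributes no tokens
        payload = part.get_payload(decode=True)  # undoes base64/quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        text = payload.decode(charset, errors="replace")
        tokens.extend(TOKEN_RE.findall(text))
    return tokens


# A tiny made-up message: one text part and one (truncated) PNG attachment.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Cheap pills here\n"
    "--XYZ\n"
    "Content-Type: image/png\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "iVBORw0KGgoAAAANSUhEUg==\n"
    "--XYZ--\n"
)

if __name__ == "__main__":
    print(tokenize_message(SAMPLE))                     # text part only
    print(tokenize_message(SAMPLE, skip_binary=False))  # also picks up image junk
```

Calling it with skip_binary=False on the same message yields extra
tokens pulled out of the image bytes, which is the kind of noise that
made the 0.15.7 wordlists longer.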

With John's data, the application/image modification can be seen. 
0.15.7 creates a longer wordlist than does 0.15.8.  The change also has
an effect on bogotune's result.  Due to my travel and work schedule,
I've not yet been able to pinpoint the reason for the effect on
bogotune.  It _is_ possible that you (David) are being affected by this
as well.

That's what I know at the moment.  I anticipate knowing more, but it may
be several days until then.

I hope the info is helpful.

David



