Bogofilter-0.15.9 - New Current Release

David Relson relson at osagesoftware.com
Mon Nov 24 22:23:04 CET 2003


On Mon, 24 Nov 2003 21:08:49 +0000
Richard Kimber <rkimber at ntlworld.com> wrote:

> On Mon, 24 Nov 2003 15:17:32 -0500
> David Relson <relson at osagesoftware.com> wrote:
> 
> > Retraining is necessary only when there's a major lexer change and
> > that doesn't happen very often.  If upgrading from 0.15.5 or newer,
> > there's no need to retrain.  If upgrading from 0.15.4 or older,
> > retraining would be valuable as it would add "head:" prefixes for
> > header line tokens (that aren't otherwise being tagged).
> 
> Thanks.  It's just that when I see:-
> 
> * Lexer changes reduce size of bogofilter executable by approx 90%.
> * Lexer.c no longer discards X-Bogosity lines in rfc822 attachments.
> * Removed repetition counts in lexer for TOKEN and MIME_BOUNDARY
>   patterns to reduce executable size.
> 
> it's not clear to me what's involved.  I'm not really in a position to
> judge major and minor changes, which is why I appreciate guidance.

Hi Richard,

A good question.  The short answer is that we got lucky and discovered
that some of the constructs used in the lexer lead to bloated parsing
tables.  For example, consider the following patterns:

[[:digit:][:alpha:]] will match a single letter or digit
[[:digit:][:alpha:]]+ will match one or more letters or digits
[[:digit:][:alpha:]]{3,70} will match from 3 to 70 letters or digits

It turns out that the last pattern effectively puts some 70 copies of
the digit/letter pattern into the generated lexer code.  Since
bogofilter's get_token() routine rejects tokens longer than MAXTOKENLEN
(currently 30), there is no need for the lexer to deal with length.
Removing the {3,70} construct has no effect on what bogofilter _does_,
but it has a big effect on bogofilter's size.
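
As a rough illustration only (this is Python, not bogofilter's actual
flex/C source), the sketch below spells out a bounded repetition as
repeated copies of the atom, then shows the alternative of matching runs
of any length and checking the length afterwards.  ATOM is an ASCII
stand-in for [[:digit:][:alpha:]], and expand_bounded() and tokens() are
hypothetical helper names; only MAXTOKENLEN and its value of 30 come
from the message.

```python
import re

ATOM = "[0-9A-Za-z]"   # assumption: ASCII stand-in for [[:digit:][:alpha:]]
MAXTOKENLEN = 30       # value quoted in the message above


def expand_bounded(atom: str, lo: int, hi: int) -> str:
    """Spell out atom{lo,hi}: lo mandatory copies plus (hi - lo) optional ones."""
    return atom * lo + (atom + "?") * (hi - lo)


def tokens(text: str) -> list[str]:
    """Match runs of any length, then drop over-long ones after the fact."""
    return [t for t in re.findall(ATOM + "+", text) if len(t) <= MAXTOKENLEN]


expanded = expand_bounded(ATOM, 3, 70)
print(expanded.count(ATOM))   # number of copies of the atom in the pattern
print(tokens("hello " + "x" * 40 + " world42"))
```

The expanded pattern carries one copy of the atom per allowed
character, while the "+" form carries a single copy and leaves the
length check to ordinary code.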

The other changes do slightly alter the tokens returned for a message.
For example, "head:X-Bogosity" may now be generated, and tokens no
longer end with an apostrophe (') or a backtick (`).  As best we can
tell, these changes have an insignificant effect on a message's score.
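
To make the token differences concrete, here is a small Python sketch
of the two effects described above.  It is not bogofilter's code:
normalize_token() and its in_header flag are hypothetical names, and
the real lexer does this work in its flex rules and C code.

```python
def normalize_token(raw: str, in_header: bool = False) -> str:
    """Strip trailing apostrophes/backticks; optionally add a "head:" tag."""
    token = raw.rstrip("'`")          # only trailing characters are removed
    return "head:" + token if in_header else token


print(normalize_token("don't"))                        # interior apostrophe kept
print(normalize_token("quoted'"))                      # trailing apostrophe dropped
print(normalize_token("X-Bogosity", in_header=True))   # header token gets tagged
```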

On my good days, I make accurate use of the phrases "major change" and
"minor change".  ... and then there are the other days :-(

Ciao,

David




