Reducing the size of the training files

David Relson relson at osagesoftware.com
Wed Apr 16 16:58:54 CEST 2003


At 10:04 AM 4/16/03, Shawn Barnhart wrote:

>----- Original Message -----
>From: "Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at>
>
> > My collection of mails for training is growing, now I have
> > about 68/53 megs of ham/spam respectively (20k+/8k+ mails).
> >
> > Lately I have regularly been getting really huge spam
> > messages (more than 300k). Since this is mostly due to
> > image or similar attachments, this is of no use for the
> > training, but I don't want to delete the mails of course. So
> > the idea is to cut down the attachments. Does someone have a
> > script to do this?
>
>I've been getting a number of attachments in the 200-300k range, the English
>language versions claiming to be some kind of internet security patch.  It's
>actually a virus (W32.Gibe@mm).
>
>It'd be nice if bogofilter *could* use attachments for the training process,
>or at least the strings contained in the attachment.
>
>I know there's processing overhead, but perhaps it could at least be an
>option.

Shawn,

If you want to run an experiment, in token.c at line 130 (approx) there's a 
"continue" statement, whose purpose is to skip the innards of 
attachments.  Comment it out; see if bogofilter does better or not; let us 
know.

David
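
On Boris's original question about a script to cut down attachments: the
following is a minimal sketch in Python, not something that ships with
bogofilter. It assumes each mail is available as raw bytes and that every
non-text MIME part counts as an "attachment". The function name
strip_attachments and the wording of the stub line are made up for
illustration.

```python
import email
from email import policy


def strip_attachments(raw_bytes):
    """Return the message as bytes, with each non-text leaf part
    replaced by a one-line text stub (hypothetical helper)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in msg.walk():
        # Containers (multipart/*, message/rfc822) hold no data themselves.
        if part.is_multipart():
            continue
        # Keep text parts (text/plain, text/html): these carry the tokens.
        if part.get_content_maintype() == "text":
            continue
        name = part.get_filename() or "unnamed"
        ctype = part.get_content_type()
        size = len(part.get_payload(decode=True) or b"")
        # set_content() drops the old Content-* headers and payload,
        # then stores the stub as a text/plain part.
        part.set_content(
            "[attachment removed: %s, %s, %d bytes]\n" % (name, ctype, size)
        )
    return msg.as_bytes()
```

The headers and text parts are kept, so the trimmed mail should still
train much as before while taking a fraction of the space. Badly malformed
spam may not survive the parse/serialize round trip, so keep the originals
until the output has been checked.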





