Tipping point?

David Relson relson at osagesoftware.com
Fri Nov 14 19:59:40 CET 2003


On Fri, 14 Nov 2003 18:22:24 +0000
Geoff <capsthorne at yahoo.co.uk> wrote:

> Hi,
> 
> I am an enthusiastic but non-statistically-savvy user.
> Apologies in advance, therefore, if my question is
> laughable.  Here goes:
> 
> Does a point come at which a word (or a small number of
> words) has such a high spam count that its presence
> will result in an email being categorised as spam
> notwithstanding the presence of a few previously
> unencountered words?
> 

Welcome Geoff,

The short answer is "No".  Bogofilter parses the whole message, looks up
each token, and computes the score from that.  There are some parameters
that affect the score given to unknown or little-used words, and other
parameters that tell it to ignore words that score close to 0.5.  If you
want to see the individual word scores, use "bogofilter -vvv < message".
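
As a rough illustration only (this is not bogofilter's actual code), the
scoring described above can be sketched in Python along the lines of the
Robinson-Fisher method.  The function names are made up for this example,
and the values used for robs, robx and min_dev are illustrative
assumptions, not necessarily the defaults of any bogofilter release.

```python
import math

def chi2q(x2, df):
    """Probability that a chi-square variable with (even) df exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def token_score(spam_count, ham_count, spam_msgs, ham_msgs, robs, robx):
    """Robinson-style f(w): the raw estimate is pulled toward robx
    when the word has been seen rarely or not at all."""
    n = spam_count + ham_count
    if n == 0:
        return robx                      # unknown word
    bad = spam_count / spam_msgs
    good = ham_count / ham_msgs
    p = bad / (bad + good)
    return (robs * robx + n * p) / (robs + n)

def message_score(scores, min_dev):
    """Combine token scores; tokens within min_dev of 0.5 are ignored."""
    used = [f for f in scores if abs(f - 0.5) >= min_dev]
    if not used:
        return 0.5                       # assumption: nothing to go on
    df = 2 * len(used)
    p = chi2q(-2.0 * sum(math.log(f) for f in used), df)
    q = chi2q(-2.0 * sum(math.log(1.0 - f) for f in used), df)
    return (1.0 + p - q) / 2.0

if __name__ == "__main__":
    # Hypothetical counts: a word seen only in spam, 20 times vs 2000 times.
    ROBS, ROBX, MIN_DEV = 0.01, 0.5, 0.1   # illustrative values
    for count in (20, 2000):
        print(count, token_score(count, 0, 3000, 3000, ROBS, ROBX))
    unknown = token_score(0, 0, 3000, 3000, ROBS, ROBX)
    print(message_score([0.99, 0.2, 0.1, unknown], MIN_DEV))
```

In this sketch a word's score is already very close to 1.0 after 20
spam-only sightings, so pushing its count to 2000 changes almost
nothing, and the message score still depends on every token that
survives the min_dev filter.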

> After several weeks of running bogofilter I am deeply
> impressed, but I am puzzled by the fact that the mails which
> are most effective at getting through to my inbox are
> typically not html-bloated monstrosities, but short
> and simple ones, advertising the usual
> pharmaceuticals accompanied by a handful of rare,
> previously unencountered, words. Running bogoutil -w on the
> pharmaceuticals typically gives a score approaching 2000 (I
> have artificially inflated this by putting some examples
> through bogofilter more than once).

Again, use "bogofilter -vvv < message" to see what bogofilter is
considering.  The FAQ has some info on its output.  The html
monstrosities tend to get caught because they have lots of scorable
tokens for bogofilter to work with.  Short messages can be harder to get
right because they contain few tokens (clues).

> The offending mails are generally categorised as
> spam after one pass through bogofilter .. but the next
> variation will still get through by the inclusion of only
> a few new words. Is there any end to this if the spam counts
> on the pharmaceuticals reach some (and if so what) level?

In the long run, 'tis best to train bogofilter with a message only once.

Hope this helps!

David



