Nigerian spam [was: multiple types of spam]

David Relson relson at osagesoftware.com
Thu Jul 3 16:08:14 CEST 2003


At 09:19 AM 7/3/03, Andrew Pimlott wrote:
>On Wed, Jul 02, 2003 at 09:22:40PM -0700, Max Rible wrote:
> > Most of the 419 mail I get doesn't get recognized as such by
> > bogofilter.
>
>I'd been meaning to do some more research and experimentation before
>writing, but since this came up:  Is this the common experience?
>I've been disappointed at how easily these things slip by
>bogofilter.  When I look at the -vvv diagnostics, it seems clear
>that the reason is that the large number of harmless words (since
>these are long and varied narratives) swamps the spam words.
>
>Paul Graham's articles suggest that he doesn't have a problem with
>these spams.  The difference that immediately jumps out is that he
>bases his scores on only a handful of words.  I haven't seen any
>discussion of why bogofilter uses all words.  It seems to make it
>trivial for spammers to disguise their spam.
>
>Can I throw in another question?  Why do so many scores end up
>within epsilon of .5?
>
>I'm still using 0.12.2, with a default configuration, if it matters.

Andrew,

A newer version of bogofilter may well help.  Version 0.13.0 made a number
of significant parsing changes.  Bogofilter now:

is case sensitive, i.e. "PLEASE" and "Please" are now different tokens,
tags subject lines:  "Subject: PLEASE ASSIST" becomes "subj:PLEASE" and
"subj:ASSIST",
tokenizes HTML <img>, <a>, and <font> tags.

All of these changes are of significant value in scoring messages.
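
To make the first two parsing changes concrete, here is a rough sketch in
Python.  It is not bogofilter's actual lexer; the function name
tokenize_line and the simple word pattern are made up for illustration,
and HTML tag handling is left out.

```python
import re

# Hypothetical word pattern, far simpler than a real mail lexer.
WORD = re.compile(r"[A-Za-z0-9'$-]+")

def tokenize_line(line):
    """Split one line into tokens, keeping case and tagging Subject: words."""
    prefix = ""
    if line.startswith("Subject:"):
        prefix = "subj:"                 # subject words get their own tag
        line = line[len("Subject:"):]
    # No lowercasing: "PLEASE" and "Please" stay distinct tokens.
    return [prefix + word for word in WORD.findall(line)]

print(tokenize_line("Subject: PLEASE ASSIST"))    # ['subj:PLEASE', 'subj:ASSIST']
print(tokenize_line("PLEASE assist me, Please"))  # ['PLEASE', 'assist', 'me', 'Please']
```

The point is that a shouted "PLEASE" in a subject line can now build up
its own spam/ham counts, separate from an ordinary "Please" in a body.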

Also, the "tuning" directory contains a script, bogotune, which will scan
_your_ email to determine the optimal parameters for running bogofilter at
your site.

FWIW, I took a quick look at my stored messages (using "grep president") 
and found 44 Nigerian spams.  Of them, 1 was scored as ham, 24 as spam, and 
19 as unsure.  These messages have been received since last October, so 
many different versions of bogofilter have been used, including all three 
algorithms (graham, robinson, and fisher) and a variety of spam and ham 
cutoff values.
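
On the point about harmless words swamping the spam words, a toy
calculation may help.  This is only a sketch of Graham-style combining,
p = prod(p_i) / (prod(p_i) + prod(1 - p_i)), applied either to the
top_n tokens farthest from 0.5 or to every token.  It is not bogofilter's
robinson or fisher calculation, and the per-token probabilities (ten spam
words at 0.99, two hundred mildly hammy words at 0.4) are invented.

```python
import math

def combined_score(probs, top_n=None):
    """Graham-style combined spam probability.

    probs must lie strictly between 0 and 1.  If top_n is given, only the
    top_n values farthest from 0.5 are used; otherwise all are used.
    """
    if top_n is not None:
        probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    # Work in logs so long messages do not underflow.
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    diff = log_ham - log_spam
    # score = 1 / (1 + exp(diff)), written to avoid overflow in exp().
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))

# Invented example: a long narrative with a few strong spam words.
probs = [0.99] * 10 + [0.4] * 200

print(combined_score(probs, top_n=15))  # very close to 1
print(combined_score(probs))            # very close to 0
```

With these made-up numbers, the handful-of-words score is dominated by
the ten strong spam tokens, while the all-words score is dragged toward
ham by the sheer count of mildly innocent tokens.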

David





