How to deal with extremely high spam levels

Bob Vincent bogofilter at bobvincent.org
Wed Jun 23 00:01:00 CEST 2004


On Tue, Jun 22, 2004 at 02:48:21PM -0400, Jake Di Toro wrote:
> I have a similar situation.  But I started with a corpus of 500 each
> that I could train on.

Unfortunately, I blew out my entire email archive by mistake, which is
why I'm in this fix.

> Like you, I filter Tri-State.  But it is
> unclear to me if you are using -u or not.

I am not.

On Tue, Jun 22, 2004 at 03:36:46PM -0400, Tom Anderson wrote:
> I recommend you try out bfproxy
> (http://orderamidchaos.com/bogofilter/bfproxy) for easily handling
> registrations.  It works via a procmail recipe (or similar MDA setup) to
> perform bogofilter functions via email.  You simply drag incorrect
> classifications to a folder in your mail client,

I'm using mutt as my mail client.  I use the bounce (b) command to
send unsures to the proper address to register them as spam or non-spam.

> Bfproxy will also do recursive registrations, so that if after
> registering an email it still doesn't classify correctly, it will
> register it again until it does.  This helps polarize your wordlist
> to your email quicker.  Some will argue that could theoretically
> cause harm, but in practice it works wonderfully.

This is called "training to exhaustion".  No, I haven't been doing
that; I suspect that my "ham" corpus isn't large enough to make it
effective as of yet.
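
For readers unfamiliar with the term, the loop behind "training to
exhaustion" can be sketched roughly as below.  This is only an
illustration of the idea, not bfproxy's actual code: the function name,
the classify/register callables, the max_rounds safety limit and the
ToyFilter stand-in are all hypothetical.

```python
from typing import Callable


def train_to_exhaustion(
    message: bytes,
    want: str,
    classify: Callable[[bytes], str],
    register: Callable[[bytes, str], None],
    max_rounds: int = 10,
) -> int:
    """Register `message` as `want` until `classify` agrees.

    Returns the number of registrations performed. `max_rounds` is a
    hypothetical safety limit so a stubborn message cannot loop forever.
    """
    rounds = 0
    while classify(message) != want and rounds < max_rounds:
        register(message, want)
        rounds += 1
    return rounds


class ToyFilter:
    """Stand-in for a real filter: calls a message spam only after it
    has been registered `needed` times."""

    def __init__(self, needed: int) -> None:
        self.needed = needed
        self.seen = 0

    def classify(self, message: bytes) -> str:
        return "spam" if self.seen >= self.needed else "unsure"

    def register(self, message: bytes, label: str) -> None:
        self.seen += 1


toy = ToyFilter(needed=3)
rounds = train_to_exhaustion(b"example message", "spam", toy.classify, toy.register)
print(rounds)
```

With a real filter, classify and register would wrap calls to the
filter itself; the toy class only exists so the loop can be run as is.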

> You could also try out spamitarium...

Dude.  I get over 1,000 spams per day, and I'm filtering them with a
compiled "c" program partly because it keeps my loads well below the
radar of my ISP.  I am NOT going to add a perl script to the mix,
especially when it loads a new copy of the interpreter for each and
every incoming message.

On Tue, Jun 22, 2004 at 01:07:26PM -0700, Chris Fortune wrote:
> The answer is to collect good email from people's PC's, your friends
> and family will let you do it.  Copy everything in their Sent box
> (under 35kb in size, attachments are useless to you!) to a zip file
> and upload it to your server.  (Make sure they aren't sending spam
> themselves.)

Dubious.  Most of my friends and family have VERY different interests.
Their ham doesn't look anything like my ham.

On Tue, Jun 22, 2004 at 05:10:25PM -0400, Tom Allison wrote:
> I expect that you will find it becoming very good at detecting spam
> to the point where you will start finding most of your unsures are
> actually ham.

No, currently I'm getting several hundred false negatives for each
false positive.

> When I started running bogofilter, it was kind of "dumb" until I had 
> about 100 emails in each category.

Me, too.  But that was a year ago, when I was getting about 100 spams
a day, not 1000.  Spammers have drastically increased their volume
over the past year.

> The other thing you can do to improve your performance, even without 
> bogotune, is to start checking to see what kind of scores you are 
> getting in your unsure and modify the cutoffs to approach those scores.

I've been doing that.  Most of my "unsure" spam is still scoring very
near 0.5.

> I left the 0.5 as unsure because I have some weird relatives who
> send me whacked out stuff sometimes.

I work for an ISP, and my customers send me whacked-out stuff, too.

On Tue, Jun 22, 2004 at 05:24:14PM -0400, David Relson wrote:
> Have you looked at the scores of the two sets of messages?

(checking...)

Errors re-registered as  Ham scored from 0.00 to 0.506930
Errors re-registered as Spam scored from 0.00 to 0.799631

(my cutoffs are currently 0.05 and 0.80)

So yeah, I could probably lower my spam_cutoff to about 0.65 or so...
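
To make the effect of that change concrete, here is a small sketch of
three-way classification against the two cutoffs.  The comparison
directions (>= for spam, <= for ham) are an assumption about how the
cutoffs are applied, and the function name is hypothetical; the scores
are the ones reported above.

```python
def classify(score: float, ham_cutoff: float, spam_cutoff: float) -> str:
    """Three-way classification; the >= and <= comparisons are assumed."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


HAM_CUTOFF = 0.05
highest_spam_error = 0.799631  # top score among errors re-registered as spam
highest_ham_error = 0.506930   # top score among errors re-registered as ham

# With the current spam cutoff of 0.80 versus the proposed 0.65.
current = classify(highest_spam_error, HAM_CUTOFF, 0.80)
proposed = classify(highest_spam_error, HAM_CUTOFF, 0.65)
ham_check = classify(highest_ham_error, HAM_CUTOFF, 0.65)
near_half = classify(0.5, HAM_CUTOFF, 0.65)
print(current, proposed, ham_check, near_half)
```

Under these assumptions, a spam_cutoff of 0.65 still sits above the
highest-scoring ham error (0.506930), so that message would not become
a false positive.  Spam scoring near 0.5 would remain unsure either way.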

> Also, are you using the Unsures to train bogofilter so that it can do a
> better job in the future?  This is known as "train on error" and should
> be an ongoing part of using any bayesian spam filter.

Any spams that arrive in my inbox, I bounce to an address which forwards to:

	"|bogofilter -s"

Hams that are marked as unsure, I bounce to a different address:

	"|bogofilter -n"
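
The same train-on-error step could also be driven from a script rather
than from mail aliases.  The sketch below assumes bogofilter is on the
PATH and reads the message on standard input, as it does in the pipes
above; the helper names are hypothetical.

```python
import subprocess
from typing import List

# Flags taken from the aliases above: -s registers spam, -n registers non-spam.
FLAGS = {"spam": "-s", "ham": "-n"}


def registration_command(label: str) -> List[str]:
    """Build the bogofilter command line for registering one message."""
    return ["bogofilter", FLAGS[label]]


def register(message: bytes, label: str) -> None:
    """Pipe a raw message to bogofilter; raises if bogofilter exits non-zero."""
    subprocess.run(registration_command(label), input=message, check=True)


print(registration_command("spam"))
print(registration_command("ham"))
```

Calling register(raw_message, "spam") would then have the same effect
as bouncing the message to the first address.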

> If you've just done an initial training, your wordlist may be too small to
> fully distinguish ham from spam and that may be the reason you have so
> many unsures.

Haven't done *any* initial training.  Just training on error.  Like I
said, I had an unfortunate accident which wiped out my email spool
(and my carefully trained bogofilter database) and I'm having to start
over from scratch.

--
Bob Vincent



