How to deal with extremely high spam levels
Bob Vincent
bogofilter at bobvincent.org
Wed Jun 23 00:01:00 CEST 2004
On Tue, Jun 22, 2004 at 02:48:21PM -0400, Jake Di Toro wrote:
> I have a similar situation. But I started with a corpus of 500 each
> that I could train on.
Unfortunately, I blew out my entire email archive by mistake, which is
why I'm in this fix.
> Like you, I filter Tri-State. But it is
> unclear to me if you are using -u or not.
I am not.
On Tue, Jun 22, 2004 at 03:36:46PM -0400, Tom Anderson wrote:
> I recommend you try out bfproxy
> (http://orderamidchaos.com/bogofilter/bfproxy) for easily handling
> registrations. It works via a procmail recipe (or similar MDA setup) to
> perform bogofilter functions via email. You simply drag incorrect
> classifications to a folder in your mail client,
I'm using mutt as my mail client. I use the bounce (b) command to
send unsures to the proper address to register them as spam or non-spam.
> Bfproxy will also do recursive registrations, so that if after
> registering an email it still doesn't classify correctly, it will
> register it again until it does. This helps polarize your wordlist
> to your email quicker. Some will argue that could theoretically
> cause harm, but in practice it works wonderfully.
This is called "training to exhaustion". No, I haven't been doing
that; I suspect that my "ham" corpus isn't large enough to make it
effective as of yet.
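
For anyone unfamiliar with the term, here is a minimal Python sketch of
what "training to exhaustion" means. It is not bfproxy's code. The
ToyFilter class is a made-up stand-in that just counts tokens, and
max_rounds is a safety limit I am assuming so the loop always ends.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter: counts tokens per class."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def register(self, message, is_spam):
        # Add every token of the message to the chosen class.
        (self.spam if is_spam else self.ham).update(message.split())

    def classify(self, message):
        # Call it spam when its tokens were seen more often in spam.
        tokens = message.split()
        spam_hits = sum(self.spam[t] for t in tokens)
        ham_hits = sum(self.ham[t] for t in tokens)
        return spam_hits > ham_hits


def train_to_exhaustion(message, is_spam, classify, register, max_rounds=10):
    """Re-register `message` until `classify` agrees with `is_spam`.

    Returns the number of registrations performed.
    """
    rounds = 0
    while classify(message) != is_spam and rounds < max_rounds:
        register(message, is_spam)
        rounds += 1
    return rounds


toy = ToyFilter()
for _ in range(3):
    toy.register("cheap pills now", False)  # wrongly learned as ham
rounds = train_to_exhaustion("cheap pills now", True, toy.classify, toy.register)
```

The more ham evidence already sits in the wordlist, the more repeat
registrations one message needs before it flips.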
> You could also try out spamitarium...
Dude. I get over 1,000 spams per day, and I'm filtering them with a
compiled C program partly because it keeps my load well below the
radar of my ISP. I am NOT going to add a Perl script to the mix,
especially when it loads a new copy of the interpreter for each and
every incoming message.
On Tue, Jun 22, 2004 at 01:07:26PM -0700, Chris Fortune wrote:
> The answer is to collect good email from people's PCs; your friends
> and family will let you do it. Copy everything in their Sent box
> (under 35kb in size, attachments are useless to you!) to a zip file
> and upload it to your server. (Make sure they aren't sending spam
> themselves.)
Dubious. Most of my friends and family have VERY different interests.
Their ham doesn't look anything like my ham.
On Tue, Jun 22, 2004 at 05:10:25PM -0400, Tom Allison wrote:
> I expect that you will find it becoming very good at detecting spam
> to the point where you will start finding most of your unsures are
> actually ham.
No, currently I'm getting several hundred false negatives for each
false positive.
> When I started running bogofilter, it was kind of "dumb" until I had
> about 100 emails in each category.
Me, too. But that was a year ago, when I was getting about 100 spams
a day, not 1000. Spammers have drastically increased their volume
over the past year.
> The other thing you can do to improve your performance, even without
> bogotune, is to start checking to see what kind of scores you are
> getting in your unsure and modify the cutoffs to approach those scores.
I've been doing that. Most of my "unsure" spam is still scoring very
near 0.5.
> I left the 0.5 as unsure because I have some weird relatives who
> send me whacked out stuff sometimes.
I work for an ISP, and my customers send me whacked-out stuff, too.
On Tue, Jun 22, 2004 at 05:24:14PM -0400, David Relson wrote:
> Have you looked at the scores of the two sets of messages?
(checking...)
Errors re-registered as Ham scored from 0.00 to 0.506930
Errors re-registered as Spam scored from 0.00 to 0.799631
(my cutoffs are currently 0.05 and 0.80)
So yeah, I could probably lower my spam_cutoff to about 0.65 or so...
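
To make the arithmetic concrete, here is a rough Python sketch of the
tri-state decision and of one possible way to pick a new spam_cutoff.
The comparison rules (>= for spam, <= for ham) are my assumption about
how the cutoffs are applied, and the midpoint rule is only a heuristic
that happens to land near 0.65. It is not something bogofilter or
bogotune does.

```python
def tri_state(score, ham_cutoff=0.05, spam_cutoff=0.80):
    """Classify a score as ham, spam, or unsure (assumed comparison rules)."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def midpoint_cutoff(max_ham_score, max_missed_spam_score):
    """Heuristic: halfway between the highest-scoring re-registered ham
    and the highest-scoring re-registered spam."""
    return (max_ham_score + max_missed_spam_score) / 2


# Figures quoted above: ham errors peak at 0.506930, spam errors at 0.799631.
suggested = midpoint_cutoff(0.506930, 0.799631)

before = tri_state(0.799631)                        # current cutoff of 0.80
after = tri_state(0.799631, spam_cutoff=suggested)  # lowered cutoff
ham_check = tri_state(0.506930, spam_cutoff=suggested)
```

With the lowered cutoff the top-scoring missed spam is caught, while
the top-scoring ham still lands in unsure rather than in spam.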
> Also, are you using the Unsures to train bogofilter so that it can do a
> better job in the future? This is known as "train on error" and should
> be an ongoing part of using any bayesian spam filter.
Any spams that arrive in my inbox, I bounce to an address which forwards to:
"|bogofilter -s"
Hams that are marked as unsure, I bounce to a different address:
"|bogofilter -n"
> If you've just done an initial training, your wordlist may be too small
> to fully distinguish ham from spam and that may be the reason you have
> so many unsures.
Haven't done *any* initial training. Just training on error. Like I
said, I had an unfortunate accident which wiped out my email spool
(and my carefully trained bogofilter database) and I'm having to start
over from scratch.
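
The same two registrations (bogofilter -s for missed spam, bogofilter -n
for ham that was misfiled) could also be scripted for offline use, away
from the delivery path. A minimal Python sketch follows. It assumes
bogofilter is on the PATH and reads the message from standard input, as
it does in the aliases above. The function names and the `run`
parameter are hypothetical, added here so the call can be checked
without touching a real wordlist.

```python
import subprocess


def registration_command(is_spam):
    """Build the command line: -s registers spam, -n registers ham."""
    return ["bogofilter", "-s" if is_spam else "-n"]


def register(raw_message, is_spam, run=subprocess.run):
    """Feed one raw message (bytes) to bogofilter on standard input.

    `run` defaults to subprocess.run; pass a stand-in to do a dry run.
    """
    return run(registration_command(is_spam), input=raw_message, check=True)


# Dry run with a recording stand-in instead of the real program.
calls = []


def fake_run(cmd, input=None, check=False):
    calls.append((cmd, input))


register(b"Subject: cheap pills\n\nbuy now\n", True, run=fake_run)
register(b"Subject: lunch?\n\nnoon works\n", False, run=fake_run)
```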
--
Bob Vincent