What is spam?
Boris 'pi' Piwinger
3.14 at piology.org
Tue May 11 10:55:37 EDT 2004
"Tom Anderson" <tanderso at oac-design.com> wrote:
>> luckily, it works, and my error rate is in the magnitude of
>> one in a thousand.
>Good for you. I wish I could get that. I still get a virus spam every day or two, sometimes several in a day, usually classified as unsure.
As I said before in other discussions: IMHO unsures are
useful for training, but not for production, since every
unsure is an error (it needs correction).
>Since I've started using ASNs, it's been getting better.
Maybe you remember my tests, which showed that using IP
numbers did not help me. That may well be different for you.
ASNs could be even more useful, but they are expensive.
>It's hard for that one token to push it up over my cutoff though,
Clearly, with training to exhaustion, this is not an issue.
My database holds 768 spam and 353 ham messages and classifies
more than 27,000 spam and 17,000 ham messages correctly (and
outside a security margin, too!).
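
For readers unfamiliar with the term, here is a minimal sketch of
training to exhaustion: keep passing over the corpus and train only
on messages that are still misclassified, until a full pass produces
no errors. The ToyFilter class, its averaging score, and the cutoff
values are hypothetical stand-ins chosen for illustration; this is
not bogofilter's scoring or its interface. The "security margin" is
modelled by using stricter cutoffs during training than one would use
in production.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter (not bogofilter's algorithm)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Average per-token spam ratio; unseen tokens count as neutral (0.5).
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append(s / (s + h) if s + h else 0.5)
        return sum(probs) / len(probs) if probs else 0.5


def train_to_exhaustion(filt, corpus, spam_cutoff=0.6, ham_cutoff=0.4,
                        max_rounds=50):
    """corpus is a list of (tokens, is_spam); returns how many were trained."""
    trained = 0
    for _ in range(max_rounds):
        errors = 0
        for tokens, is_spam in corpus:
            score = filt.score(tokens)
            correct = score >= spam_cutoff if is_spam else score <= ham_cutoff
            if not correct:
                # Train only on messages the filter currently gets wrong.
                filt.train(tokens, is_spam)
                trained += 1
                errors += 1
        if errors == 0:
            break  # a full pass without errors: exhausted
    return trained


corpus = [
    (["viagra", "cheap"], True),
    (["meeting", "tomorrow"], False),
    (["document", "attached", "viagra"], True),
    (["document", "attached", "meeting"], False),
]
filt = ToyFilter()
trained = train_to_exhaustion(filt, corpus)
```

Because only the misclassified messages are trained, the database
stays much smaller than the corpus it classifies correctly, which is
the effect described above.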
>As you can see, some tokens such as "document" and "attached" are hammy, however I doubt I've ever received a ham that said "Your document is attached." And yet, some variation of this (ie "Your file is attached", etc.) is seen in these virus spams all the time. With a Markovian filter, the 3-4 token phrase would be exponentially more relevant than the individual tokens.
Yes, those chains are very promising. Pairs alone are not,
according to my recent test (to my surprise).
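
To make the idea of token chains concrete, here is a small sketch
that emits every run of consecutive words up to a given length as an
additional token, so that a phrase can be scored as one unit. The
function name chain_tokens and the max_len parameter are made up for
this example. Real Markovian filters also weight longer phrases more
heavily and may skip words within the window; both are left out here.

```python
def chain_tokens(words, max_len=3):
    """Return single words plus every run of 2..max_len consecutive words."""
    tokens = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length <= len(words):
                tokens.append(" ".join(words[start:start + length]))
    return tokens


tokens = chain_tokens("your document is attached".split(), max_len=3)
```

With max_len=2 this yields only single words and pairs, which is the
case that did not help in the test mentioned above.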
>Also of note, even though I've stripped out the non-standard headers with spamitarium, it's still largely the "administrative" tokens which make this email seem hammy.
I prefer not to touch anything there. Any information is
>Dates are especially frustrating... I wish bogofilter would ignore them.
Here I agree. Taking them from the References could be
useful. They don't do much harm though.
>Removing "X-Priority", "X-MSMail-Priority", "ESMTP", etc., has helped a bit.
Funny, I would assume they are useful.